iterative / dvc

πŸ¦‰ Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.72k stars 1.18k forks source link

Get rid of .dvc files? #4278

Closed jorgeorpinel closed 4 years ago

jorgeorpinel commented 4 years ago

TLDR: Skip rant and go to the last bullet and code blocks.

If we're planning a non-backward compatible change in the near future, maybe it would be a good time to consider completely moving from .dvc files to dvc.yaml/dvc.lock. Or implement and leave the old .dvc file approach as optional for backward compatibility (for some time at least).

Some context in https://github.com/iterative/dvc.org/pull/1384#pullrequestreview-454607407

My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications too? Leaving that open to discussion (cc @iterative/engineering).

In DVC 1.x we created the pipelines file dvc.yaml which contains all the stages. From that point on .dvc files stopped being "stage files" and they only remain as placeholders for data files. They're no longer considered any kind of stage (we even removed the terms "stage file" and "orphan stage" from docs already). This has caused headaches in docs when explaining commands that use or affect both .dvc AND dvc.yaml or .lock files such as status, checkout, repro (pretty important ones), because we constantly need to mention both "stages and .dvc files" or "dvc.yaml and .dvc file", etc.

The options I see here are:

dvc.yaml

data:
- corpus.csv
- dataset/

stages:
  cleanup:
    cmd: python clean.py corpus.csv df.h5
    deps:
    - corpus.csv
    outs:
    - df.h5
  ...

dvc.lock

outs:
- md5: 6137cde4893c59f76f005a8123d8e8e6
  path: df.h5
- md5: cde4876f0r5137c59f8e6a8423d8e936.dir
  path: dataset/

cleanup:
  cmd: python clean.py corpus.csv df.h5
  deps:
  - path: corpus.csv
    md5: 6137cde4893c59f76f005a8123d8e8e6
  outs:
  - path: df.h5
    md5: f40e3db3e1aa25562945045864a28deb
  ...

Something like that.

An additional advantage of this would be that it reduces the possible confusions between .dvc/ (internal dir) and .dvc (files).

skshetry commented 4 years ago

I have already suggested this on https://github.com/iterative/dvc/issues/3936 with a clear deprecation plan, with the same motivation as yours (simplifying ui/ux and docs).

skshetry commented 4 years ago

@jorgeorpinel, could also say dvc.yaml is a workflow, and .dvc is a DVC file?

jorgeorpinel commented 4 years ago

Ah yes, could be seen as a duplicate of that one Saugat indeed, thanks for pointing it out.

could also say dvc.yaml is a workflow, and .dvc is a DVC file?

We could but the point is I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files... If there was a term that can encompass both (only idea so far is "DVC files") that would be ideal for now. Something like "internal files" or something...

shcheklein commented 4 years ago

My thoughts on this. No matter how do I feel about two types of files and the pain dealing with them, I would say that "I'm trying to avoid having to mention 2 things every time we're explaining the commands that affect both dvc.* and .dvc files" should not be a strong argument when we make DVC core decisions. Usability, use case, user workflow (including things like merge, working with data only) - should come first.

If for some reason it's better to have explicit .dvc when people deal with data - let's keep it. And find some elegant way to describe them in docs.

jorgeorpinel commented 4 years ago

I agree. I don't think that simplifying docs writing itself is a reason at all. But it tends to signal that something conceptually may need more work i.e. the workflow can be simplified for users.

If for some reason it's better to have explicit .dvc when people deal with data - let's keep it.

Great question. I guess it's a well known strategy e.g. used by Git-LFS and others (replace large files with placeholders) and it makes the data somewhat visible when you do ls even before dvc pull or similar.

On the other hand that's what we have dvc list . for too. And since we already went for the dvc.yaml file we could totally commit to that strategy and centralize everything in that almost single metafile to codify the entire DS project.

jorgeorpinel commented 4 years ago

Reopening for now as discussion seems active and more specific here.

antonkulaga commented 4 years ago

I am in favour of having everything in dvc yaml and dvc lock, otherwise you have a state where you have some files tracked by dvc.lock/yaml system and some by .dvc files. Also, .dvc files create a lot of headache for the newcomers to which I share my ML workflow as they often con

bobertlo commented 4 years ago

I personally like having the .dvc files. It makes sense in my head for the workflow. I wish there was like a -g flag in dvc add to automatically add them to the pending commit.

I track a lot of different datasets and I think git mechanics when reverting/merging should be considered.

jorgeorpinel commented 4 years ago

Good points. But currently we have a strange mix of placeholders (.dvc files) for basic tracking of *some* data (as Anton mentioned), with config files (dvc.yaml) that describe a workflow, AND lockfiles to track the *other* data (dvc.lock).

I think we should either go all in for dvc.yaml/lock, or create a new lockfile for .dvc files (not storing file hashes in them). Perhaps the same dvc.lock file could contain everything β€” but at that point why not get rid of .dvc files altogether? Alternatively, getting rid of dvc.lock and creating .dvc files for all data specified in dvc.yaml would also be more consistent.

I wish there was like a -g flag in dvc add to automatically add them to the pending commit.

Good idea, you can create a feature request @bobertlo πŸ™‚

git mechanics when reverting/merging should be considered.

dvc.yaml files are easy to merge. Probably easier than .dvc files right now. Lock files can always be just regenerated. p.s. Ruslan is writing a guide about that, actually, check it out/ leave feedback if you'd like: https://github.com/iterative/dvc.org/pull/1684

casperdcl commented 4 years ago

about the -g, --git flag: seems duplicate of #3446 and #4330. I recall having a lot of discussion about adding this flag to some commands (add, diff, push, ...) but not sure if there was any conclusion.

bobertlo commented 4 years ago

@jorgeorpinel I hate to admit but I think I did musunderstand the relationship between all these files slightly. I must admit that moving a single hash from the .dvc file to the dvc.yaml file would not cause much, if any, distress for users based on git mechanics.

@casperdcl I opened one of those issues after mentioning it here. I think there is another older issue with this mentioned that I saw later but it appeared to have been closed for lack of interest/feedback iirc?

casperdcl commented 4 years ago

I opened one of those issues after mentioning it here. I think there is another older issue with this mentioned that I saw later but it appeared to have been closed for lack of interest/feedback iirc?

yes it's fine to open a new discussion.

karajan1001 commented 4 years ago

In my opinion,

  1. .dvc files are placeholders of data, they are nodes of the computational graph.
  2. While stages represent relationships between data, they are edges of the computational graph.
  3. Both of the nodes and edges are stable in most of my use cases, The only changing thing is the value of the nodes.

Two solutions:

  1. Deprecate dvc.lock, nodes info only stored in .dvc files.
  2. In dvc.yaml, values of deps or outs changed to .dvc nodes.
    deps:
    - corpus.csv
    outs:
    - df.h5

    --->

    deps:
    - corpus.csv.dvc
    outs:
    - df.h5.dvc

    or

  3. Deprecate .dvc files, nodes info all stored in dvc.lock.
  4. In dvc.yaml, values of deps or outs must to be values in dvc.lock.
jorgeorpinel commented 4 years ago

Hmmm interesting analysis @karajan1001. I think the intention is that stages the nodes in DAGs that represent data pipelines though. They're connected by their outputs and dependencies, so that relationship would be the edges.

In any case, the problem persists that dependencies and outputs can be defined in dvc.yaml + dvc.lock, or fully in .dvc files, which is kind of inconsistent, and your solutions do address this πŸ‘

But your first solution idea would basically spread dvc.yaml+.lock into dvc.yaml+multiple lockfiles which seems also somewhat inconsistent.

So I still vote for the original suggestion (your 2nd solution above): let's get rid of .dvc files πŸ˜ƒ

  1. In dvc.yaml, values of deps or outs must to be values in dvc.lock.

p.s. This is already the case πŸ™‚

shcheklein commented 4 years ago

In any case, the problem persists that dependencies and outputs can be defined in dvc.yaml + dvc.lock, or fully in .dvc files

It's not exactly correct though to my mind. .dvc should not be defining outputs and dependencies. They define files and directories.

jorgeorpinel commented 4 years ago

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

shcheklein commented 4 years ago

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

Yep, I guess, that's the problem of the terminology and docs.

karajan1001 commented 4 years ago

Well, we call them outputs and dependencies though, e.g. external outputs (dvc add --external) and external dependencies (dvc import).

They are placeholders or nodes, and a node can be at both sides of a directed edge, (dependencies or outputs)

jorgeorpinel commented 4 years ago

.dvc should not be defining outputs and dependencies. They define files and directories. They are placeholders

So even if we revisit the terminology and make it clear that .dvc files are simply placeholders for tracked files/dirs, they still don't cover all tracked files/dirs. Many (often most) are defined in dvc.yaml+lock (as outputs), and again, that's the confusing inconsistency I see which could be easily fixed by adding a raw data section in dvc.yaml/lock instead of using .dvc files.

karajan1001 commented 4 years ago

We have only two kinds of entities, nodes or edges. They used to be all stored in .dvc file mixed. Currently, some of the nodes are stored in .dvc file and some of the nodes are in dvc.lock while all of the edges are in dvc.yaml. So we have to make the nodes storing consistent.

One of the advantages of saving them in a .dvc files is that .dvc files stay at the path where the original data stay. It helps to keep the whole project structure (without these placeholders some path might be empty and can't add to Git) and is more convenient for us to image the structure and explore the paths without downloading all of the data or in a Github or Gitlab website.

shcheklein commented 4 years ago

Many (often most) are defined in dvc.yaml+lock (outputs), and again, that's the confusing inconsistency.

Yes, we made decision somewhat deliberately - kinda trading engineering elegancy (keep everything in the dvc.lock or keep all files/dirs in .dvc) to pipelines functionality simplicity. We decided to keep .dvc still to keep it simple and explicit for users that do not use/need pipelines.

There is a clear difference though between those files/directories - those involved into pipelines are outputs.

jorgeorpinel commented 4 years ago

OK. And doesn't a data: section in dvc.yaml achieve the same goal? I.e. do you oppose the proposed change, @shcheklein?

If it achieves the same result, and improves the consistency, making DVC less confusing and easier to understand, I see that as an improvement to be considered.

The one counter that has been expressed is about folder structure but again, we already don't reflect the original data structure when it comes to outputs, which are concentrated in dvc.yaml files that can be anywhere πŸ™‚

shcheklein commented 4 years ago

OK. And doesn't a data: section in dvc.yaml achieve the same goal? I.e. do you oppose the proposed change, @shcheklein?

It is a possible solution, but it complicates the data management (non-pipeline) usage. Imagine a simple case: a single data.xml file and you don't care about pipelines whatsoever. Now you have a single explicit placeholder data.xml.dvc right in the place as you would have data.xml as @karajan1001 mentioned.

If we decide to go with dvc.yaml - we'll have two files instead. Also, since we allow multiple dvc.yamls (one per directory), which one should we modify on dvc add data.xml? It becomes less explicit. It creates a bigger surface for merge conflicts when you deal with multiple independent artifacts.

There are some benefits to this approach as well (e.g. consistency), not saying that it doesn't have any ground to it. My point, probably, is that we should be doing this change for some strong reason if we decide to do this.

jorgeorpinel commented 4 years ago

it complicates the data management (non-pipeline) usage. Imagine a simple case

Agree. But we wouldn't deprecate .dvc files, they would be left as optional mechanism both for simple scenarios and for backwards compatibility.

πŸ’‘ We could even create a command to move data from .dvc files to dvc.yaml/lock (and vice versa?) β€” which could double as an auto-migration command for 0.x users (I believe there's even a feature request for that somewhere).

since we allow multiple dvc.yamls (one per directory), which one should we modify on dvc add data.xml?

Good Q. Either the one in the cwd, or the one in the project's root would seem like intuitive default behavior to me. An option could modify this.

we should be doing this change for some strong reason

Sure. That's pretty subjective to me and you and/or a team consensus are much more able to judge on this. Feel free to review my rant in the description of this issue as well as Saugat's #3936. Also, I'm not suggesting we go for this immediately, just something to seriously consider and possibly bring to the attention of the team.

jorgeorpinel commented 4 years ago

Alright well, since the discussion here seems to have come to some conclusions, I'll actually close this now. I added a summary about it in https://github.com/iterative/dvc/issues/3936#issuecomment-692470880. Thanks!