iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

config: add option to enable --no-commit behavior by default #1627

Closed efiop closed 3 years ago

efiop commented 5 years ago

https://github.com/iterative/dvc/issues/919#issuecomment-464754788

dmpetrov commented 5 years ago

This is a simple feature which leads to quite a big change in the API. We should think carefully about whether it needs to be implemented.

It is okay for a tool to be opinionated when it comes down to basic user behavior. With this kind of feature, we can no longer be sure what dvc run ... actually does without checking the config. This option adds more complexity to user behavior, and I don't think we should introduce additional complexity until we must (as in the 1.0 release).

efiop commented 5 years ago

@dmpetrov Great point! Agreed, especially since 1.0 is not that far away and it might just be worth releasing it instead of working on a special config option.

AlJohri commented 5 years ago

I'd also be interested in having a config option (if the maintenance burden isn't that high). It would be great to have it manually enabled by the user now and then enabled by default at 1.0.

Given that it's quite a big change in the API, it might be worth the slower ramp rather than the hard cut-off.

I don't have a pressing use case, I just found myself typing --no-commit very often.

kevin-hanselman commented 4 years ago

I enthusiastically support this option, and honestly, I support making --no-commit the default. The fact that DVC deviates from Git by auto-committing is quite surprising to me and the other devs on my team. In my experience adopting DVC thus far, making DVC map more closely to Git would dramatically improve its intuitiveness and therefore ease of use.

ghost commented 4 years ago

It is hard to discuss whether it makes sense to commit data files automatically, or how that deviates from Git's behavior, without assuming some background knowledge of how both tools work.

I'll try to put together a comparison of how DVC and Git treat commits (this is off the top of my head, so please do your regular fact & terminology check on this one):

Let's start by describing what happens when you add something on Git and DVC.

Adding a file with Git results in storing that file in a content addressable storage located in .git/objects:

$ git init

Initialized empty Git repository in /home/mroutis/tmp/.git/

$ echo "text" > something
$ git add something
$ tree .git/objects

.git/objects
├── 8e
│  └── 27be7d6154a1f68ea9160ef0e18691d20560dc
├── info
└── pack

$ git cat-file blob 8e27be7d6154a1f68ea9160ef0e18691d20560dc

text
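To make "content addressable" concrete, here is a minimal Python sketch of how Git derives a blob's object name in a default SHA-1 repository: it hashes a short header followed by the file's bytes. The helper names git_blob_id and object_path are made up for this illustration and are not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git gives a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(object_id: str) -> str:
    """Loose objects are stored as .git/objects/<first 2 hex chars>/<rest>."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# `echo "text"` writes the word followed by a newline.
blob_id = git_blob_id(b"text\n")
print(object_path(blob_id))
```

If the sketch is right, the printed path should correspond to the 8e/27be... entry in the tree listing above. The on-disk file is additionally zlib-compressed, which is why git cat-file is used to read it instead of cat.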

In this storage (.git/objects), Git holds several types of objects: blobs, trees, commits, and tags.

Now, when you add something with DVC, it also stores that file in a content addressable storage, located in .dvc/cache:

$ dvc init --no-scm
$ echo "text" > something
$ dvc add something
$ tree .dvc/cache

.dvc/cache
└── e1
   └── cbb0c3879af8347246f12c559a86b5

$ cat .dvc/cache/e1/cbb0c3879af8347246f12c559a86b5

text

This storage holds just one type of object: the raw file. The dvc add operation could be summarized like this:

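file="something"
digest=$(md5sum "${file}" | cut -d' ' -f1)  # keep only the hash, drop the file name
cache_path=".dvc/cache/${digest:0:2}/${digest:2}"
mkdir -p "$(dirname "${cache_path}")"
mv "${file}" "${cache_path}"
ln "${cache_path}" "${file}"  # or `cp --reflink`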

Git and DVC do a bit more than just put objects in their content addressable storage (CAS): they also update their index. This is important because the index contains the information needed to re-create the tree / working space at any given time.

Git keeps track of those files added to the CAS using the file located at .git/index, and you can query the content with the git ls-files --stage command.

beware, dragons ahead :dragon:

DVC doesn't have an index file per se; it is distributed among all the dvcfiles :sweat_smile: . So, an index could be viewed as a list of path names and hash digests that maps the storage to files in the working directory.

dvcfiles have the following structure:

outs:
  - md5: e1cbb0c3879af8347246f12c559a86b5
    path: something

The collection of all the dvcfiles in the working directory is the index.
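As a rough illustration of that idea, the Python sketch below merges the outs of several already-parsed dvcfiles into one path-to-digest mapping. The dictionaries stand in for parsed YAML, and build_index and cache_path are hypothetical names for this example, not DVC functions.

```python
from typing import Dict, Iterable


def build_index(dvcfiles: Iterable[dict]) -> Dict[str, str]:
    """Merge the `outs` of every dvcfile into one path -> md5 mapping."""
    index: Dict[str, str] = {}
    for dvcfile in dvcfiles:
        for out in dvcfile.get("outs", []):
            index[out["path"]] = out["md5"]
    return index


def cache_path(md5: str) -> str:
    """Location of an object in the cache layout shown earlier in this comment."""
    return f".dvc/cache/{md5[:2]}/{md5[2:]}"


# Stand-in for the parsed content of something.dvc.
something_dvc = {
    "outs": [{"md5": "e1cbb0c3879af8347246f12c559a86b5", "path": "something"}]
}

index = build_index([something_dvc])
for path, md5 in index.items():
    print(path, "->", cache_path(md5))
```

With this mapping, re-creating the workspace amounts to linking or copying each cache object back to its path.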

Then, what happens when you commit?

When you do git commit, you create two objects in its storage, a tree (that is like a snapshot of the current working directory) and a commit one (holding information like "when did it happen", "who's the author" and stuff like that).

This allows you to keep track of the changes and separate them with some context information (usually described in the commit message).

DVC commit is a different story. There are no commit or tree objects, just plain raw files in its store. When you run dvc commit, it updates the index with the current state of the files. Let's see it in action in the following example:

$ dvc init --no-scm
$ echo "text" > something
$ dvc add something
$ echo "lorem ipsum" > something
$ dvc commit
$ tree .dvc/cache

.dvc/cache
├── 3b
│  └── c34a45d26784b5bea8529db533ae84
└── e1
   └── cbb0c3879af8347246f12c559a86b5

$ cat something.dvc

outs:
- md5: 3bc34a45d26784b5bea8529db533ae84
  path: something

Instead of pointing to the e1cbb... file (the one with text as content), now it is pointing to 3bc34... (which has lorem ipsum as its content).

This makes dvc commit more similar to git add.
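The behavior in the example above can be mimicked with a toy in-memory model. This is only a sketch of the idea, not DVC's implementation: ToyStore and its methods are invented for illustration, with one dictionary playing the role of .dvc/cache and another the role of the dvcfiles.

```python
import hashlib


class ToyStore:
    """In-memory stand-in for .dvc/cache plus one dvcfile per tracked path."""

    def __init__(self) -> None:
        self.cache = {}     # md5 digest -> file content
        self.dvcfiles = {}  # path -> md5 digest recorded in "<path>.dvc"

    def _save(self, path: str, content: bytes) -> str:
        digest = hashlib.md5(content).hexdigest()
        self.cache[digest] = content   # older objects are never removed here
        self.dvcfiles[path] = digest   # the pointer moves to the new object
        return digest

    def add(self, path: str, content: bytes) -> str:
        """Start tracking a path: store its content and record its digest."""
        return self._save(path, content)

    def commit(self, path: str, content: bytes) -> str:
        """Re-record the current content of an already tracked path."""
        if path not in self.dvcfiles:
            raise KeyError(f"{path} is not tracked yet")
        return self._save(path, content)


store = ToyStore()
old_digest = store.add("something", b"text\n")
new_digest = store.commit("something", b"lorem ipsum\n")
print(len(store.cache), store.dvcfiles["something"] == new_digest)
```

Both objects stay in the cache, and only the recorded digest changes, which is the same shape as staging new content with git add.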

Why couldn't we use dvc add instead? In this case, you could do that, indeed. But what happens when the dvcfile was generated by a run command?

For example, imagine that you have the following Python script under spam.py:

def ham():
    print("Spam, bacon, sausage and Spam")

ham()

Then you generate an output with such script:

$ dvc run \
  --deps spam.py \
  --outs menu.txt \
  "python spam.py > menu.txt"
$ cat menu.txt.dvc

cmd: python spam.py > menu.txt
deps:
  - md5: fa18e7b5391f72e101a1512e9e890005
    path: spam.py
outs:
  - md5: c2758248a9df27757b3b710162c5a0af
    path: menu.txt

If we update menu.txt manually and try to dvc add it to update the index, we would lose track of how it was generated (i.e. the cmd and the deps). And what if you change a dependency but are sure that the change won't affect the output, like adding the missing docstring or comment that you always forget:

cat <<CODE > spam.py
def ham():
    """Yes. This is a Monty Python reference."""
    print("Spam, bacon, sausage and Spam")

ham()
CODE

Imagine that spam.py takes hours to compute its final result (as many training scripts do). Would you want to dvc repro menu.txt.dvc just because you added an insignificant docstring? No. Instead, you use dvc commit to update the dvcfile accordingly.
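The trade-off can be sketched in Python with a toy stage. The workspace is a plain dictionary of path to bytes, and md5_of, changed_deps, and commit are hypothetical helpers for this illustration; real DVC stages carry more state than this.

```python
import hashlib


def md5_of(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()


def changed_deps(stage: dict, workspace: dict) -> list:
    """Paths of dependencies whose current checksum differs from the recorded one."""
    return [
        dep["path"]
        for dep in stage["deps"]
        if md5_of(workspace[dep["path"]]) != dep["md5"]
    ]


def commit(stage: dict, workspace: dict) -> None:
    """Re-record checksums of deps and outs WITHOUT running stage['cmd']."""
    for entry in stage["deps"] + stage["outs"]:
        entry["md5"] = md5_of(workspace[entry["path"]])


workspace = {
    "spam.py": b"def ham(): ...\n",
    "menu.txt": b"Spam, bacon, sausage and Spam\n",
}
stage = {
    "cmd": "python spam.py > menu.txt",
    "deps": [{"path": "spam.py", "md5": md5_of(workspace["spam.py"])}],
    "outs": [{"path": "menu.txt", "md5": md5_of(workspace["menu.txt"])}],
}

# A docstring-only edit changes the dependency's checksum...
workspace["spam.py"] = b'def ham():\n    """Docstring."""\n'
print(changed_deps(stage, workspace))  # the stage now looks outdated

# ...and committing accepts the new state while keeping cmd and deps intact.
commit(stage, workspace)
print(changed_deps(stage, workspace))
```

The point of the design is that cmd and the list of deps survive the update, which a plain dvc add of menu.txt would not preserve.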

With that being said, DVC and Git are different. Thanks for coming to my TED talk.

Jokes aside, we would need to come up with a different name for dvc commit or find a good explanation of why it is a good name for it. By the way, in Git's context, index, stage, and cache are the same, so we definitely have naming issues between DVC and Git.

Happy to continue the discussion about naming and comparing functionality, please follow up with questions or tomatoes :tomato:


TL;DR: dvc add works like git add, the concept of a commit is different.

shcheklein commented 4 years ago

> DVC and Git are different.

And the biggest difference (behind many decisions, including this ticket) comes from the amount of data the tools deal with: KBs vs GBs. Thus, for example, the current default behavior quickly becomes a limiting factor when you do a lot of experiments.

> By the way, in Git's context, index, stage, and cache are the same, so we definitely have naming issues between DVC and Git.

I would also first try to analyze user perception of the command names - do they make sense more or less, do users understand and intuitively expect the outcome? Versus trying to completely rely on technical details like internal index structure, etc.

ghost commented 4 years ago

> Thus, for example, the current default behavior quickly becomes a limiting factor when you do a lot of experiments.

If committing is an issue, why not use --outs-no-cache?

I'd even question the advantage of using dvc run without "caching" the results.

shcheklein commented 4 years ago

> If committing is an issue, why not use --outs-no-cache?

Because you do want some results to be saved after all. --outs-no-cache determines whether outputs are saved by DVC, or instead saved with Git or not saved at all.

> I'd even question the advantage of using dvc run without "caching" the results.

Not sure I got what exactly is in doubt here, or whether it was answered before ;)

jorgeorpinel commented 3 years ago

Hi. Is this still meaningful now that we have the run-cache feature? Thanks

efiop commented 3 years ago

Closing as stale.