Closed efiop closed 3 years ago
This is a simple feature which leads to quite a big change in the API. We should think carefully about whether it needs to be implemented.
It is okay for a tool to be opinionated when it comes down to basic user behavior. With this kind of feature, we won't be sure anymore what dvc run ... actually does without checking the config. This command adds more complexity in user behavior and I don't think we should introduce additional complexity until we must (like in the 1.0 release).
@dmpetrov Great point! Agreed, especially since 1.0 is not that far away and it might just be worth releasing it instead of working on a special config option.
I'd also be interested in having a config option (if the maintenance burden isn't that high). It would be great to have it manually enabled by the user now and then enabled by default at 1.0.
Given the fact that it's quite a big change in the API, it might be worth the slower ramp rather than the hard cut-off.
I don't have a pressing use case, I just found myself typing --no-commit very often.
I enthusiastically support this option, and honestly, I support making --no-commit the default. The fact that DVC deviates from Git by auto-committing is quite surprising to me and the other devs on my team. In my experience adopting DVC thus far, making DVC map more closely to Git would dramatically improve its intuitiveness and therefore ease of use.
It is hard to discuss whether it makes sense to commit data files automatically, or how that deviates from Git's behavior, without assuming some background knowledge of how both tools work.
I'll try to put together a comparison of how DVC and Git treat commits (this is off the top of my head, so please do your regular fact & terminology check on this one):
Let's start by describing what happens when you add something in Git and DVC.
Adding a file with Git results in storing that file in a content-addressable storage located in .git/objects:
$ git init
Initialized empty Git repository in /home/mroutis/tmp/.git/
$ echo "text" > something
$ git add something
$ tree .git/objects
.git/objects
├── 8e
│   └── 27be7d6154a1f68ea9160ef0e18691d20560dc
├── info
└── pack
$ git cat-file blob 8e27be7d6154a1f68ea9160ef0e18691d20560dc
text
In this storage (.git/objects), Git holds several types of objects:
commits
blobs
trees
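As a rough illustration of what "content addressable" means for blobs, the sketch below recomputes a blob's object name from its content. It assumes the default SHA-1 object format, where Git hashes a `blob <size>\0` header followed by the raw bytes; `git_blob_id` and `object_path` are hypothetical helper names, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object name Git would give a blob with this content.

    Assumes the SHA-1 object format (Git's default).
    """
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def object_path(object_id: str) -> str:
    """Loose objects are stored as .git/objects/<first 2 hex chars>/<rest>."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    # `echo "text"` writes the word plus a trailing newline.
    blob_id = git_blob_id(b"text\n")
    print(blob_id)
    print(object_path(blob_id))
```

The printed name can be compared against the directory and file name in the tree output above.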
Now, when you add something with DVC, it also stores that file in a content-addressable storage, located in .dvc/cache:
$ dvc init --no-scm
$ echo "text" > something
$ dvc add something
$ tree .dvc/cache
.dvc/cache
└── e1
    └── cbb0c3879af8347246f12c559a86b5
$ cat .dvc/cache/e1/cbb0c3879af8347246f12c559a86b5
text
This storage holds just one type of object: raw files. The dvc add operation could be summarized like this:
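file="something"
# md5sum prints "<digest>  <path>"; keep only the digest
digest=$(md5sum "${file}" | cut -d' ' -f1)
cache_path=".dvc/cache/${digest:0:2}/${digest:2}"
mkdir -p "$(dirname "${cache_path}")"
mv "${file}" "${cache_path}"
ln "${cache_path}" "${file}" # or `cp --reflink`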
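The same steps can be sketched in Python for anyone who wants to experiment outside a real repository. This is only a toy under stated assumptions: `toy_add` is a hypothetical name, it always uses a hard link, and it is not how DVC itself is implemented.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def toy_add(path: Path, cache_dir: Path) -> str:
    """Move `path` into a content-addressed cache and link it back.

    Toy model of the summary above, not DVC's real code.
    """
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    cache_path = cache_dir / digest[:2] / digest[2:]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    if cache_path.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(cache_path))
    os.link(cache_path, path)  # hard link; a reflink or a copy would also do
    return digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        something = Path(tmp) / "something"
        something.write_bytes(b"text\n")
        digest = toy_add(something, Path(tmp) / ".dvc" / "cache")
        print(digest[:2], digest[2:])
```

Because the cache path is derived only from the content, adding two files with identical content stores the data once.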
Git and DVC do a bit more than just putting objects in their content-addressable storage (CAS): they update their index. This is important because the index contains the necessary information to re-create the tree / working space at any given time.
Git keeps track of the files added to the CAS using the file located at .git/index, and you can query its content with the git ls-files --stage command.
beware, dragons ahead :dragon:
DVC doesn't have an index file per se; it is distributed among all the dvcfiles :sweat_smile: . So, an index could be viewed as a list of path names and hash digests that maps the storage to files in the working directory.
dvcfiles have the following structure:
outs:
- md5: e1cbb0c3879af8347246f12c559a86b5
  path: something
The collection of all the dvcfiles in the working directory is the index.
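To make the "distributed index" idea concrete, here is a minimal sketch that merges already-parsed dvcfiles into one path-to-digest mapping. The dictionaries mirror the structure shown above; `build_index` and `cache_path` are hypothetical helpers, and the second entry (data.csv) is made up for illustration.

```python
from typing import Dict, Iterable


def build_index(dvcfiles: Iterable[dict]) -> Dict[str, str]:
    """Merge the `outs` of every dvcfile into one path -> md5 mapping."""
    index: Dict[str, str] = {}
    for dvcfile in dvcfiles:
        for out in dvcfile.get("outs", []):
            index[out["path"]] = out["md5"]
    return index


def cache_path(md5: str) -> str:
    """Where the content for a digest lives, following the layout shown earlier."""
    return f".dvc/cache/{md5[:2]}/{md5[2:]}"


if __name__ == "__main__":
    dvcfiles = [
        {"outs": [{"md5": "e1cbb0c3879af8347246f12c559a86b5", "path": "something"}]},
        # Hypothetical second dvcfile, only here to show the merge.
        {"outs": [{"md5": "d41d8cd98f00b204e9800998ecf8427e", "path": "data.csv"}]},
    ]
    for path, md5 in build_index(dvcfiles).items():
        print(path, "->", cache_path(md5))
```

With such a mapping, re-creating the working directory amounts to linking each cache path back to its recorded path.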
Then, what happens when you commit?
When you do git commit, you create two objects in its storage: a tree (that is like a snapshot of the current working directory) and a commit (holding information like "when did it happen", "who's the author" and stuff like that).
This allows you to keep track of the changes and separate them with some context information (usually described in the commit message).
DVC commit is a different story.
There are no commit or tree objects, just plain raw files in its store.
When you dvc commit, it updates the index with the current state of the files.
Let's see it in action on the following example:
$ dvc init --no-scm
$ echo "text" > something
$ dvc add something
$ echo "lorem ipsum" > something
$ dvc commit
$ tree .dvc/cache
.dvc/cache
├── 3b
│   └── c34a45d26784b5bea8529db533ae84
└── e1
    └── cbb0c3879af8347246f12c559a86b5
$ cat something.dvc
outs:
- md5: 3bc34a45d26784b5bea8529db533ae84
  path: something
Instead of pointing to the e1cbb... file (the one with text as content), now it is pointing to 3bc34... (which has lorem ipsum as its content).
Thus, making dvc commit more similar to git add.
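The behaviour in this example can be modelled in a few lines. The sketch below is a toy under stated assumptions: the workspace and the cache are plain in-memory dictionaries, and `toy_commit` is a hypothetical name, not DVC's implementation. It shows that committing saves the current content and repoints the dvcfile while the old cache entry stays in place.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def toy_commit(dvcfile: dict, workspace: dict, cache: dict) -> dict:
    """Save each output's current content and return an updated dvcfile.

    `workspace` maps path -> bytes, `cache` maps md5 -> bytes (both toys).
    """
    new_outs = []
    for out in dvcfile["outs"]:
        data = workspace[out["path"]]
        digest = md5_hex(data)
        cache.setdefault(digest, data)  # earlier entries are left untouched
        new_outs.append({**out, "md5": digest})
    return {**dvcfile, "outs": new_outs}


if __name__ == "__main__":
    workspace = {"something": b"text\n"}
    cache = {}
    # The first commit plays the role of `dvc add` in this toy model.
    dvcfile = toy_commit({"outs": [{"path": "something", "md5": None}]}, workspace, cache)
    workspace["something"] = b"lorem ipsum\n"
    dvcfile = toy_commit(dvcfile, workspace, cache)
    print(len(cache), "objects in the cache")
    print(dvcfile["outs"][0]["md5"] == md5_hex(b"lorem ipsum\n"))
```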
Why couldn't we use dvc add instead? In this case, you could do that, indeed. But what happens when the dvcfile was generated by a run command?
For example, imagine that you have the following Python script under spam.py:
def ham():
    print("Spam, bacon, sausage and Spam")

ham()
Then you generate an output with that script:
dvc run \
--deps spam.py \
--outs menu.txt \
"python spam.py > menu.txt"
$ cat menu.txt.dvc
cmd: python spam.py > menu.txt
deps:
- md5: fa18e7b5391f72e101a1512e9e890005
  path: spam.py
outs:
- md5: c2758248a9df27757b3b710162c5a0af
  path: menu.txt
If we update menu.txt manually and try to add it to update the index, we would lose track of how it was generated (i.e. the cmd and the deps).
What about changing a dependency when you are sure that the change is not going to affect the output, like adding the missing docstring or comment that you always forget?
cat <<CODE > spam.py
def ham():
    """Yes. This is a Monty Python reference."""
    print("Spam, bacon, sausage and Spam")

ham()
CODE
Imagine that spam.py takes hours to compute its final result (as many training scripts do). Would you want to dvc repro menu.txt.dvc just because you added an insignificant docstring? No. Then you use dvc commit to update the dvcfile accordingly.
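A small model of that situation, again only a sketch with hypothetical names (`changed_entries`, `toy_commit`) and an in-memory workspace: the stage is reported as changed once the dependency is edited, and committing re-records the hashes without executing cmd.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def changed_entries(stage: dict, workspace: dict) -> list:
    """Paths of deps/outs whose current hash differs from the recorded one."""
    entries = stage["deps"] + stage["outs"]
    return [e["path"] for e in entries if md5_hex(workspace[e["path"]]) != e["md5"]]


def toy_commit(stage: dict, workspace: dict) -> dict:
    """Re-record hashes for deps and outs; `cmd` is kept and never executed."""
    def refresh(entries):
        return [{**e, "md5": md5_hex(workspace[e["path"]])} for e in entries]

    return {**stage, "deps": refresh(stage["deps"]), "outs": refresh(stage["outs"])}


if __name__ == "__main__":
    workspace = {"spam.py": b"print('spam')\n", "menu.txt": b"spam\n"}
    stage = {
        "cmd": "python spam.py > menu.txt",
        "deps": [{"path": "spam.py", "md5": md5_hex(workspace["spam.py"])}],
        "outs": [{"path": "menu.txt", "md5": md5_hex(workspace["menu.txt"])}],
    }
    workspace["spam.py"] = b"# a comment\nprint('spam')\n"  # harmless edit
    print(changed_entries(stage, workspace))  # the dependency is reported as changed
    stage = toy_commit(stage, workspace)
    print(changed_entries(stage, workspace))  # nothing left to reproduce
```

The trade-off is that the tool trusts the user's claim that the output is still valid; nothing here verifies it.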
With that being said, DVC and Git are different. Thanks for coming to my TED talk.
Jokes aside, we would need to come up with a different name for dvc commit or find a good explanation of why it is a good name for it.
By the way, in Git's context, index, stage, and cache are the same, so we definitely have naming issues between DVC and Git.
Happy to continue the discussion about naming and comparing functionality, please follow up with questions or tomatoes :tomato:
TL;DR: dvc add works like git add, the concept of a commit is different.
> DVC and Git are different.
And the biggest difference (which drives the design decisions, including this ticket) comes from the amount of data the tools deal with: KBs vs GBs. Thus, for example, the current default behavior quickly becomes a limiting factor when you do a lot of experiments.
> By the way, in Git's context, index, stage, and cache are the same, so we definitely have naming issues between DVC and Git.
I would also first try to analyze user perception of the command names - do they make sense more or less, do users understand and intuitively expect the outcome? Versus trying to completely rely on technical details like internal index structure, etc.
> Thus, for example, the current default behavior quickly becomes a limiting factor when you do a lot of experiments.
If committing is an issue, why not use --outs-no-cache?
I'd even question the advantage of using dvc run without "caching" the results.
> If committing is an issue, why not use --outs-no-cache?
Because you want some results to be saved after all. --outs-no-cache determines whether outputs are saved by DVC, or left to Git (or not saved at all).
> I'd even question the advantage of using dvc run without "caching" the results.
not sure I got what exactly is in doubt here and if it was answered before ;)
Hi. Is this still meaningful now that we have the run-cache feature? Thanks
Closing as stale.
https://github.com/iterative/dvc/issues/919#issuecomment-464754788