Open villasv opened 6 years ago
Hi @villasv !
Great idea! We will be sure to get to implementing it after 0.9.8.
Thanks, Ruslan
Thank you @villasv. The idea is great!
But we should understand that in many cases dvc works in Gb size scale where diff has no meaning. The command might be abbused and many users can come up with a conclusion that dvc diff
is a buggy\slow command.
Let's check file size before dvc diff
and exit with an error if the size if exceeds the limit. The limit should be defined in dvc config file. Default value .... How do you think about 100Mb
?
How do you guys think about introducing the limit?
@dmpetrov I thought about introducing a limit, but there is actually no good reason to do that, since we can simply leave it for user to decide whether he wants to kill dvc diff
if it takes too long or if he wants to wait.
btw... there are no diff
s in HDFS. It is one more thing to consider.
just throwing out some idea - hide the FS specific and SCM\Git specific commands under a special command like dvc whatever diff
I think that performance is a valid concern, and probably some aggressive warnings should be issued beforehand since dvc knows the file sizes instead of simply forbidding a certain file size. Perhaps some people will still want to diff
even if it takes hours and outputs Gbs of data as well.
One extra care is that the previous revision from the git
POV might not have the output file in the local cache, in which case it's necessary to either do a dvc pull
or in the worst case a retroactive dvc repro
.
Relevant new development: https://youtu.be/fw6P6VFPo24
@villasv I agree with you. File size limitation is definitely not a way to go. We might add a warning, but considering other tools(e.g. plain diff and git diff) don't do that, I think that we are pretty safe not printing anything as well. If the operation takes too long user could just CTRL+C it as usual :)
Missing local cache is also a great point and we will be sure to account for that(e.g. maybe something like proposed --fetch for dvc metrics
).
Thanks for the link!
Just a heads up. I investigated the library I mentioned (tdda) but it wasn't of much help. I made a small script that achieves what I want for CSV and JSONLines. I'm pretty sure if this was incorporated in DVC it would be a bit cleaner, because I woudn't need to invoke so many subprocesses, but I decided to implement it standalone instead of inside a PR just for a Proof of Concept, and maybe we don't really want to put something so file-type-specific and formatting-opinionated into DVC (yet).
Perhaps a dvc plugin? The first of its kind?
My goal was to inspect each line separately and aggregate additions (pprint
the whole item), deletions (print
the id of the deleted item) and editions (pprint
a jsondiff
).
Examples using my own project:
$ python code/reftest.py diff --data-file data/orgs.txt --index-path wikidata.@id
Comparing data/orgs.txt with ./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291
Removed:
http://www.wikidata.org/entity/Q0
http://www.wikidata.org/entity/Q1
---
Edited:
http://www.wikidata.org/entity/Q414163 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Academy '
'of '
'Sciences '
'and '
'Literature '
'Mainz'}}}}}}}}
http://www.wikidata.org/entity/Q503473 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': {insert: [(2,
'UNIGE')]}}}}}}}}}
http://www.wikidata.org/entity/Q797585 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': ['Babasaheb '
'Ambedkar '
'University',
'Bhimrao '
'Ambedkar '
'University']}}}}}}}}
http://www.wikidata.org/entity/Q1619487 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Willy-Brandt-school'}}}}}}}}
---
Added:
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q28704642',
'i18n': {'en': {'label': 'École Française Internationale de Kiev',
'names': ['EFIK',
'École française internationale de '
'Kiev']}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57420477',
'i18n': {'en': {'label': 'Kerch Polytechnic College',
'names': []}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57428768',
'i18n': {'en': {'label': None, 'names': []}}}}
---
Total: R(2) / E(1) / A(1)
$ python code/reftest.py diff --data-file data/orgs.csv --index-col 0
Comparing data/orgs.csv with ./.dvc/cache/99/769e4afd29269400056d9b17cc39a8
Removed:
---
Edited:
Q414163 {delete: [9], insert: [(9, 'Academy of Sciences and Literature Mainz')]}
Q1619487 {delete: [9], insert: [(9, 'Willy-Brandt-school')]}
Q1698809 {delete: [9], insert: [(9, 'Johanneum Breslau')]}
Q3438435 {delete: [9], insert: [(9, 'San Marcos University')]}
---
Added:
['Q28704642',
'',
'',
'',
'',
'',
'',
'',
'',
'École Française Internationale de Kiev',
'',
'',
'',
'',
'']
['Q57420477',
'',
'',
'',
'',
'',
'',
'',
'',
'Kerch Polytechnic College',
'',
'',
'',
'',
'']
['Q57428768', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
---
Total: R(0) / E(4) / A(3)
The script:
import csv
import difflib
import fire
import json
import jsondiff
import pprint
import yaml
import subprocess
import sys
class ReftestManager:
def __init__(self, data_file):
self._new = data_file
self._old = data_file + '.dvc'
# now we translate _old to its true cache path
self._dvc_root = subprocess.Popen(
['dvc', 'root'],
stdout=subprocess.PIPE,
).communicate()[0].rstrip().decode('utf-8')
dvc_file = subprocess.Popen(
['git', 'show', f'HEAD:{self._old}'],
stdout=subprocess.PIPE,
).communicate()[0].rstrip().decode('utf-8')
dvc_spec = yaml.load(dvc_file)
cache_md5 = next(
out['md5']
for out in dvc_spec['outs']
if out['path'].split('/')[-1] == data_file.split('/')[-1]
)
self._old = f'{self._dvc_root}/.dvc/cache/{cache_md5[:2]}/{cache_md5[2:]}'
print(f"Comparing {self._new} with {self._old}")
def diff(self, index_col=None, index_path=None):
if not ((index_col is not None) ^ (index_path is not None)):
raise ValueError("Inform either index col or index path")
with open(self._new) as fnew:
new_lines = fnew.readlines()
with open(self._old) as fold:
old_lines = fold.readlines()
diff = [
line for line in
difflib.unified_diff(
old_lines, new_lines,
fromfile=self._old, tofile=self._new, n=0,
)
if not line.startswith('---')
and not line.startswith('+++')
and not line.startswith('@@')
]
rmved = [line[1:] for line in diff if line[0] == '-']
added = [line[1:] for line in diff if line[0] == '+']
if index_col is not None:
rmved = {
line[index_col]: line
for line in csv.reader(rmved)
}
added = {
line[index_col]: line
for line in csv.reader(added)
}
if index_path is not None:
def follow_path(json, key_chain):
for key in key_chain:
json = json[key]
return json
key_chain = index_path.split('.')
rmved = {
follow_path(json.loads(line), key_chain): json.loads(line)
for line in rmved
}
added = {
follow_path(json.loads(line), key_chain): json.loads(line)
for line in added
}
edits = {k:(v,added[k]) for k,v in rmved.items() if k in added}
rmved = {k:v for k,v in rmved.items() if k not in edits }
added = {k:v for k,v in added.items() if k not in edits }
print("Removed:")
for k in rmved:
print(k)
print("---")
print("Edited:")
for k,(old,new) in edits.items():
print(k, pprint.pformat(jsondiff.diff(old, new, syntax='explicit')))
print("---")
print("Added:")
for k in added:
pprint.pprint(added[k])
print("---")
print(f"Total: R({len(rmved)}) / E({len(edits)}) / A({len(added)})")
if __name__ == "__main__":
fire.Fire(ReftestManager)
Unfortunately, because I invoke dvc
as a subproccess, sometimes I get the output of dvc root
while checking for updates...:
FileNotFoundError: [Errno 2] No such file or directory: 'Checking for updates...\n./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291'
:-) but that's really minor
Hi @villasv !
This looks amazing! I see no reason to make it a plugin, since it looks very suiting to be a part of core dvc functionality. Would you like to file a PR with dvc diff
? Please feel free to ping us if you need any help.
Thanks, Ruslan
@villasv Btw, we have introduced a community chat recently at https://dvc.org/chat and we would be very honored to have you there :slightly_smiling_face:
Big thanks to @django-kz for contributing dvc diff
command in https://github.com/iterative/dvc/pull/1778 ! :tada: :rocket: Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it :slightly_smiling_face: dvc diff
is going to be released in 0.35.1 today, please feel free to give it a try!
Hi, I just discovered this issue mentioned on Discord. Here's an idea: Not ideal but we could provide a short guide in the docs for now on how to use git checkout
, dvc checkout
, copy the tracked text file(s) in question (from 2 different versions) to /tmp/a/ and /tmp/b/ , and run diff
on them manually.
That would be amazing @jorgeorpinel ! Great idea!
Don't like to put this hack into the official docs to be honest. Especially if we are thinking about implementing this in the future. I would document this workaround in this ticket and may be put a link from the doc.
Or a trick I’ve seen around: a question and semi-official-but-ok-to-become-dated answer on Stack Overflow
OK here's a quick example on how to do it based on https://github.com/iterative/example-get-started (assumes the project has been cloned and user moved into the local repo). Also, we'll focus on files model.pkl
and auc.metric
:
EDIT: The behavior of dvc diff
has changed slightly over time, but the general idea should still be valid.
$ dvc pull -aT
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ dvc diff HEAD^
dvc diff from 30a96ce to 6c73875
...
diff for 'model.pkl'
-model.pkl with md5 a66489653d1b6a8ba989799367b32c43
+model.pkl with md5 3863d0e317dee0a55c4e59d2ec0eef33
...
diff for 'auc.metric'
-auc.metric with md5 f58e5ccd66bf1195b53f458e7f619ab8
...
$ mkdir /tmp/a
$ cp model.pkl /tmp/a/model.pkl
$ cp auc.metric /tmp/a/auc.metric
$ git checkout HEAD^
...
$ dvc checkout
[##############################] 100% Checkout finished!
$ mkdir /tmp/b
$ cp model.pkl /tmp/b/model.pkl
$ cp auc.metric /tmp/b/auc.metric
$ diff /tmp/a/model.pkl /tmp/b/model.pkl
Binary files /tmp/a/model.pkl and /tmp/b/model.pkl differ
Finally, use diff
once all the changed files of interest are ready to be compared:
$ diff /tmp/a/auc.metric /tmp/b/auc.metric
1c1
< 0.602818
---
> 0.588426
UPDATE: Link to this comment added to https://dvc.org/doc/command-reference/diff in https://github.com/iterative/dvc.org/pull/490/commits/9ce552a426387731c1a62f81e57b0a4fc4c7353e
dvc diff
cmd ref (possibly removing the note that links to this issue) and tam completion scripts (if needed) when/if this is addressed.Hi, I have read the docs and this thread but I still have problems with dvc diff
command. Probably I'm missing something and I would be glad if someone will correct me. I have a problematic use case which I describe below.
I have a directory with several binary files under control of dvc. Whenever I execute dvc diff
I get statistics about changed files. For example:
$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir
116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB
The output is not really helpful because it doesn't provide any information what exactly has changed in the directory. Of course it shows me that some files have changed but I knew it anyway. Since I work with binary files I don't think I need line-by-line diff but rather list of changed binary files. The possible output for dvc diff
may look like this:
$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir
116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB
Modified:
-data/myfiles/117.jpg
Added:
-data/myfiles/118.jpg
P.S.:
Of course I can checkout different versions of data/myfiles
, save them to temporary directories and then calculate hash sums for each pair of files in those temporary directories but it does seems like too much work for such a simple task. I really hope there is a better way.
Thank you @nik123 ! I've just created a detailed issue about the dvc diff
output #2982. Please vote for it! And please add comments if I missed anything.
The output is not really helpful because it doesn't provide any information what exactly has changed in the directory.
@nik123 If the directory was dvc add
ed as a whole, currently DVC doesn't examine what's inside so you only get those general stats. We are working on providing this kind of granularity for all of our commands though 🙂 Please see Dmitry's https://github.com/iterative/dvc/issues/2982#issuecomment-568715481 like he mentioned.
@nik123 , @dmpetrov - thank you for your response! The issue #2982 is exactly what I need
Big thanks to @django-kz for contributing
dvc diff
command in #1778 ! 🎉 🚀 Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it 🙂dvc diff
is going to be released in 0.35.1 today, please feel free to give it a try! @efiop
Copied from Discord, qna channel:
I think what the comment there meant to me, and what my use case needs, is a data diff. Let's say I changed the dataset, then committed the change. How do I know what changed? Right now dvc diff only shows what files changed. It would've been so beneficial if it shows something like git diff with a line by line diff
Hi @ammarasmro
It would've been so beneficial if it shows something like git diff with a line by line diff
Right, this is discussed above. There are probably thousands of data formats so this is not easy as with regular diff which only works on plain text files. Do you have a more specific use case? E.g. a certain data format or a set of formats? Thanks
So the particular use case that raised this issue was a text dataset. The process we, the ML team, have used, is that after the data team gets data in any format, they process it and we get it as CSV, TSV, TXT, JSON, .md files. So alot of our experiments at least start with a text data file. We also use other formats like pickle
but I'm guessing diffing would be more complicated than text.
Would be great to have line-by-line comparison at least for "text-based" files (TXT, CSV, JSON, MD, YAML, TOML, XLM, ...). By the way, great job with dvc
, it's making my life easier!
@ammarasmro and @stefanocoretta for now you can refer to the https://github.com/iterative/dvc/issues/770#issuecomment-512693256 above for a procedure to do that. It's not clear that we want such a feature for really large files, even when they're plaintext. But yes, maybe!
a request from a high-priority client:
dvc get --rev commit_sha -o myfile.csv.temp . myfile.csv
diff myfile.csv.temp myfile.csv
Should it fall within DVC's feature set? Maybe we can find better diff tools to recommend instead.
The request was mostly about an easy way to compare any text files versioned by DVC and not so much about showing data format-specific diff.
I presume many people would expect dvc diff
to work similarly to git diff
, but for files versioned by DVC.
people would expect dvc diff to work similarly to git diff, but for files versioned by DVC
Maybe new users who are familiar with Git. But would DS/ ML Engs expect a printed line-to-line comparison of huge dataset versions?
Maybe dvc diff --targets
can auto-detect small text files (or/and have another flag and/or config param for this) and print a diff instead of/ in addition to a summary. Cc @dberenbaum
It makes sense to me as a feature request. Probably not as the default, but if people specifically want to see the diff of the contents of the files, I see no reason we shouldn't let them. It's more an issue of warning the users it could be expensive and/or unhelpful, and I'm not sure when we could prioritize this since there are pretty straightforward workarounds.
In terms of implementation, maybe some sort of hidden DVC experiments (custom Git refs) can be used and then literally git diff
the file versions internally.
We shouldn't be staging things in git that are supposed to be gitignored, even in hidden refs.
I think what we want here is just something equivalent to git's .gitattributes + difftool support: https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
So you would tell DVC what external diff tool you want it to use for what file extensions. And then based on file extension, we would just automatically call the correct tool depending on what file the user is trying to diff. This also lets you extend support for diffing beyond text files. So you could do things like pass image formats into gui image diffing tools, and arbitrary binary files into binary diff tools, etc
@alex000kim If these are only single-entry .dvc
files, they can get text diffs with git diff
using the external diff approach above.
Create dvcfile2text.py
at the root of the repo that looks like:
import ruamel.yaml as yaml
import sys
from dvc.repo import Repo
dvcfile = sys.argv[1]
with open(dvcfile) as f:
content = yaml.safe_load(f)
md5 = content["outs"][0]["md5"]
repo = Repo()
path = repo.odb.local.oid_to_path(md5)
with open(path) as f:
print(f.read())
The add to .git/config
:
[diff "dvcfile"]
textconv = python dvcfile2text.py
cachetextconv = true
And finally add to .gitattributes
:
*.dvc diff=dvcfile
This is hacky and won't work for directories, but a dev could probably make it robust and working with directories without too much effort.
We could consider providing external diff drivers for git that diff .dvc and dvc.lock files, starting with showing output similar to what dvc diff
does today (except it would be shown inside git diff
). Not something that's likely to be a high priority right now, though.
Great, thanks for the example, however hacky :)
Not something that's likely to be a high priority right now, though.
Understood.
A relatively simple way to support tooling like this without worrying about plugins for DVC itself would be to allow dvc get
to redirect the file contents to stdout instead of copying them to the working directory. Here's an example that doesn't work, but demonstrates the workflow that would be possible, using jd
for JSON diffs:
dvc get --rev REV-HASH -o /dev/stdout . data/output.json --force | jd data/output.json
Right now this command fails because it's attempting to write a temp file to /dev and can't. It seems like it could be made to work without relying on /dev/stdout
, though.
dvc
plays really well withgit
, but one thing that I still miss in a data version control system that I really value in source version control systems is the tooling to inspect patches. Because data should be a deterministic reproducible output from the source code, almost all important changes are found in the code history. But frequently I also want to inspect what changed in the data.Concrete scenario: let's say that I have an input csv and a transformation script that outputs a sanitized version of that input file. Then, I make a very small change in the sanitizing strategy, and run the command again. I can see that the output file changed hash, so I know that my code indeed changed behavior. But 99.999% of the outputs data points stayed the same, just a very minor portion of the file changed.
How can I inspect the data points that changed? Or the files in a very large output directory that changed? I can't
git diff
those files anymore because they're ignored. I believe that this is achievable with dvc.