iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.85k stars 1.18k forks source link

dvc diff or some dataset management tooling #770

Open villasv opened 6 years ago

villasv commented 6 years ago

dvc plays really well with git, but one thing that I still miss in a data version control system that I really value in source version control systems is the tooling to inspect patches. Because data should be a deterministic reproducible output from the source code, almost all important changes are found in the code history. But frequently I also want to inspect what changed in the data.

Concrete scenario: let's say that I have an input csv and a transformation script that outputs a sanitized version of that input file. Then, I make a very small change in the sanitizing strategy, and run the command again. I can see that the output file changed hash, so I know that my code indeed changed behavior. But 99.999% of the outputs data points stayed the same, just a very minor portion of the file changed.

How can I inspect the data points that changed? Or the files in a very large output directory that changed? I can't git diff those files anymore because they're ignored. I believe that this is achievable with dvc.

efiop commented 6 years ago

Hi @villasv !

Great idea! We will be sure to get to implementing it after 0.9.8.

Thanks, Ruslan

dmpetrov commented 6 years ago

Thank you @villasv. The idea is great!

But we should understand that in many cases dvc works in Gb size scale where diff has no meaning. The command might be abbused and many users can come up with a conclusion that dvc diff is a buggy\slow command.

Let's check file size before dvc diff and exit with an error if the size if exceeds the limit. The limit should be defined in dvc config file. Default value .... How do you think about 100Mb?

How do you guys think about introducing the limit?

efiop commented 6 years ago

@dmpetrov I thought about introducing a limit, but there is actually no good reason to do that, since we can simply leave it for user to decide whether he wants to kill dvc diff if it takes too long or if he wants to wait.

dmpetrov commented 6 years ago

btw... there are no diffs in HDFS. It is one more thing to consider.

just throwing out some idea - hide the FS specific and SCM\Git specific commands under a special command like dvc whatever diff

villasv commented 6 years ago

I think that performance is a valid concern, and probably some aggressive warnings should be issued beforehand since dvc knows the file sizes instead of simply forbidding a certain file size. Perhaps some people will still want to diff even if it takes hours and outputs Gbs of data as well.

One extra care is that the previous revision from the git POV might not have the output file in the local cache, in which case it's necessary to either do a dvc pull or in the worst case a retroactive dvc repro.

villasv commented 6 years ago

Relevant new development: https://youtu.be/fw6P6VFPo24

efiop commented 6 years ago

@villasv I agree with you. File size limitation is definitely not a way to go. We might add a warning, but considering other tools(e.g. plain diff and git diff) don't do that, I think that we are pretty safe not printing anything as well. If the operation takes too long user could just CTRL+C it as usual :)

Missing local cache is also a great point and we will be sure to account for that(e.g. maybe something like proposed --fetch for dvc metrics).

Thanks for the link!

villasv commented 6 years ago

Just a heads up. I investigated the library I mentioned (tdda) but it wasn't of much help. I made a small script that achieves what I want for CSV and JSONLines. I'm pretty sure if this was incorporated in DVC it would be a bit cleaner, because I woudn't need to invoke so many subprocesses, but I decided to implement it standalone instead of inside a PR just for a Proof of Concept, and maybe we don't really want to put something so file-type-specific and formatting-opinionated into DVC (yet).

Perhaps a dvc plugin? The first of its kind?

My goal was to inspect each line separately and aggregate additions (pprint the whole item), deletions (print the id of the deleted item) and editions (pprint a jsondiff).

Examples using my own project:

$ python code/reftest.py diff --data-file data/orgs.txt --index-path wikidata.@id
Comparing data/orgs.txt with ./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291
Removed:
http://www.wikidata.org/entity/Q0
http://www.wikidata.org/entity/Q1
---
Edited:
http://www.wikidata.org/entity/Q414163 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Academy '
                                                                           'of '
                                                                           'Sciences '
                                                                           'and '
                                                                           'Literature '
                                                                           'Mainz'}}}}}}}}
http://www.wikidata.org/entity/Q503473 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': {insert: [(2,
                                                                                      'UNIGE')]}}}}}}}}}
http://www.wikidata.org/entity/Q797585 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'names': ['Babasaheb '
                                                                            'Ambedkar '
                                                                            'University',
                                                                            'Bhimrao '
                                                                            'Ambedkar '
                                                                            'University']}}}}}}}}
http://www.wikidata.org/entity/Q1619487 {update: {'wikidata': {update: {'i18n': {update: {'en': {update: {'label': 'Willy-Brandt-school'}}}}}}}}
---
Added:
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q28704642',
              'i18n': {'en': {'label': 'École Française Internationale de Kiev',
                              'names': ['EFIK',
                                        'École française internationale de '
                                        'Kiev']}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57420477',
              'i18n': {'en': {'label': 'Kerch Polytechnic College',
                              'names': []}}}}
{'wikidata': {'@id': 'http://www.wikidata.org/entity/Q57428768',
              'i18n': {'en': {'label': None, 'names': []}}}}
---
Total: R(2) / E(1) / A(1)
$ python code/reftest.py diff --data-file data/orgs.csv --index-col 0
Comparing data/orgs.csv with ./.dvc/cache/99/769e4afd29269400056d9b17cc39a8
Removed:
---
Edited:
Q414163 {delete: [9], insert: [(9, 'Academy of Sciences and Literature Mainz')]}
Q1619487 {delete: [9], insert: [(9, 'Willy-Brandt-school')]}
Q1698809 {delete: [9], insert: [(9, 'Johanneum Breslau')]}
Q3438435 {delete: [9], insert: [(9, 'San Marcos University')]}
---
Added:
['Q28704642',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'École Française Internationale de Kiev',
 '',
 '',
 '',
 '',
 '']
['Q57420477',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Kerch Polytechnic College',
 '',
 '',
 '',
 '',
 '']
['Q57428768', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
---
Total: R(0) / E(4) / A(3)

The script:

import csv
import difflib
import fire
import json
import jsondiff
import pprint
import yaml
import subprocess
import sys

class ReftestManager:
    def __init__(self, data_file):
        self._new = data_file
        self._old = data_file + '.dvc'

        # now we translate _old to its true cache path

        self._dvc_root = subprocess.Popen(
            ['dvc', 'root'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')

        dvc_file = subprocess.Popen(
            ['git', 'show', f'HEAD:{self._old}'],
            stdout=subprocess.PIPE,
        ).communicate()[0].rstrip().decode('utf-8')
        dvc_spec = yaml.load(dvc_file)

        cache_md5 = next(
            out['md5']
            for out in dvc_spec['outs']
            if out['path'].split('/')[-1] == data_file.split('/')[-1]
        )

        self._old = f'{self._dvc_root}/.dvc/cache/{cache_md5[:2]}/{cache_md5[2:]}'

        print(f"Comparing {self._new} with {self._old}")

    def diff(self, index_col=None, index_path=None):
        if not ((index_col is not None) ^ (index_path is not None)):
            raise ValueError("Inform either index col or index path")

        with open(self._new) as fnew:
            new_lines = fnew.readlines()
        with open(self._old) as fold:
            old_lines = fold.readlines()

        diff = [
            line for line in
            difflib.unified_diff(
                old_lines, new_lines,
                fromfile=self._old, tofile=self._new, n=0,
            )
            if not line.startswith('---')
                and not line.startswith('+++')
                and not line.startswith('@@')
        ]

        rmved = [line[1:] for line in diff if line[0] == '-']
        added = [line[1:] for line in diff if line[0] == '+']

        if index_col is not None:
            rmved = {
                line[index_col]: line
                for line in csv.reader(rmved)
            }
            added = {
                line[index_col]: line
                for line in csv.reader(added)
            }

        if index_path is not None:
            def follow_path(json, key_chain):
                for key in key_chain:
                    json = json[key]
                return json
            key_chain = index_path.split('.')
            rmved = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in rmved
            }
            added = {
                follow_path(json.loads(line), key_chain): json.loads(line)
                for line in added
            }

        edits = {k:(v,added[k]) for k,v in rmved.items() if k in added}
        rmved = {k:v for k,v in rmved.items() if k not in edits }
        added = {k:v for k,v in added.items() if k not in edits }

        print("Removed:")
        for k in rmved:
            print(k)
        print("---")

        print("Edited:")
        for k,(old,new) in edits.items():
            print(k, pprint.pformat(jsondiff.diff(old, new, syntax='explicit')))
        print("---")

        print("Added:")
        for k in added:
            pprint.pprint(added[k])
        print("---")

        print(f"Total: R({len(rmved)}) / E({len(edits)}) / A({len(added)})")

if __name__ == "__main__":
    fire.Fire(ReftestManager)
villasv commented 6 years ago

Unfortunately, because I invoke dvc as a subproccess, sometimes I get the output of dvc root while checking for updates...:

FileNotFoundError: [Errno 2] No such file or directory: 'Checking for updates...\n./.dvc/cache/e0/6bd8b3a5945f043687c7a4256b0291'

:-) but that's really minor

efiop commented 6 years ago

Hi @villasv !

This looks amazing! I see no reason to make it a plugin, since it looks very suiting to be a part of core dvc functionality. Would you like to file a PR with dvc diff? Please feel free to ping us if you need any help.

Thanks, Ruslan

efiop commented 6 years ago

@villasv Btw, we have introduced a community chat recently at https://dvc.org/chat and we would be very honored to have you there :slightly_smiling_face:

efiop commented 5 years ago

Big thanks to @django-kz for contributing dvc diff command in https://github.com/iterative/dvc/pull/1778 ! :tada: :rocket: Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it :slightly_smiling_face: dvc diff is going to be released in 0.35.1 today, please feel free to give it a try!

jorgeorpinel commented 5 years ago

Hi, I just discovered this issue mentioned on Discord. Here's an idea: Not ideal but we could provide a short guide in the docs for now on how to use git checkout, dvc checkout, copy the tracked text file(s) in question (from 2 different versions) to /tmp/a/ and /tmp/b/ , and run diff on them manually.

efiop commented 5 years ago

That would be amazing @jorgeorpinel ! Great idea!

shcheklein commented 5 years ago

Don't like to put this hack into the official docs to be honest. Especially if we are thinking about implementing this in the future. I would document this workaround in this ticket and may be put a link from the doc.

villasv commented 5 years ago

Or a trick I’ve seen around: a question and semi-official-but-ok-to-become-dated answer on Stack Overflow

jorgeorpinel commented 5 years ago

OK here's a quick example on how to do it based on https://github.com/iterative/example-get-started (assumes the project has been cloned and user moved into the local repo). Also, we'll focus on files model.pkl and auc.metric:

EDIT: The behavior of dvc diff has changed slightly over time, but the general idea should still be valid.

$ dvc pull -aT
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ dvc diff HEAD^
dvc diff from 30a96ce to 6c73875
...
diff for 'model.pkl'
-model.pkl with md5 a66489653d1b6a8ba989799367b32c43
+model.pkl with md5 3863d0e317dee0a55c4e59d2ec0eef33
...
diff for 'auc.metric'
-auc.metric with md5 f58e5ccd66bf1195b53f458e7f619ab8
...
$ mkdir /tmp/a
$ cp model.pkl /tmp/a/model.pkl
$ cp auc.metric /tmp/a/auc.metric
$ git checkout HEAD^
...
$ dvc checkout
[##############################] 100% Checkout finished!
$ mkdir /tmp/b
$ cp model.pkl /tmp/b/model.pkl
$ cp auc.metric /tmp/b/auc.metric
$ diff /tmp/a/model.pkl /tmp/b/model.pkl 
Binary files /tmp/a/model.pkl and /tmp/b/model.pkl differ

Finally, use diff once all the changed files of interest are ready to be compared:

$ diff /tmp/a/auc.metric /tmp/b/auc.metric
1c1
< 0.602818
---
> 0.588426

UPDATE: Link to this comment added to https://dvc.org/doc/command-reference/diff in https://github.com/iterative/dvc.org/pull/490/commits/9ce552a426387731c1a62f81e57b0a4fc4c7353e

jorgeorpinel commented 4 years ago
nik123 commented 4 years ago

Hi, I have read the docs and this thread but I still have problems with dvc diff command. Probably I'm missing something and I would be glad if someone will correct me. I have a problematic use case which I describe below.

I have a directory with several binary files under control of dvc. Whenever I execute dvc diff I get statistics about changed files. For example:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory. Of course it shows me that some files have changed but I knew it anyway. Since I work with binary files I don't think I need line-by-line diff but rather list of changed binary files. The possible output for dvc diff may look like this:

$dvc diff -t data/myfiles 2a19a244a218088c8f1313f5c528d9cf878bb7af
diff for 'data/myfiles'
-data/myfiles with md5 51cf02d89b91e204a0876563823f2c90.dir
+data/myfiles with md5 3e5c51dbc302efbd9e9c821483a19e8f.dir

116 files untouched, 1 file modified, 1 file added, 0 files deleted, size was increased by 12.3 kB

Modified:
-data/myfiles/117.jpg

Added:
-data/myfiles/118.jpg

P.S.: Of course I can checkout different versions of data/myfiles, save them to temporary directories and then calculate hash sums for each pair of files in those temporary directories but it does seems like too much work for such a simple task. I really hope there is a better way.

dmpetrov commented 4 years ago

Thank you @nik123 ! I've just created a detailed issue about the dvc diff output #2982. Please vote for it! And please add comments if I missed anything.

jorgeorpinel commented 4 years ago

The output is not really helpful because it doesn't provide any information what exactly has changed in the directory.

@nik123 If the directory was dvc added as a whole, currently DVC doesn't examine what's inside so you only get those general stats. We are working on providing this kind of granularity for all of our commands though 🙂 Please see Dmitry's https://github.com/iterative/dvc/issues/2982#issuecomment-568715481 like he mentioned.

nik123 commented 4 years ago

@nik123 , @dmpetrov - thank you for your response! The issue #2982 is exactly what I need

ammarasmro commented 4 years ago

Big thanks to @django-kz for contributing dvc diff command in #1778 ! 🎉 🚀 Currently it shows a difference in a number of files and their sizes between git revisions. We could add actual data diff functionality on top of it 🙂 dvc diff is going to be released in 0.35.1 today, please feel free to give it a try! @efiop

Copied from Discord, qna channel:

I think what the comment there meant to me, and what my use case needs, is a data diff. Let's say I changed the dataset, then committed the change. How do I know what changed? Right now dvc diff only shows what files changed. It would've been so beneficial if it shows something like git diff with a line by line diff

jorgeorpinel commented 4 years ago

Hi @ammarasmro

It would've been so beneficial if it shows something like git diff with a line by line diff

Right, this is discussed above. There are probably thousands of data formats so this is not easy as with regular diff which only works on plain text files. Do you have a more specific use case? E.g. a certain data format or a set of formats? Thanks

ammarasmro commented 4 years ago

So the particular use case that raised this issue was a text dataset. The process we, the ML team, have used, is that after the data team gets data in any format, they process it and we get it as CSV, TSV, TXT, JSON, .md files. So alot of our experiments at least start with a text data file. We also use other formats like pickle but I'm guessing diffing would be more complicated than text.

stefanocoretta commented 4 years ago

Would be great to have line-by-line comparison at least for "text-based" files (TXT, CSV, JSON, MD, YAML, TOML, XLM, ...). By the way, great job with dvc, it's making my life easier!

jorgeorpinel commented 4 years ago

@ammarasmro and @stefanocoretta for now you can refer to the https://github.com/iterative/dvc/issues/770#issuecomment-512693256 above for a procedure to do that. It's not clear that we want such a feature for really large files, even when they're plaintext. But yes, maybe!

alex000kim commented 2 years ago

a request from a high-priority client:

dvc get --rev commit_sha -o myfile.csv.temp . myfile.csv
diff myfile.csv.temp myfile.csv
jorgeorpinel commented 2 years ago

Should it fall within DVC's feature set? Maybe we can find better diff tools to recommend instead.

alex000kim commented 2 years ago

The request was mostly about an easy way to compare any text files versioned by DVC and not so much about showing data format-specific diff. I presume many people would expect dvc diff to work similarly to git diff, but for files versioned by DVC.

jorgeorpinel commented 2 years ago

people would expect dvc diff to work similarly to git diff, but for files versioned by DVC

Maybe new users who are familiar with Git. But would DS/ ML Engs expect a printed line-to-line comparison of huge dataset versions?

Maybe dvc diff --targets can auto-detect small text files (or/and have another flag and/or config param for this) and print a diff instead of/ in addition to a summary. Cc @dberenbaum

dberenbaum commented 2 years ago

It makes sense to me as a feature request. Probably not as the default, but if people specifically want to see the diff of the contents of the files, I see no reason we shouldn't let them. It's more an issue of warning the users it could be expensive and/or unhelpful, and I'm not sure when we could prioritize this since there are pretty straightforward workarounds.

jorgeorpinel commented 2 years ago

In terms of implementation, maybe some sort of hidden DVC experiments (custom Git refs) can be used and then literally git diff the file versions internally.

pmrowla commented 2 years ago

We shouldn't be staging things in git that are supposed to be gitignored, even in hidden refs.

I think what we want here is just something equivalent to git's .gitattributes + difftool support: https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes

So you would tell DVC what external diff tool you want it to use for what file extensions. And then based on file extension, we would just automatically call the correct tool depending on what file the user is trying to diff. This also lets you extend support for diffing beyond text files. So you could do things like pass image formats into gui image diffing tools, and arbitrary binary files into binary diff tools, etc

dberenbaum commented 2 years ago

@alex000kim If these are only single-entry .dvc files, they can get text diffs with git diff using the external diff approach above.

Create dvcfile2text.py at the root of the repo that looks like:

import ruamel.yaml as yaml
import sys

from dvc.repo import Repo

dvcfile = sys.argv[1]

with open(dvcfile) as f:
    content = yaml.safe_load(f)
md5 = content["outs"][0]["md5"]

repo = Repo()
path = repo.odb.local.oid_to_path(md5)
with open(path) as f:
    print(f.read())

The add to .git/config:

[diff "dvcfile"]
    textconv = python dvcfile2text.py
    cachetextconv = true

And finally add to .gitattributes:

*.dvc diff=dvcfile

This is hacky and won't work for directories, but a dev could probably make it robust and working with directories without too much effort.


We could consider providing external diff drivers for git that diff .dvc and dvc.lock files, starting with showing output similar to what dvc diff does today (except it would be shown inside git diff). Not something that's likely to be a high priority right now, though.

alex000kim commented 2 years ago

Great, thanks for the example, however hacky :)

Not something that's likely to be a high priority right now, though.

Understood.

respatialized commented 3 months ago

A relatively simple way to support tooling like this without worrying about plugins for DVC itself would be to allow dvc get to redirect the file contents to stdout instead of copying them to the working directory. Here's an example that doesn't work, but demonstrates the workflow that would be possible, using jd for JSON diffs:

dvc get --rev REV-HASH -o /dev/stdout . data/output.json --force | jd data/output.json

Right now this command fails because it's attempting to write a temp file to /dev and can't. It seems like it could be made to work without relying on /dev/stdout, though.