iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Cyclic dependencies #296

Closed kskyten closed 6 years ago

kskyten commented 6 years ago

Thanks for making dvc. It looks really interesting.

I'm trying to use Bayesian optimization to optimize the hyperparameters of a model. This creates a cyclic dependency in the workflow. Executing dvc run python code.py multiple times with the following minimal example updates the state of the file 'data/res.txt', but reverting back to an older commit does not revert the changes. I think adding cyclic dependencies would be a useful feature to have. What kind of changes would be needed to add it?

# increment the count in the file 'data/res.txt' by one
with open('data/res.txt', 'r+') as f:
    n = int(f.read()) + 1
    f.seek(0)
    f.write(str(n))
    f.truncate()
dmpetrov commented 6 years ago

Hi @kskyten, thank you for your kind words about DVC.

You have an interesting scenario - like auto-ML. We designed DVC with these kinds of scenarios in mind. Not all of our ideas made it into the first version (the currently released one) - more scenarios are coming in the next version.

First, let's discuss how we can avoid your issue. With your current code you create cyclic dependencies through symlinks: data/res.txt version N --> data/res.txt version N-1 --> data/res.txt version N-2 --> ... --> the current data/cache file. So, when you do git checkout PREV_COMMIT you actually get the right symlink for the right version, but it still points to the latest data/cache file.

To avoid the issue, you should not write to an existing symlinked file directly. I'd read the data file first, delete the file, and then write the incremented value. After that, git checkout will work properly. Do not forget dvc repro, because dvc remove removes cache files by default.

mkdir myrepo_old/
cd myrepo_old/
git init
dvc init

dvc run echo "0" --stdout data/res.txt
cat data/res.txt
# 0 expected

# A single increment
VAL=`cat data/res.txt`
dvc remove data/res.txt  # Keep cache. So, you won't need dvc repro
dvc run echo $(($VAL + 1)) --stdout data/res.txt
cat data/res.txt
# 1 expected

# A batch increment
for num in {0..32}; do
    VAL=`cat data/res.txt`
    dvc remove data/res.txt
    dvc run echo $(($VAL + 1)) --stdout data/res.txt
done
cat data/res.txt
# 34 expected

# Git checkout works!
git checkout HEAD~12 -b alpha_optimum
dvc repro
cat data/res.txt
# 28 expected
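The symlink mechanics described above can be reproduced in a few lines of plain Python - a toy sketch of the mechanism only, not DVC's actual cache code (paths like cache/v1 and cache/v2 are made up):

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())

# Simulate DVC's layout: the working file is a symlink into a cache directory.
os.makedirs("cache")
os.makedirs("data")
with open("cache/v1", "w") as f:
    f.write("0")
os.symlink(os.path.join("..", "cache", "v1"), os.path.join("data", "res.txt"))

# Naive in-place write: open() follows the symlink, so this
# mutates the cached "old" version - cache/v1 now contains "1".
with open("data/res.txt", "r+") as f:
    n = int(f.read()) + 1
    f.seek(0)
    f.write(str(n))
    f.truncate()

# Safe pattern: read, unlink the symlink, write a fresh file.
val = int(open("data/res.txt").read())
os.unlink("data/res.txt")            # removes only the link, not cache/v1
with open("cache/v2", "w") as f:
    f.write(str(val + 1))
os.symlink(os.path.join("..", "cache", "v2"), os.path.join("data", "res.txt"))
# cache/v1 is left intact, so older commits can still resolve to it.
```

With the safe pattern, each version of data/res.txt ends up backed by its own cache file, which is what makes git checkout of an older commit meaningful.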

I'd appreciate it if you could share more details. This will help us support more scenarios.

  1. What is your goal in this step? Are you trying to find the optimum value for some metrics? Is the metric located in a different data file?
  2. Have you considered creating branches for each of the iterations? Like alpha_iter_0, .. alpha_iter_34.

Thank you for using DVC.

kskyten commented 6 years ago

Thanks for the great response. I now realize that my example was a bit naive. Unfortunately, I couldn't get your example working. I got the following error: Config file error: can't find aws credentials. Is it possible to run DVC locally without AWS?

A typical use for Bayesian optimization in machine learning is to optimize the hyperparameters of a model (i.e. tune the parameters of your model to minimize the test set error). Here's some more information on the subject: Practical Bayesian Optimization of Machine Learning Algorithms.
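For readers unfamiliar with the technique: Bayesian optimization fits a probabilistic surrogate (commonly a Gaussian process) to the hyperparameter-to-loss mapping and picks each next trial by maximizing an acquisition function such as expected improvement. A self-contained toy sketch with NumPy - the objective function, kernel length scale, and evaluation budget are all invented for illustration and are unrelated to ELFI/BOLFI's actual implementation:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def objective(x):
    # Hypothetical stand-in for an expensive model evaluation
    # (e.g. validation loss as a function of one hyperparameter).
    return np.sin(3.0 * x) + 0.5 * x

def rbf(a, b, length_scale=0.5):
    d = np.subtract.outer(a, b)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    # Textbook Gaussian-process regression with an RBF kernel.
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf(x_obs, x_query)
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    mu = k_s.T @ alpha
    v = np.linalg.solve(chol, k_s)
    var = 1.0 - np.sum(v ** 2, axis=0)      # rbf(x, x) == 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI acquisition for minimization.
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (best - mu) * cdf + sigma * pdf

grid = np.linspace(-2.0, 2.0, 201)
xs = list(rng.uniform(-2.0, 2.0, 3))        # a few initial random evaluations
ys = [float(objective(x)) for x in xs]

for _ in range(15):
    mu, sigma = gp_posterior(np.array(xs), np.array(ys), grid)
    acq = expected_improvement(mu, sigma, min(ys))
    x_next = float(grid[np.argmax(acq)])    # next trial the surrogate suggests
    xs.append(x_next)
    ys.append(float(objective(x_next)))

best_y = min(ys)
```

The point relevant to this thread: each loop iteration depends on all previous evaluations, which is exactly the linear, state-accumulating history that makes the workflow look "cyclic" to a DAG-based tool.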

My specific use case is applying Bayesian optimization for simulator based statistical models (BOLFI). These statistical models contain arbitrary simulator code so encoding the statistical models as workflows with user provided components would be great. I am one of the developers of ELFI, which is a framework for likelihood free inference. I think building a likelihood free inference library on top of something like DVC would provide more flexibility, better provenance data and make collaboration between researchers easier.

I am also very excited about DataLad, which is similar to DVC but focuses more on data than on workflow. DataLad is built on top of git-annex, which was mentioned in another issue (#211). In my particular case, I would like to use DataLad for distributing the different components of the statistical models in a collaborative way. It also has a way of reproducing results similar to DVC, but the workflow is encoded in the git graph, so as far as I know it is not possible to change a component in the middle and re-run the same workflow. Ideally, I would like to create a new branch when a component is changed and then re-run the workflow. It would be awesome if it were possible to use DVC and DataLad interoperably.

I also looked at Pachyderm, which is quite interesting, but not entirely suitable for my purposes since it defines the workflows explicitly.

Using branches/tags for each iteration sounds like a good idea.

dmpetrov commented 6 years ago

Thank you for sharing that. ELFI looks very interesting. I'll definitely keep an eye on this project and will try to use it when I have a chance.

Issues first... The AWS requirement is an issue in the old version. It will be fixed in the coming release. You can easily mitigate it by creating empty credentials. DVC works fine without any cloud.

mkdir ~/.aws
printf "[default]\naws_access_key_id =\naws_secret_access_key =\n" > ~/.aws/credentials

For Windows create C:\Users\%USERNAME%\.aws\credentials file.

Auto-ML support. I can show you what is going to be implemented in the next DVC version, and I'd love to know your opinion about our approach. Will it solve your problem? How can we improve it?

In general, I'd like to keep all experiments in separate branches and then either incorporate the best result into the major model (major model = master branch, for example) or skip all the results and move on to other parameter-tuning experiments.

# Try alphas from 0.01 to 0.09 and keep them in branches 'alpha_0X'
for i in {1..9}; do
    ALPHA="0.0$i" # makes 0.01, 0.02, ..., 0.09
    # Change the workflow in a separate new branch
    dvc run python mycode.py $ALPHA data/input.csv data/output.txt --new-branch alpha_0${i}
    # Rerun the workflow in an existing branch
    dvc repro data/eval.txt --branch alpha_0${i}
done

# Look at the target metric. We assume the target file contains a single float number.
dvc find all --branch-regex  'alpha_0.' --show-metrics data/eval.txt
# alpha_01: 0.671834
# alpha_02: 0.671917
# alpha_03: 0.672381
# ...

# Find a branch with the best target metric in a programmatic way
best_alpha_branch=`dvc find max --branch-regex  'alpha_0+' data/eval.txt`
echo $best_alpha_branch
# > alpha_06

# Merge the best result into "master"
git merge -X theirs $best_alpha_branch
dvc repro # Reproduction is actually not needed since the result was already produced

We've implemented a visualization tool to support this scenario - see the graph attached.

The other tools comparison.

DataLad looks interesting and git-annex has a lot of good ideas inside. Pachyderm is quite mature in this space but this is a data engineering tool, not modeling.

Before building the new DVC version (which is coming soon), we analyzed a lot of different approaches to data versioning. Our conclusions can be summarized as follows:

  1. We'd like to use the git-annex model (the current DVC model) to store data in caches, but a git-lfs-style interface to big data files (like native git).
  2. Using a special git server (as git-lfs does) is not the best idea. DVC repositories have to be git-compatible. We need to use the cloud for data instead of "special" git servers.
  3. We don't want to copy data from caches to the working area as git and git-lfs do - that is painful for 5GB+ files. Also, symlinks are not the best choice, so we replaced symlinks with hardlinks.
  4. We need to get rid of the predefined data directory, but we don't want to use file name patterns as git-lfs does via .gitattributes.

All of the above is already implemented in the coming version (in the DVC master branch). So, you can think of the next DVC version as a mix of git-annex and git-lfs, plus reproducibility - we keep the old model with some minor improvements. Cloud computing (running code in a cloud) will be implemented later - not in the next release, unfortunately.

The new DVC generates workflow graphs - see the attached image (workflow-5exp). Sorry, this one is not the best example - too detailed.

kskyten commented 6 years ago

Bayesian optimization Yes, I believe this will work; I'll have to experiment with it. The workflow you described actually seems more like grid search. In Bayesian optimization you are simultaneously learning a black-box function and minimizing it, so typically you want all the available data (i.e. up to the last iteration), and there is no need to search for the optimal value afterwards. Because of this, a linear history is more natural, but separating the iterations into their own branch(es) seems like a good idea.

Tools DataLad and git-annex have a lot of nice features, especially support for multiple storage backends, metadata, and convenient subdatasets (repos inside repos). From what I can see, the only drawback of git-annex is its limited support for Windows (same problem with DVC?) due to the use of symlinks. What is the advantage of using hardlinks compared to symlinks? I wonder if it is feasible for DVC to be compatible with git-annex and DataLad, so that you could use their data repositories to fetch the data and then use DVC for the workflow.

It seems that you can use a git-lfs like interface in the latest version of git-annex with the correct configuration.

Parallel computing In my use case, the computations are typically run in an HPC setting using something like SLURM. Typically, you would ssh to a login node, which is a computer in the cluster that you use to set up the computation. On the login node, you would run a command to queue your task for computation. This command could be recorded in DVC. The tricky part is handling the output of the computation. I'm not familiar enough with HPC to tell whether DVC would work in this setting. I think the results are stored on a shared disk that you can access from the login node, so potentially it could work.
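For the SLURM case described here, one pragmatic trick is sbatch's --wait flag, which blocks until the submitted job finishes: the whole queued computation then behaves like a single local command that a dvc run invocation could record, and outputs land on the shared filesystem as usual. A hedged sketch - the wrapper functions and the train.sh script are hypothetical, only the sbatch flags themselves are real SLURM options:

```python
import subprocess

def sbatch_command(script, *, job_name, time="01:00:00", wait=True):
    """Build an sbatch invocation as an argument list.

    With --wait, sbatch does not return until the job terminates,
    turning an asynchronous queue submission into one blocking command.
    """
    cmd = ["sbatch", f"--job-name={job_name}", f"--time={time}"]
    if wait:
        cmd.append("--wait")
    cmd.append(script)
    return cmd

def submit(script, **kwargs):
    # Runs on the login node; requires an actual SLURM installation.
    return subprocess.run(sbatch_command(script, **kwargs), check=True)
```

One would then record something like `dvc run python launch.py` where launch.py calls submit("train.sh", job_name="fit"); the metric files the batch job writes to the shared disk become ordinary DVC outputs.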

dmpetrov commented 6 years ago

Bayesian optimization. I think I got your scenario. Yes, a few separate branches (branch per experiment) are probably a better fit for gradient-based methods. DVC can handle your linear history scenario and visualize it. It would be great if you could share more details - we are always happy to implement new features.

Tools. DVC supports Windows - we have custom implementations of symlinks as well as hardlinks for Windows. Windows is one of the risky parts of git-annex. Interfaces to the cloud are another one - there are many assumptions in git-annex that are not natural: it works only with buckets (like mybucket) although a "bucket directory" (like mybucket/classifiers/cat-dog) is more practical, and it uses an internal format (with compression and internal names) although a more transparent format might be a better fit (keep synced files as they are). The git-lfs interface also has some issues - a data file is a special entity which cannot be identified by file name or file size (basically, .gitattributes).

Anyway, the current internal DVC sync is just a file backend which can be relatively easily replaced by git-annex or even git-lfs. But you brought up a very good point in favor of git-annex support - existing DataLad repositories and submodules #301 (supported by default in git-annex but not supported in the DVC file backend yet). We should revisit our vision of the file backend. I'll let you know a bit later.

Your parallel computing scenario matches our vision with a few small additions - you should initiate the communication (run commands) from your local DVC (dvc run --cloud-queue gpu-2gb python myscript.py input.p output.p --epoch 20). Also, I believe p2p communication is an essential part of this scenario, and repository sync (pushes and pulls, think git annex sync --content) is a good foundation for it. This is going to be a subject of the following DVC release (early next year), not the next one (December).

And thank you for the new feature requests you've created and for the feedback. We really appreciate it! Please feel free to share any other comments you have. We are at an early stage, and good new ideas can easily be incorporated into the DVC DNA.

dmpetrov commented 6 years ago

It seems like "Cyclic dependencies" was not the root cause of the issue. Closing...

Please feel free to reopen if I missed something.

yarikoptic commented 6 years ago

I hope you do not mind me chiming in with some clarifications on the previous discussion topics. re @kskyten's "I would like to use DataLad for distributing the different components of the statistical models in a collaborative way. It also has a way of reproducing results similar to DVC, but the workflow is encoded in the git graph so as far as I know it is not possible to change a component in the middle and re-run the same workflow. Ideally, I would like to create a new branch when a component is changed and then re-run the workflow." If a component is a script which you run, then you could change it and commit, and use datalad rerun without specifying --onto to get the commands recorded by datalad run rerun. The --branch option could be useful as a shortcut to establish a new branch.

re @dmpetrov's "Windows is one of risky part of git-annex": Agreed, although I would say "tricky", not "risky" ;) There are 3 solutions (suggesting that none is perfect yet):

re @dmpetrov's "git-annex ... works only with buckets (like mybucket) although a 'bucket directory' (like mybucket/classifiers/cat-dog) is more practical": add fileprefix=classifiers/cat-dog/ to the git annex initremote call.

re it uses internal format (with compressing and internal names) although a more transparent format might be a better fit (keep sync files as is)

re "What is the advantage of using hardlinks compared to symlinks?" Some tools/browsers (the file finder in OSX?) follow symlinks, so in the case of git-annex repositories you end up deep under .git/annex/objects. Cons of hardlinks: they are impossible across users; you cannot "lock" (prevent editing) one hardlinked file while leaving the other writable, and if you modify one, both are modified. The better technology is IMHO CoW file systems such as BTRFS; git-annex "utilizes" it (cp --reflink=auto ...) while getting content onto the local drive.
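The hardlink cons listed here are easy to demonstrate: a hardlink is a second directory entry for the same inode, so writing through either name mutates the shared content, and neither name can be protected independently. A small POSIX-only sketch (file names are arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "cache_copy")
b = os.path.join(d, "workspace_copy")

with open(a, "w") as f:
    f.write("original")
os.link(a, b)                 # hardlink: two names, one inode

print(os.stat(a).st_nlink)    # 2 - both names share the same data blocks

with open(b, "w") as f:       # editing through either name...
    f.write("modified")
print(open(a).read())         # ...changes the other too: prints "modified"
```

This is exactly why a cache protected only by hardlinks can be silently corrupted by an in-place edit in the workspace, whereas a reflink (CoW) copy diverges on first write.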

re "if it were possible to use DVC and DataLad interoperably": Well - you could still use DataLad as a git/git-annex frontend while using dvc to track the workflow. I guess, similar to our rerun --script (which extracts all recorded commands as a shell script), someone could provide another type of "export" generating all the .dvc files and Dvcfiles.

Hope this all is useful/informative to some degree ;-) Cheers, and keep up the good work!

dmpetrov commented 5 years ago

Thank you @yarikoptic! I'm very sorry for the delay - this thread was lost since the issue was closed.

Your clarifications are very helpful. Now (after a year) I agree about the Windows part as well as the AWS bucket part. However, I don't quite agree with the hardlink vs. symlink part, but DVC now supports CoW (via cache.type = reflink) and, yes, it is a better technology.

I like the idea of using DataLad (Git annex) as file storage and DVC to track workflow. It might be helpful to chat more about DataLad and use cases. Please let me know if you have time for a chat.

Thank you again for the feedback and the clarifications!

yarikoptic commented 5 years ago

I will be happy to chat! and I guess other DataLad folks as well. We typically have our DataLad Jitsi meeting each Friday 9am EST (but not next Friday for me at least, traveling), https://meet.jit.si/DataLad . You are most welcome to join if time works for you, just let me know which date so I am there for sure -- don't want to miss it. Otherwise, we will find time I bet (the same Jitsi url would work any time).

dmpetrov commented 5 years ago

@yarikoptic great! 9am EST is a bit early in my time zone. My schedule next week is fairly open. How about a chat after lunch (1pm-5pm EST) on Monday, Tuesday, or Friday? Please let me know what time is best for you and we will meet in your chat room.

yarikoptic commented 5 years ago

Sorry that I wasn't clear - I am traveling the entire next week. But the week after, if your availability holds, Tuesday the 15th should work, although probably not for the other half of the team - the Germans (@mih?). Should we schedule preliminarily for 1pm?

dmpetrov commented 5 years ago

Sure, 15th at 1pm (EST) works for me. Please let me know if anything changes.