iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Create an installation package for conda package manager #120

Closed gvyshnya closed 4 years ago

gvyshnya commented 6 years ago

Anaconda (https://www.continuum.io/what-is-anaconda) is the leading Python distribution for data science today. It has its own package manager, conda (https://conda.io/docs/index.html), which is an alternative to the well-known pip.

Since Anaconda and its lightweight, Python-only version Miniconda (https://conda.io/miniconda.html) are gaining more and more traction within the data science community these days, porting the DVC installer to conda could be a good step toward streamlining DVC usage across industrial analytics circles.

efiop commented 6 years ago

Hi @gvyshnya !

Thank you for your feedback! Anaconda was actually our first guess when we were developing installers for dvc (you can see traces of it in the git log). But considering that dvc is currently more of a standalone utility, we opted for pyinstaller to create a standalone binary for dvc and distribute it in the usual packages (rpm, deb, exe), and for pip to distribute it as a python package. That being said, we were thinking of creating an anaconda/miniconda package in the future, when dvc is more fit to be used as a library. We can now see that there is clear demand for it and will try to deliver it in the near future.

Casyfill commented 6 years ago

looking forward to conda support!

efiop commented 6 years ago

Fixed the issue with the download_url/url fields in our package info that didn't allow me to use conda skeleton pypi dvc on 0.9.5: https://github.com/dataversioncontrol/dvc/commit/79d710010471293a1184d1e66e7619c9bcc00ea0 This fix will be released in 0.9.6, and I'll be sure to get back to creating the conda package right after 0.9.6 is published on pypi.

efiop commented 6 years ago

Creating a conda package for dvc requires creating packages for all of its dependencies, as meta.yaml doesn't support pip dependencies for conda packages, only for environments. That makes creating a conda package for dvc time-consuming and tedious. If anyone from the community feels like working on it, please feel free to do so. For now, considering that we provide (among others) a pip package, which can be specified in a conda env as a dependency, I don't see a real need for a conda package right now and might revisit this issue in releases after 0.9.7.

efiop commented 5 years ago

Closing as stale. Please feel free to reopen if you feel like working on this.

yfarjoun commented 5 years ago

Conda seems to have better support for creating identical and consistent environments on different platforms. For example, my development env is OSX (my laptop) but production is Ubuntu linux. I need to make sure that there are no differences in the packages installed on the two environments and that I am able to easily spin up a new machine with the same packages...

tfenne commented 5 years ago

I agree with @yfarjoun. There are a few reasons why it would be really nice to have a recipe for dvc in one of the main conda channels:

  1. Convenience. In my projects I prefer to create reproducible environments with conda. While one can obviously install packages using pip into an environment created by conda, that's both significantly less convenient (and more awkward to automate) and makes it much harder to generate reproducible environments.
  2. Reproducibility. Conda was, as I understand it, largely invented because the existing package management solutions in python space (including pip) did not provide ways to make fully reproducible environments. Conda now includes many non-python packages, and is largely the default way to install native (e.g. C, C++) bioinformatics packages as well as python installations and packages. It is much tougher to make a reproducible environment where conda does 90% of the setup and pip then installs packages. This is particularly difficult when the pip packages drag in a lot of dependencies and some of those are shared with packages already installed via conda. Since running pip install dvc[s3] in a bare environment installs 38 packages, that's quite challenging.
  3. Irony? Sorry if this is too tongue-in-cheek, but it just seems ironic to me that a package whose goals are to provide reproducibility in data science is installed in ways that make reproducibility of the installation difficult!

efiop commented 5 years ago

@yfarjoun @tfenne Thank you guys for all the feedback! We really appreciate it! Reopening this issue 🙂

Guys, btw, could you elaborate on why using

dependencies:
  - pip:
    - dvc==0.32.1

in your conda env is not reproducible?

tfenne commented 5 years ago

Thanks @efiop. This is essentially the strategy I'm using, but it's a bit more complicated than that. What that section actually ends up looking like is more like this:

  - pip:
    - appdirs==1.4.3
    - asciimatics==1.10.0
    - boto3==1.7.4
    - botocore==1.10.84
    - chardet==3.0.4
    - colorama==0.4.1
    - configobj==5.0.6
    - configparser==3.7.3
    - contextlib2==0.5.5
    - decorator==4.4.0
    - distro==1.4.0
    - docutils==0.14
    - dvc==0.32.1
    - future==0.17.1
    - gitdb2==2.0.5
    - gitpython==2.1.11
    - grandalf==0.6
    - idna==2.8
    - jmespath==0.9.4
    - jsonpath-rw==1.4.0
    - msgpack==0.6.0
    - nanotime==0.5.2
    - networkx==2.2
    - ply==3.11
    - pyasn1==0.4.5
    - pyfiglet==0.8.post1
    - requests==2.21.0
    - s3transfer==0.1.13
    - schema==0.7.0
    - smmap2==2.0.5
    - urllib3==1.24.1
    - wcwidth==0.1.7
    - zc.lockfile==1.4

... because without pinning the versions of all the dependencies, it's hard to guarantee reproducibility. Currently this is working because where dvc requires a package that is previously installed by conda (in my env) the version that's installed satisfies the requirement. But if it required an earlier or later version that would start to be difficult to manage.
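To make that failure mode concrete, here is a minimal Python sketch (not part of dvc or conda) that compares the pinned pip entries against versions conda has already installed and reports any mismatch. The helper names `parse_pins` and `find_conflicts` and the conda versions in the example are hypothetical, chosen only for illustration.

```python
def parse_pins(lines):
    """Parse 'name==version' entries (optionally YAML list items) into a dict."""
    pins = {}
    for line in lines:
        # Drop indentation and a leading YAML list marker such as "    - ".
        entry = line.strip().lstrip("- ").strip()
        if "==" not in entry:
            continue  # skip anything that is not an exact pin
        name, version = entry.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(pip_pins, conda_versions):
    """Return {name: (pip_version, conda_version)} where the two disagree."""
    conflicts = {}
    for name, pip_version in pip_pins.items():
        conda_version = conda_versions.get(name)
        if conda_version is not None and conda_version != pip_version:
            conflicts[name] = (pip_version, conda_version)
    return conflicts


if __name__ == "__main__":
    pip_section = [
        "    - requests==2.21.0",
        "    - urllib3==1.24.1",
        "    - dvc==0.32.1",
    ]
    # Hypothetical versions already installed by conda in the same env.
    conda_installed = {"requests": "2.21.0", "urllib3": "1.25.0"}
    print(find_conflicts(parse_pins(pip_section), conda_installed))
```

An empty result means every shared package happens to agree, which is the situation described above; any entry in the result is a package where pip and conda would fight over the version.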

J0 commented 5 years ago

@efiop just curious, is anyone actively working on this issue? If not, it seems like something I wouldn't mind working on over the next week.

efiop commented 5 years ago

@J0 That would be amazing! 🙂 No, no one is working on it right now. Thank you so much for looking into this!

brbarkley commented 5 years ago

FYI, the outstanding DVC dependencies that do not have a conda build are:

For DVC to provide a conda build, I believe the above packages will also need a conda build. See the contributing packages guidelines on conda-forge. The process for porting a PyPI package to conda-forge is becoming increasingly streamlined, but it is still not a trivial task.

I would like to see DVC on conda but currently do not have the time to assist on this issue.

ei-grad commented 5 years ago

Started to work on this. A basic meta.yaml for dvc is here - https://github.com/ei-grad/staged-recipes/blob/dvc/recipes/dvc/meta.yaml.

About dependencies:

@ei-grad: It is a bit unclear: if I want to add a package with dependencies which are not already on conda-forge, should I put these dependencies in the same pull request with the package I want to add? Or should it be a separate PR for each dependency?

@chrisburr: Both will work but you should consider:

  1. If the recipes are complex, a separate PR will be easier to review.
  2. If you do it in one PR, the first feedstock build will fail due to missing dependencies, so you'll have to restart it ~an hour later.
  3. Multiple PRs can take longer to get reviewed.

I guess it is better to put them in the same PR as DVC.

Btw, @brbarkley could you please share how you got the list of outstanding dependencies?

brbarkley commented 5 years ago

@ei-grad I manually went through DVC’s dependency list and searched for them on conda-forge.
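That manual check could be partly scripted. Below is a rough Python sketch, assuming you already have the list of package names available on the channel (for example, copied from a search of conda-forge). The function names and the channel contents in the example are made up for illustration and are not the real state of conda-forge.

```python
import re


def requirement_name(requirement):
    """Extract the bare package name from a string such as 'boto3>=1.7.4'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(1).lower()


def missing_from_channel(requirements, channel_packages):
    """Return the sorted requirement names that are absent from the channel."""
    available = {name.lower() for name in channel_packages}
    wanted = {requirement_name(req) for req in requirements}
    return sorted(wanted - available)


if __name__ == "__main__":
    requirements = ["boto3>=1.7.4", "zc.lockfile==1.4", "grandalf==0.6"]
    # Hypothetical channel contents, for illustration only.
    channel_packages = ["boto3", "requests"]
    print(missing_from_channel(requirements, channel_packages))
```

Package names sometimes differ between PyPI and conda, so anything reported as missing still deserves a manual look before concluding that a new recipe is needed.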

ghost commented 5 years ago

looks like there's already a version on conda cloud: https://anaconda.org/derickl/dvc 👀

efiop commented 5 years ago

That guy seems to have packaged everything that is needed, including grandalf: https://anaconda.org/derickl/grandalf

ghost commented 5 years ago

@efiop , those are outside conda-forge (don't know if this is like the official distribution or something)

ryokugyu commented 5 years ago

Any update on it? @J0

efiop commented 5 years ago

Hi @derickl ! We've found your conda package for dvc and we were wondering if you would be willing to contribute your scripts to create an official dvc repo that we could help maintain and keep up to date?

PeterFogh commented 5 years ago

Thanks to all of you working on this. It would be awesome to have a conda dvc package, as I mainly use conda as my package manager. However, I would prefer to have the dvc package in the main or conda-forge channel, if possible.

GildedHonour commented 4 years ago

Help is needed on this, right? Whom can I discuss that with?

yfarjoun commented 4 years ago

I'm happy to talk as a user.

shcheklein commented 4 years ago

@GildedHonour we actually have a guy who is looking into this right now. Are you interested in helping us with this specific task, or do you just want to be involved and help DVC in general? Would be happy to discuss and find more stuff where we need more hands :)

GildedHonour commented 4 years ago

@shcheklein in general too. Yes, let's discuss.

shcheklein commented 4 years ago

@GildedHonour Alex, can you find me and/or Ruslan on dvc.org/chat (ivan and ruslan)? would be happy to chat.

GildedHonour commented 4 years ago

@shcheklein just done

efiop commented 4 years ago

https://github.com/conda-forge/staged-recipes/pull/8963 was merged. Dvc should be available through conda-forge now (https://github.com/conda-forge/dvc-feedstock), unless I'm missing something. Big thanks to @MaxRis 🎉

maxhora commented 4 years ago

@efiop unfortunately, the dvc package will only be uploaded to the conda-forge channel once we have the first successful CI build on the feedstock repo's master https://github.com/conda-forge/dvc-feedstock/commits/master (so far the build has failed because some dependencies have not been uploaded yet).

Another important thing is that only Python 2.7 and 3.6 builds are enabled for the dvc feedstock. To enable Python 3.7 builds, we will need to remove the restriction at https://github.com/conda-forge/dvc-feedstock/blob/master/recipe/meta.yaml#L14 , but before we can do that, Python 3.7 builds are required for all of DVC's dependencies.

efiop commented 4 years ago

@MaxRis Thanks for the clarification! Let's keep this open for now then.

maxhora commented 4 years ago

DVC 0.53.2 for Python 2.7 and 3.6 is available through conda-forge now!

conda install -c conda-forge dvc

maxhora commented 4 years ago

Python 3.7 build of dvc is available now!

The odd thing is that on Windows 10 I'm receiving the following error when trying to run dvc installed from conda-forge:

Fatal error in launcher: Unable to create process using '"c:\bld\dvc_1564563047081\_h_env\python.exe"  "C:\Users\max\Miniconda3\Scripts\dvc.exe" '

Will try to investigate this more.

maxhora commented 4 years ago

Finally, dvc 0.54.1 build 1 with all extra deps is available in conda-forge.

shcheklein commented 4 years ago

@MaxRis awesome stuff! Thanks. The only thing left before we close this ticket (finally) is a doc on how we support/update it in the future.

shcheklein commented 4 years ago

k, thanks, @MaxRis, we have all the docs ready now - https://github.com/iterative/dvc/wiki/Maintenance-of-Anaconda-package-in-conda-forge-channel

@efiop please take a look, and let's update our release checklist to include a step to upgrade requirements if necessary.

I think we are ready to close this issue at last 🎉

efiop commented 4 years ago

@shcheklein Added a quick one https://github.com/iterative/dvc/wiki/Release-checklist

shcheklein commented 4 years ago

thanks @efiop 🙏 :)