googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

Consider installing conda by default #1376

Closed lakshmanok closed 6 years ago

lakshmanok commented 7 years ago

A number of scientific packages cannot be installed from PyPI and are instead installed using conda/miniconda. It would be very helpful if conda were present in the Docker image by default.

See also: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/conda

nikhilk commented 7 years ago

We may want to consider conda as the way to get all of our packages and dependencies (including Python), i.e. go beyond just having it be present in the image.

yan-hic commented 7 years ago

@nikhilk Has this gotten any traction? I want to install fastparquet, but it fails with pip.

umang-sh commented 7 years ago

I would like to work on this pull request

yebrahim commented 7 years ago

@umang-sh Please feel free.

nikhilk commented 7 years ago

Note that this is likely a significant change/cleanup to our Docker image, starting with how Python is set up, as we'll want to use Python via miniconda.

Here is my example of a minimal (but old) miniconda setup - https://github.com/nikhilk/containers/blob/master/ipython/Dockerfile

umang-sh commented 7 years ago

@nikhilk @yebrahim @chmeyers I am taking this up and starting work on it; I will comment if I need any help or info :) This is going to be a major change indeed.

umang-sh commented 7 years ago

@nikhilk @chmeyers @yebrahim Hi guys, a few questions arose while I was working on this:

1. All the Python dependencies are in setup.py in pydatalab, right?
2. With this change, we want to move those dependencies to the Dockerfile via conda, so setup.py may or may not be used, since conda will take care of them?

Also, let me know what approach you foresee other than this.

Thanks, Umang

nikhilk commented 7 years ago

setup.py is about installing the library anywhere, so it should continue to list the required set of dependencies.

I believe we'll need to completely redo how the docker image is built to use conda instead of what is done right now.

nikhilk commented 7 years ago

Adding reference to Stack Overflow Q - https://stackoverflow.com/questions/47025059/install-conda-package-from-google-datalab

d-wasserman commented 6 years ago

Thanks for the reference, I found this thread this way. I would also be interested to see this as a feature. Is someone still working on a PR for this?

umang-sh commented 6 years ago

@Holisticnature I am working on the PR for this :)

umang-sh commented 6 years ago

@nikhilk From what I understand of the current Docker image flow, the build.sh scripts in the different directories, together with prepare.sh and run.sh, are the crucial files for the build. Among these, I only see pip used in run.sh to install pydatalab, and no other Python dependency installed anywhere in the flow. Correct me if I am wrong.

If we build the Dockerfile with conda and integrate it properly with the current flow, that would work, right? Or is there any other script we need to consider as well? Once this is clear, I will start building the Dockerfile with the correct flow.

Thanks :)

yan-hic commented 6 years ago

Just released: https://research.google.com/colaboratory/unregistered.html

I have not tested it yet, but would anyone know if it allows installing packages, and if so, with pip or conda?


nikhilk commented 6 years ago

Moving to conda entails redoing the apt-get and pip install steps in the Dockerfile (e.g. https://github.com/googledatalab/datalab/blob/master/containers/base/Dockerfile). To get a sense of what I am alluding to, look at the definitions in https://github.com/nikhilk/containers, where conda is used to install various packages.

MaxGhenis commented 6 years ago

@yiga2: Google Colaboratory supports package installation via pip, not conda.

yan-hic commented 6 years ago

Thanks Max. Well then, hope the effort of switching/adding conda to datalab could extend to Colaboratory...

AFAIAC I am a happy camper as fastparquet is now part of pandas 0.21 so 'pippable'


d-wasserman commented 6 years ago

I would appreciate Colaboratory support for conda as well. I am finding that pip gets me 95% of what I need, but some of the support libraries in the UDST toolkit, for example, don't make it...

MaxGhenis commented 6 years ago

As some broader context, Microsoft Azure Notebooks run Anaconda, and Jupyter "strongly recommend[s] installing Python and Jupyter using the Anaconda Distribution" on its install page.

umang-sh commented 6 years ago

Hi guys, sorry for the long delay.

I have made a sample Dockerfile with this change. Please have a look at it here:

https://github.com/umang-sh/containers

@nikhilk @yebrahim Thanks

rileyjbauer commented 6 years ago

Hello!

@umang-sh Thank you for working on this! I'm a new member of the Datalab team, and I've been looking at migrating from pip to conda for a little while now. I have a close-to-functional branch, but sorting out all of the dependency and environment issues for both Python 2 and 3 looks like it will require some fairly significant changes to the Dockerfile, as well as some minor changes to other scripts involved in the build process, all of which I'm trying to fit within a single refactor commit. To that end, I'm going to reassign this issue to myself.

rileyjbauer commented 6 years ago

Closing this issue, as conda support should be working as of #1923.

MaxGhenis commented 6 years ago

Awesome!! Is there documentation on this yet? I haven't tried it, but would be interested in knowing how to install a package from conda, e.g. https://stackoverflow.com/q/47025059.

MaxGhenis commented 6 years ago

Checking in here, as the help page for Adding Python libraries to a Cloud Datalab instance doesn't mention conda.

MaxGhenis commented 6 years ago

This worked, though it was pretty slow:

!conda install -c ospc taxcalc --yes

--yes is needed to bypass the prompt asking to install dependencies.
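
For anyone who prefers to script this from a notebook cell rather than use the `!` shell escape, here is a minimal Python sketch of the same command. The helper names `conda_install_command` and `conda_install` are hypothetical and not part of Datalab or conda; the only conda flags used are `-c` and `--yes`, as in the command above. It assumes conda is on the PATH of the kernel's environment and simply returns `None` if it is not.

```python
import shutil
import subprocess


def conda_install_command(package, channel=None, yes=True):
    """Build the argument list for a `conda install` call (hypothetical helper)."""
    cmd = ["conda", "install"]
    if channel:
        # -c selects the channel to install from, e.g. "ospc"
        cmd += ["-c", channel]
    cmd.append(package)
    if yes:
        # --yes skips the confirmation prompt, which would otherwise
        # block a non-interactive notebook cell
        cmd.append("--yes")
    return cmd


def conda_install(package, channel=None):
    """Run the install; return the exit code, or None if conda is not on PATH."""
    if shutil.which("conda") is None:
        return None
    return subprocess.run(conda_install_command(package, channel)).returncode


# Prints the command that would be run, without running it.
print(" ".join(conda_install_command("taxcalc", channel="ospc")))
# prints: conda install -c ospc taxcalc --yes
```

Calling `conda_install("taxcalc", channel="ospc")` would then perform the actual install; a non-zero return value indicates that conda reported a failure.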

GitTorres commented 6 years ago

I'm late, but thanks for checking that, @MaxGhenis.

Will there be documentation on using conda with Datalab sometime? It would be good to know what the limitations are, such as whether (and how) we can run conda commands in a Docker container shell instead of a Jupyter notebook.

Edit: I found out how to access the Docker shell using this link: Working With Notebooks. Certain commands still don't work, such as conda update conda, which yields an HTTP error.