rocker-org / ml

experimental machine learning container
GNU General Public License v2.0

reticulate python rstudio #20

Open ryangarner opened 5 years ago

ryangarner commented 5 years ago

Can we get rocker/ml and rocker/ml-gpu to work with reticulate on RStudio?

library(reticulate)
py_config()

python:         /usr/local/bin/python
libpython:      /usr/lib/python3.5/config-3.5m-x86_64-linux-gnu/libpython3.5.so
pythonhome:     /usr:/usr
version:        3.5.3 (default, Sep 27 2018, 17:25:39) [GCC 6.3.0 20170516]
numpy:          /usr/local/lib/python3.5/dist-packages/numpy
numpy_version:  1.16.3

python versions found:
 /usr/local/bin/python
 /usr/bin/python
 /usr/bin/python3

py_install('pandas')

Error: Prerequisites for installing Python packages not available.

Please install the following Python packages before proceeding: pip, virtualenv

cboettig commented 5 years ago

@ryangarner Thanks for opening this issue, yeah, it would be nice if py_install would work out of the box. Note that reticulate is in fact already installed, as is pip3, but the python packages are installed system-wide using pip3 directly and not using virtualenv, which isn't much help for the user trying to install additional packages from R.

@noamross would love your thoughts on how best to go about this. In particular, as you know, Debian/Ubuntu use separate namespaces for python (2.7) and python3, and I haven't figured out how to get the reticulate functions from R to use the python3 versions for everything. We have both python versions installed on the image (actually RStudio pulls in both versions now), so while we could do something like symlink ln -s /usr/bin/pip3 /usr/local/bin/pip, ln -s /usr/bin/python3 /usr/local/bin/python, I'm not sure that's a good idea, and I'm not entirely sure what to do so that reticulate will find python3-virtualenv instead of python-virtualenv after installing it.

@choldgraf could probably set me straight on the best way to go about the python virtualenv setup here.

choldgraf commented 5 years ago

hmm - is the main question "how are environments set up with virtualenv in Python?" - e.g., is this a file paths problem?

cboettig commented 5 years ago

Thanks Chris, I guess this is really two questions:

Q1. What's the best way to set up a Python3 environment for Docker images?

As you know, Ubuntu/Debian distros expect users to explicitly request python3; calling just python or pip means Python 2. The default behavior of reticulate is to look for the python and pip binaries, i.e. to use Python 2. Presumably this can be changed in the reticulate config (e.g. use_python("/usr/bin/python3")), but I don't think that updates the paths for pip installs. Alternatively, we could go the symlink route. Note that the official tensorflow Dockerfiles make this configurable in build args, but also symlink python3 to /usr/local/bin/python so that it works without the 3, though I'm not sure why they chose to do so.
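A minimal sketch of the symlink route mentioned above, for illustration only. The helper name link_python3 is hypothetical, and whether overriding the unversioned names is a good idea on this image is exactly the open question here.

```shell
#!/bin/sh
# link_python3 BIN_DIR (hypothetical helper): make the unversioned names
# `python` and `pip` in BIN_DIR point at the python3 and pip3 binaries
# currently found on PATH.
link_python3() {
    bin_dir=$1
    for pair in python3:python pip3:pip; do
        # ${pair%%:*} is the versioned name, ${pair##*:} the unversioned one.
        src=$(command -v "${pair%%:*}") || {
            echo "skipping ${pair%%:*}: not found on PATH" >&2
            continue
        }
        # -f replaces an existing file or link with the same name.
        ln -sf "$src" "$bin_dir/${pair##*:}"
    done
}
```

In a Dockerfile this would be called as root with something like link_python3 /usr/local/bin, which mirrors the two ln -s commands discussed earlier in the thread.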

Q2. What's the best choice for managing python environments in our context -- pip, virtualenv, or conda? (and how do we get those working in python3 instead of python2 on debian?)

reticulate is happy to use any of these options. Currently we're just going pure pip, but then users cannot install additional packages without root. I suspect we should set things up to use virtualenv, though this raises a series of additional questions: (a) how do you get reticulate to use python3 when creating a virtualenv? (b) What's the best choice of home path for the virtualenv (e.g. we would at least like the same python env to be available to root and non-root users)? And (c) is virtualenv the best choice at all here? (E.g. Nick tells me we'd get better tensorflow performance using conda with Intel MKL instead.)

choldgraf commented 5 years ago

sorry for the slow response - I'm actually not a super expert on python paths so may not be the best person to ask, but my understanding is:

The simplest for generic data science workflows that might not involve Python packages is to use miniconda to handle environments, along with the conda-forge channel for Anaconda. The other option is to use virtualenv and system python w/ pip...it's much more light-weight, though it can be non-trivial to install certain kinds of packages (e.g. mapping packages that require non-python dependencies like fiona). You might get some inspiration from the base repo2docker template here: https://github.com/jupyter/repo2docker/blob/master/repo2docker/buildpacks/base.py#L14

I don't believe that you must use pip with root privileges. Couldn't you install using the --user flag? That'd install to a user directory instead of root.

@yuvipanda might have some ideas for the best path forward here as well!

ps: for the MKL stuff, that might be the case...I've had differing results using MKL vs. BLAS for linear algebra stuff - I think it depends a lot on the specific computation you're running

yuvipanda commented 5 years ago

If you're already using system python, my recommendation is:

  1. early on in the dockerfile, create a virtualenv (as root) - python3 -m venv /opt/venv
  2. Change the ownership of /opt/venv to your regular user, so they can install packages into it without extra effort. chown -R rstudio:rstudio /opt/venv.
  3. Modify PATH to include the 'bin' directory inside the virtualenv. This will make python, pip etc default to using the python inside the virtualenv, and hence python3. ENV PATH=/opt/venv/bin:${PATH}
  4. Install whatever base packages you want into this virtualenv (as your normal user): python3 -m pip install --no-cache-dir <packages> or python3 -m pip install --no-cache-dir -r requirements.txt. The --no-cache-dir helps reduce the size of your docker image. Note that this must be done as your normal user - accidentally doing this as root will cause issues.

This should work for 99% of use cases. The big reason to move away from this is if you want to use a version of python different from what is provided by your system python. If you need to use a newer version of python, my recommendation is to use miniconda to get just python, but still use a virtualenv for everything else.
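The four steps above could be sketched as shell functions, e.g. as the bodies of Dockerfile RUN instructions. This is a rough sketch rather than a tested image recipe: the function names and the VENV_DIR and VENV_USER variables are illustrative, while /opt/venv and the rstudio user come from the steps themselves.

```shell
#!/bin/sh
# Illustrative defaults; override by setting the variables beforehand.
VENV_DIR="${VENV_DIR:-/opt/venv}"
VENV_USER="${VENV_USER:-rstudio}"

# Steps 1 and 2 (run as root): create the virtualenv, then hand it to the
# regular user so later installs do not need root.
create_venv() {
    python3 -m venv "$VENV_DIR" &&
        chown -R "$VENV_USER:$VENV_USER" "$VENV_DIR"
}

# Step 3: put the virtualenv's bin directory first on PATH so that plain
# `python` and `pip` resolve to the virtualenv, and hence to python3.
# In a Dockerfile this is the ENV PATH=/opt/venv/bin:${PATH} line instead.
activate_venv_path() {
    PATH="$VENV_DIR/bin:$PATH"
    export PATH
}

# Step 4 (run as the regular user, not root): install base packages,
# passing package names or `-r requirements.txt` as arguments.
install_base_packages() {
    python3 -m pip install --no-cache-dir "$@"
}
```

A build would call create_venv as root, then switch to the regular user before calling activate_venv_path and install_base_packages, so that files inside the virtualenv stay owned by that user.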

ryangarner commented 5 years ago

This is my quick fix Dockerfile to get reticulate to work properly. Hope this helps!

FROM rocker/ml-gpu

RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install curl -y
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN apt-get install python-virtualenv -y
RUN pip install virtualenv --upgrade

cboettig commented 5 years ago

@ryangarner thanks. Yup, I believe apt-get install python-virtualenv will install the python2 version, as I've commented above. I think you could condense your version into

RUN apt-get update && apt-get -y install python-virtualenv python-pip

(note that in general you want to have apt-get update and apt-get install on the same line in Dockerfiles, and avoid upgrade, to play nicely with caching).

If you wanted to stick with the python3 versions (Tensorflow plans to deprecate python 2 in the next year anyway) you'd do

RUN apt-get update && apt-get -y install python3-virtualenv python3-pip

but reticulate won't find pip or virtualenv then.

I quite like @yuvipanda's proposed workflow above, so I'll take a stab at that. In particular, it sounds like step 3 will make python == python3? Yuvi, is there any risk of that messing up other things that are using python2?

yuvipanda commented 5 years ago

@cboettig I made #21

yuvipanda commented 5 years ago

@cboettig it shouldn't mess anything up, since it's only for things that run with the specific PATH set (so things started by the user in this container). This is also how mybinder.org runs (python refers to python3 there), so I think it's ok!

cboettig commented 5 years ago

@ryangarner if you use

reticulate::virtualenv_install("/opt/venv", "pandas") 

things should work as expected. You may want to set reticulate::use_virtualenv("/opt/venv")

Not sure what is up with py_install() since it should basically be calling use_virtualenv under the hood, but somehow its error handler is checking and failing to find the virtualenv first. Still investigating...

yuvipanda commented 5 years ago

linking https://github.com/rstudio/reticulate/issues/496 as related.

cboettig commented 5 years ago

Thanks Yuvi! Digging a bit more, this seems to be a problem in the reticulate source code inside py_install(), which assumes binaries are in ("/usr/bin", "/usr/local/bin", path.expand("~/.local/bin")) and not PATH. I've opened a separate issue here: https://github.com/rstudio/reticulate/issues/499#issuecomment-491643997
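To make the difference concrete, here is a small shell sketch contrasting a lookup restricted to a fixed directory list with a PATH-based lookup. It is not reticulate's code: find_in_dirs is a hypothetical helper that only mimics the behavior described above.

```shell
#!/bin/sh
# find_in_dirs NAME DIR... (hypothetical helper): print the first DIR/NAME
# that is an executable file, ignoring PATH entirely.
find_in_dirs() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Lookup limited to the fixed directories listed in the comment above.
find_in_dirs pip /usr/bin /usr/local/bin "$HOME/.local/bin" ||
    echo "pip not found in the fixed directories"

# PATH-based lookup, which also sees /opt/venv/bin/pip once that
# directory has been added to PATH.
command -v pip || echo "pip not found on PATH"
```

With pip installed only in /opt/venv/bin, the first lookup fails while the second succeeds, which would be consistent with py_install() reporting missing prerequisites even though pip is available to the user.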