robdmc / pandashells

:panda_face: Bringing the python data stack to the shell prompt

Docker container environment. #22

Open sthysel opened 9 years ago

sthysel commented 9 years ago

Docker container environment that provides all dependencies for pandashells.

robdmc commented 9 years ago

Thanks for the PR. I haven't worked much with Docker. Would you know how to go about implementing a solution for issue #26?

sthysel commented 9 years ago

For X access I typically use the -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=unix$DISPLAY pattern. So with the existing Dockerfile (in the PR):

docker run -it -v "$(pwd)/data:/data" -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=unix$DISPLAY pandashells

That should take care of X, the third request in #26.

The first and second items in #26 are already supported by the PR.

I don't know about non-Unix environments, sorry. I have access to a Mac to see what can be done, but I suspect it won't be as trivial as mounting an X socket in the container, unfortunately...

So, in short, the PR as is should take care of #26 on Linux at least, with maybe a bit of usage documentation.
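As a rough illustration of the usage documentation mentioned above, the socket-mount pattern could be wrapped in a small helper that prints the command for inspection before it is run. This is only a sketch: the function name pandashells_x11_cmd is hypothetical, the image tag and data directory are assumptions, and falling back to :0 when DISPLAY is unset is a guess that will not suit every setup.

```shell
#!/usr/bin/env bash
# Sketch only: print the docker command that shares the host X11 socket (Linux).
# pandashells_x11_cmd is a hypothetical helper name, not part of the PR.
pandashells_x11_cmd() {
    local image="${1:-pandashells}"    # image tag (assumed)
    local data_dir="${2:-$PWD/data}"   # host directory mounted at /data
    local display="${DISPLAY:-:0}"     # assumption: fall back to :0 if DISPLAY is unset
    echo docker run -it \
        -v "$data_dir:/data" \
        -v /tmp/.X11-unix:/tmp/.X11-unix \
        -e "DISPLAY=unix$display" \
        "$image"
}

# Print the command for the default image and ./data directory.
pandashells_x11_cmd
```

Printing rather than executing keeps the helper safe to try on a machine without Docker; once the output looks right, the printed command can be pasted into the shell.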

hayd commented 9 years ago

Perhaps it's better to use miniconda rather than pip? I think that removes all the dependencies too.

sthysel commented 9 years ago

I have no experience with miniconda or the conda ecosystem. My gut reaction is to avoid non-standard tools as they tend to lead you down the garden path. I don't think the tool-chain should be welded to a third party non-standard distribution. In this specific case, using conda does not improve matters as far as I can see; it just introduces a rather large external non-aligned dependency for little gain.

Using docker as a dependency management mechanism negates the need for miniconda or similar tools and keeps everything standard. And nothing prevents you from having multiple containers with different dependency versions, which is what conda seems to buy you.

hayd commented 9 years ago

Valid concerns. I was just taking this from the readme:

We strongly recommend using Anaconda to run Pandashells.

Since most (?) of these deps now offer a wheel, the pip install should be a lot faster than in the past - it used to be that conda was a lot faster. pip installing scipy used to take forever... (perhaps this is less of an issue nowadays??)


If we were to use miniconda IIUC it would be something like:

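RUN wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
# /opt/miniconda is an arbitrary install prefix; ENV cannot rely on $HOME being set
RUN bash miniconda.sh -b -p /opt/miniconda
# ENV persists across layers, whereas a `RUN export` is lost when its layer ends
ENV PATH="/opt/miniconda/bin:$PATH"
# -y answers the confirmation prompt, which would otherwise stall the build
RUN conda install -y numpy astroML ...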

see also https://github.com/chrish42/docker-miniconda/blob/master/Dockerfile

sthysel commented 9 years ago

Yes, there would be no issue using conda in a container I'm sure, if that is what's required. I'm arguing that it's not really necessary - as far as I can tell, not knowing what other benefits it may bring. I'd imagine Anaconda is suggested because it probably does alleviate a lot of the problems that docker avoids altogether, and maybe wheel has improved matters also... Maybe I'm just uncomfortable using a third party chain that, to me, does not bring any real benefits I can see.

I'd think that people that use Anaconda would prefer sticking to that, others may prefer something more generic and flexible like docker.

hayd commented 9 years ago

You're still using a third party chain in pypi. The question IMO is purely performance.

sthysel commented 9 years ago

PyPI is provided and managed by the Python Software Foundation, just like Python, and pip is part of the standard Python distribution. I don't think that depending on those tools qualifies as 'third party' in the same way Continuum does. But maybe that's just pedantic.

I don't think performance is an issue either. Once the dependencies are built and packaged the container work is done. Pulling a complete pandashells environment takes me less than a minute:

docker pull sthysel/pandashells

Yes, building it from scratch using the Dockerfile in the PR takes about 10 minutes. That's just one data point, and I have not compared it with a conda based solution of course.

Assuming, of course, I understand what you mean by performance.
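For anyone wanting more than one data point, a rough way to measure this is to time each step with the same small wrapper. The sketch below is not from the PR: elapsed is a hypothetical helper name, and it only reports whole seconds of wall-clock time.

```shell
#!/usr/bin/env bash
# Sketch only: run a command and print how many whole seconds it took.
# elapsed is a hypothetical helper name.
elapsed() {
    local start end
    start=$(date +%s)          # seconds since the epoch before the command
    "$@" >/dev/null 2>&1       # run the command, discarding its output
    end=$(date +%s)            # seconds since the epoch after the command
    echo $((end - start))
}

# Harmless demonstration: prints the whole seconds spent sleeping.
elapsed sleep 1
```

With Docker installed, `elapsed docker pull sthysel/pandashells` and `elapsed docker build -t pandashells .` would give figures comparable to the ones quoted above. Note that already-downloaded layers and the build cache will make repeat runs look much faster than a first run.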

hayd commented 9 years ago

Yep, sorry I meant the speed of installation.

robdmc commented 9 years ago

Good discussion. I advocate using conda in the documentation because it's the least painful way to get everything working properly, especially for the casual user. The dev-ops guys at my day job, however, echo the wariness of conda that @sthysel expressed.

So although I'm a big fan of conda in general, I think having a Dockerfile that only uses pip would be useful since it will naturally take care of all the install pain, and will provide a good example of how to install Pandashells without conda. I'm waiting until the X forwarding problem (at least to a mac) is solved before merging this PR. (It'll probably be at least a week or two before I have time to look into this myself.)
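To make the pip-only idea concrete, here is a minimal sketch of what such a Dockerfile might contain, written out by a small shell helper. This is not the Dockerfile from the PR: the helper name write_pip_dockerfile, the python:3 base image, and the /data working directory are all assumptions chosen for illustration, and a real image would likely need extra packages for plotting.

```shell
#!/usr/bin/env bash
# Sketch only: write a minimal pip-only Dockerfile into the given directory.
# write_pip_dockerfile is a hypothetical helper name, not part of the PR.
write_pip_dockerfile() {
    local dir="${1:-.}"
    mkdir -p "$dir"
    cat > "$dir/Dockerfile" <<'EOF'
# Assumed base image; any image that ships Python and pip would do.
FROM python:3
# Install Pandashells from PyPI using pip only (no conda).
RUN pip install pandashells
# Assumed mount point for host data, matching the -v .../data:/data usage.
WORKDIR /data
CMD ["bash"]
EOF
}
```

Calling `write_pip_dockerfile ./pandashells-pip` and then `docker build -t pandashells ./pandashells-pip` would produce an image to experiment with.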