pangeo-data / helm-chart

Pangeo helm charts
https://pangeo-data.github.io/helm-chart/

use conda-forge packages over pip in Dockerfiles? #34

Closed rsignell-usgs closed 5 years ago

rsignell-usgs commented 6 years ago

I just tried to build the pangeo-worker container and ran into version conflicts with pyasn1. These conflicts went away when I moved urllib3 from the pip-installed packages to the conda-installed packages.

I thought the best practice was to use conda-forge packages if available and only use pip when conda packages were not available or when we need development stuff from git.

If that is true, should we move all conda-installable packages to the conda install list?
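The rule of thumb above (conda-forge when a package is available there, pip only otherwise or for development installs from git) can be sketched as a small helper. This is only an illustration: `split_requirements`, the example package list, and the `conda_available` set are hypothetical and are not taken from the actual Dockerfiles. It also assumes bare package names without version specifiers.

```python
def split_requirements(requirements, conda_available):
    """Split requirement strings into (conda_list, pip_list)."""
    conda_list, pip_list = [], []
    for req in requirements:
        if req.startswith("git+"):
            # Development installs straight from version control stay with pip.
            pip_list.append(req)
        elif req.lower() in conda_available:
            # Prefer the conda-forge build whenever one exists.
            conda_list.append(req)
        else:
            # Fall back to pip only when no conda package is available.
            pip_list.append(req)
    return conda_list, pip_list


# Hypothetical inputs, for illustration only.
requirements = [
    "urllib3",
    "xarray",
    "git+https://github.com/xgcm/xgcm.git",
    "some-pip-only-package",
]
conda_available = {"urllib3", "xarray"}

conda_list, pip_list = split_requirements(requirements, conda_available)
print("conda install -c conda-forge " + " ".join(conda_list))
print("pip install " + " ".join(pip_list))
```

Installing the whole conda list in a single command, and running pip afterwards, lets the conda solver see all of its requirements at once.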

jacobtomlinson commented 6 years ago

I think that would be a good idea

jhamman commented 6 years ago

IIRC, the rationale here was that we were trying to keep the size of the notebook/worker images as small as possible and that pip tends to provide smaller binaries.

rsignell-usgs commented 6 years ago

@ocefpaf and I are going to try and see how the docker image sizes look.

ocefpaf commented 6 years ago

> pip tends to provide smaller binaries.

Interesting. I would like to know where that happens so we can optimize on our side. Usually wheels are not only larger but also bring a copy of the same library with every package that depends on it, due to the "static linking." For example, installing fiona and rasterio from wheels will add two copies of libgdal, libgeos, etc., whereas installing the conda versions will install those dependencies only once:

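[image: wheel] https://user-images.githubusercontent.com/950575/41176875-6d3434b0-6b38-11e8-8c58-79de4f34bab4.jpg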

The info above may be outdated with respect to rasterio and fiona: it seems that they are now stripping the binaries and creating smaller packages for libgdal (~50 MB instead of ~100 MB; that happened when I posted this on Twitter :smile:). Still, the overall result is wasteful due to the static linking, and we can also strip the binaries on the conda-forge side.

PS: for pure Python packages the size should be virtually the same unless we add extra dependencies on the conda-forge side. That can be solved with a package-core + metapackage, like what we did for dask.
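One rough way to check the duplication described above is to scan an environment for shared libraries that appear more than once. The sketch below is a heuristic, not an established tool: `duplicated_libraries` and `wasted_bytes` are hypothetical helper names, it only looks at Linux-style `lib*.so*` files, and it assumes bundled copies are named like `libgdal-<hash>.so.20`.

```python
from collections import defaultdict
from pathlib import Path


def duplicated_libraries(root):
    """Map library name -> list of (path, size) for names found more than once."""
    found = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name.startswith("lib") and ".so" in path.name:
            # Reduce "libgdal-1a2b3c.so.20" or "libgdal.so.20.3" to "libgdal".
            name = path.name.split(".so")[0].split("-")[0]
            found[name].append((str(path), path.stat().st_size))
    return {name: copies for name, copies in found.items() if len(copies) > 1}


def wasted_bytes(duplicates):
    """Estimate the bytes saved if each library were shipped only once."""
    total = 0
    for copies in duplicates.values():
        sizes = sorted(size for _, size in copies)
        # Keep the largest copy and count every other copy as waste.
        total += sum(sizes[:-1])
    return total
```

Calling `duplicated_libraries(sys.prefix)` inside the image and passing the result to `wasted_bytes` gives an approximate figure to compare between a pip-based and a conda-based build.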

mrocklin commented 6 years ago

This issue may interest you: https://github.com/conda/conda/issues/6756

mrocklin commented 6 years ago

Ah, moved to https://github.com/ContinuumIO/anaconda-issues/issues/8242

ocefpaf commented 6 years ago

Thanks @mrocklin! As I thought, each package is a different story and there is some work to do, from stripping the binaries to using some optimization flags. However, the question here is about these packages (https://github.com/pangeo-data/helm-chart/blob/master/docker-images/notebook/Dockerfile#L43-L50) that are installed with pip. It looks to me that all but xgcm could be installed from conda.

rabernat commented 6 years ago

We would like to provide a conda package for xgcm. I just created an issue for this: https://github.com/xgcm/xgcm/issues/100

ocefpaf commented 6 years ago

> We would like to provide a conda package for xgcm. I just created an issue for this: xgcm/xgcm#100

I can assist you in creating the conda package on conda-forge, but the pip install from git there gave me the impression that you always wanted the latest master on each build, which is a no-no for a conda package. We need stable releases.

rabernat commented 6 years ago

Yes, we would also like to do a stable release!

Xgcm is growing from an experiment to a real package that people use. We need to take some time to do some maintenance and sort these things out.

rsignell-usgs commented 6 years ago

This Dockerfile uses only the conda-forge channel to build a pangeo-notebook image: https://github.com/rsignell-usgs/helm-chart/blob/conda-forge/docker-images/notebook/Dockerfile. Because we used a single conda-forge channel and fed all the packages to conda in one shot, the requirements resolved cleanly and we didn't need to pin any versions.

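We verified on http://pangeo.esipfed.org that the following work:

  • datashader/holoviews/geoviews
  • xarray parallel analysis of cloud-optimized geotiff image data via rasterio (thanks @jhamman!)
  • the regular examples using xarray with distributed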

BTW, the uncompressed notebook container size is 4.4GB, compressed size 1.0GB: https://hub.docker.com/r/esip/pangeo-notebook/tags/ That compressed size looks about the same as the notebook container at: https://hub.docker.com/r/pangeo/notebook/tags/

This was all due to the awesome conda-forge work of @ocefpaf !

mrocklin commented 6 years ago

Nice work!

rabernat commented 6 years ago

@rsignell-usgs that's great news!

I would love to get an example of the analysis you described above contributed to the new use case gallery. It can be submitted via PR to this repo as a fully executed notebook, so very little effort is required beyond what you have already done.

Also, we should consider updating the main pangeo docker image / helm chart with the new environment you have created. It sounds like it has a lot more functionality but is not much bigger than what we are using now.

mrocklin commented 6 years ago

Is the pip-to-conda conversion done somewhere? Should this be submitted as a PR to the docker recipes in this repository?

guillaumeeb commented 5 years ago

From what I understand, the main work here has been done. So could we try to move to conda-built Docker images?

Is this duplicated by #61?

jhamman commented 5 years ago

I think we're mostly there. We may want to consider removing the Dockerfiles from the helm chart repo altogether. One idea that somewhat interests me is creating one or more curated images that can be used for pangeo. Recently, we've been moving to a hubploy system that builds images with r2d (repo2docker) as part of a CI/CD system.

guillaumeeb commented 5 years ago

So where would you want to store the Docker images?

The idea I have for this repo is for it to be the base for deploying Pangeo on any K8S-enabled cluster, be it on the public cloud or elsewhere. I think it is great to build images using a CI/CD system, but can we use hubploy only for the image-building part, without fully deploying a chart on some cluster?

jhamman commented 5 years ago

I'm thinking of a repository similar to https://github.com/dask/dask-docker with a directory for each image. The images could be defined by a Dockerfile (as they are here) or a binder spec (as they are in the hubploy cases). In either case, we could add a simple CI/CD script to build the images using repo2docker on CircleCI (or similar) and push them to Docker Hub.

BTW, I'm fully on board with having a nice standalone helm chart and not relying only on the hubploy approach. My point here is that the notebook/worker image is somewhat ancillary to the chart itself and can be cleanly separated into its own thing.
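The one-directory-per-image layout could be driven by a very small CI script. The sketch below only constructs the repo2docker command lines and does not run them. `build_commands`, the `registry_user` value, and the tag are hypothetical, and it assumes the `--no-run`, `--image-name`, and `--push` options of the `jupyter-repo2docker` CLI.

```python
from pathlib import Path


def build_commands(repo_root, registry_user, tag):
    """Return one repo2docker command (as an argument list) per image directory."""
    commands = []
    for image_dir in sorted(Path(repo_root).iterdir()):
        # Skip plain files and hidden directories such as .git.
        if not image_dir.is_dir() or image_dir.name.startswith("."):
            continue
        image_name = f"{registry_user}/{image_dir.name}:{tag}"
        commands.append([
            "jupyter-repo2docker",
            "--no-run",                  # build only, do not start a server
            "--image-name", image_name,  # name and tag of the resulting image
            "--push",                    # push to the registry after building
            str(image_dir),
        ])
    return commands
```

A CI job could pass each returned list to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the script easy to test without Docker installed.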

jacobtomlinson commented 5 years ago

I also agree. The helm chart and docker image can be cleanly separated and the hubploy setup can be separate too. You just end up with a dependency chain.

guillaumeeb commented 5 years ago

That sounds good to me too! And yes we will have to think of the dependency chain and avoid duplications.

jhamman commented 5 years ago

So Project Jupyter already does this: https://github.com/jupyter/docker-stacks

Maybe we try to emulate what they are doing (or just team up with them for a few base images).

guillaumeeb commented 5 years ago

I've created a separate issue in the main pangeo tracker for the new repo, as you can see. I think we can work on conda-based images in parallel.

jacobtomlinson commented 5 years ago

@jhamman makes an interesting point. Project Jupyter already has some specialized notebook images for different tasks. Perhaps they would accept a pangeo notebook too?

guillaumeeb commented 5 years ago

@jacobtomlinson I believe it would be better to add your comment on https://github.com/pangeo-data/pangeo/issues/525, don't you think?