pangeo-data / pangeo-stacks

Curated Docker images for use with Jupyter and Pangeo
https://pangeo-data.github.io/pangeo-stacks/
BSD 3-Clause "New" or "Revised" License

clean out package dir to reduce image size? #22

Open rabernat opened 5 years ago

rabernat commented 5 years ago

Our docker images are storing about 2.7 GB worth of conda packages in /srv/conda/pkgs:

$ du -h -d1 /srv/conda
4.0K    /srv/conda/envs
2.7G    /srv/conda/pkgs
4.0K    /srv/conda/compiler_compat
31M     /srv/conda/bin
128K    /srv/conda/etc
4.0K    /srv/conda/conda-bld
25M     /srv/conda/conda-meta
539M    /srv/conda/lib
12K     /srv/conda/x86_64-conda_cos6-linux-gnu
7.6M    /srv/conda/include
8.0K    /srv/conda/ssl
8.0K    /srv/conda/man
314M    /srv/conda/share
12K     /srv/conda/shell
92K     /srv/conda/libexec
412K    /srv/conda/sbin
640K    /srv/conda/mkspecs
8.0K    /srv/conda/condabin
20K     /srv/conda/docs
12K     /srv/conda/translations
36K     /srv/conda/doc
332K    /srv/conda/plugins
4.0K    /srv/conda/phrasebooks
156K    /srv/conda/qml
12K     /srv/conda/var
3.6G    /srv/conda

Do we actually need this? Can we clean it out and drastically reduce the size of the images?
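For reference, the usual fix when the env is built inside the image is to run conda clean in the same layer that creates the env, so the cache never lands in a committed layer. A sketch, assuming a hypothetical environment.yml and env name:

```dockerfile
# Hypothetical Dockerfile fragment: create the env and drop the package
# cache (tarballs and extracted dirs under the pkgs directory) in one RUN
# layer, so the cache is never baked into the image
RUN conda env create -f environment.yml -n notebook \
 && conda clean --all --yes
```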

rabernat commented 5 years ago

Any thoughts on this anyone?

yuvipanda commented 5 years ago

https://github.com/jupyter/repo2docker/pull/638 implements this in repo2docker.

jhamman commented 5 years ago

Closed by https://github.com/jupyter/repo2docker/pull/638

scottyhq commented 5 years ago

I'm going to re-open this with the goal of reducing our image sizes. The base image is still 950 MB compressed and 2.7 GB pulled:

pangeo/base-notebook 2019.06.24 e51d49f3c1ed 7 hours ago 2.7GB

Some related resources and discussion:

- https://jcrist.github.io/conda-docker-tips.html
- https://github.com/pangeo-data/pangeo-cloud-federation/pull/305
- jupyter/repo2docker#714

TomAugspurger commented 5 years ago

@scottyhq are you planning to work on this? I can put some time into it if you want (ironically, while waiting for my new images to be pulled 😄)

betatim commented 5 years ago

The issue to track for repo2docker speed-ups, smaller images, etc. is https://github.com/jupyter/repo2docker/pull/707

scottyhq commented 5 years ago

I’m not going to be working on this myself in the near future, so any contributions are welcome ;) I just wanted to connect some dots.

TomAugspurger commented 4 years ago

Looked into this a bit last night. In terms of what pangeo-stacks can fix itself, https://github.com/pangeo-data/pangeo-stacks/pull/116 is the biggest offender I think.

I'm going a bit further up the stack now. There's a lot in the base image from repo2docker that we probably don't need.

  1. R2D base image uses buildpacks:bionic (397MB). This is probably larger than we need since it has things like GCC.
  2. We have two installs of nodejs / npm. One from apt-get in /usr/local and one in the conda notebook env.
  3. Conda env: There are a few libraries that may not be appropriate for a base image, if we're going for minimal size
    • nbconvert (120 MB). From the base r2d notebook env. Brings in pandoc & pandoc-citeproc
    • nteract_on_jupyter (142 MB). From the base r2d env. Not sure why it's so large yet.

I'll investigate a bit more before reporting those upstream.

ocefpaf commented 4 years ago

In terms of what pangeo-stacks can fix itself, #116 is the biggest offender I think.

In terms of what is not directly related to pangeo-stacks, what we (I?) could work on is resuming the static library split on conda-forge.

These are some numbers I presented at the Seattle pangeo meeting for a basic geospatial env with conda-forge:

1.8G    GEO  # old
1.7G    CONDA_GEO  # current (only a few splits, stopped at libnetcdf)
452M    no-static_GEO  # removing all `.a` files from the env.

Note that one can already remove all the .a files from the envs with something like:

find /opt/conda/ -follow -type f -name '*.a' -delete

If you are not doing that already, you can definitely try it. The package splitting in conda-forge will give you the choice to install those files when you need them (which is rare, but not never).
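As a self-contained illustration of the find invocation above (using a throwaway /tmp tree in place of a real conda prefix):

```shell
# Build a fake prefix with one static and one shared library
mkdir -p /tmp/fake-conda/lib
touch /tmp/fake-conda/lib/libfoo.a /tmp/fake-conda/lib/libfoo.so

# Delete static archives only; shared libraries are left in place
find /tmp/fake-conda -follow -type f -name '*.a' -delete
```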

scottyhq commented 4 years ago

  2. We have two installs of nodejs / npm. One from apt-get in /usr/local and one in the conda notebook env.

Thanks for reviving this discussion @TomAugspurger. For some perspective on nodejs coming from two places, see this discussion on repo2docker from a while back: https://github.com/jupyter/repo2docker/pull/728. It might now be possible to get it just from conda-forge if you want to re-raise the issue there.

ocefpaf commented 4 years ago

Another way to reduce size is to remove everything related to qt. Cloud Jupyter deployments rarely need it.

To achieve that, one must substitute jupyter_core for jupyter in the first layer of this stack (repo2docker?). Here are some numbers:

271M    JUPYTER_AND_JUPYTERLAB
192M    JUPYTER-CORE_AND_JUPYTERLAB

~80 MB difference.
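The substitution would look roughly like this in the base environment spec (a sketch only; the actual spec lives in repo2docker, and the packages shown are just the two metapackages being compared):

```yaml
dependencies:
  # - jupyter      # metapackage: pulls in qtconsole and therefore qt
  - jupyter_core   # core machinery only, no qt
  - jupyterlab
```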

scottyhq commented 4 years ago

Also, the pangeo-notebook is pulling the mkl package (mkl-2019.5 | 205.2 MB)! See the logs or pull the 'latest' image: https://github.com/pangeo-data/pangeo-stacks/runs/428527057?check_suite_focus=true

Is there an easy way to see what package pulls in mkl as a dependency? It's not listed explicitly in our environment.yml: https://github.com/pangeo-data/pangeo-stacks/blob/master/pangeo-notebook/binder/environment.yml
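One low-tech way to answer this is to grep the per-package metadata in the env's conda-meta directory for an mkl dependency (the conda-tree helper package's whoneeds subcommand is another option). A sketch using fabricated metadata under /tmp in place of the real conda-meta directory:

```shell
# Fabricate a conda-meta directory (the real one sits under the conda prefix)
mkdir -p /tmp/conda-meta
printf '{"name": "numpy", "depends": ["mkl >=2019.4"]}\n' > /tmp/conda-meta/numpy-1.17.3-py37_0.json
printf '{"name": "dask", "depends": ["numpy"]}\n' > /tmp/conda-meta/dask-2.9.0-py_0.json

# List packages whose metadata declares a direct dependency on mkl
grep -l '"mkl ' /tmp/conda-meta/*.json
```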

TomAugspurger commented 4 years ago

You can typically include the nomkl package, which should prevent it from being pulled.
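In an environment.yml that would look like the sketch below (defaults-style conventions; as the comments that follow note, this stopped working reliably with conda-forge at the time):

```yaml
dependencies:
  - nomkl            # opt-out metapackage from defaults
  - blas=*=openblas  # alternative: pin the blas build string to openblas
  - numpy
```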

ocefpaf commented 4 years ago

You can typically include the nomkl package, which should prevent it from being pulled.

I was investigating this, but it looks like both workarounds, nomkl and installing blas=*=openblas, no longer work. Not sure what is happening, but we added mkl to conda-forge recently and that may be the culprit.

ocefpaf commented 4 years ago

Update on the mkl package problem here. Adding nomkl won't work because conda-forge's blas implementation has drifted a little from defaults' (the whole discussion is in our gitter channel if anyone is interested).

We will probably solve that with https://github.com/conda-forge/staged-recipes/pull/10922. Note that the conda-forge nomkl will not remove mkl from, or prevent it from getting into, the env here! It will only cause a conflict with the package that is pulling mkl in. We can then debug why that package is doing it (e.g., do we have an openblas version of it? Is the mkl variant getting precedence over the openblas variant due to an error? etc.).