rabernat opened 5 years ago
Any thoughts on this, anyone?
https://github.com/jupyter/repo2docker/pull/638 implements this in repo2docker.
I'm going to re-open this with the goal of reducing our image sizes. The base image is still at 950 MB compressed and 2.7 GB pulled:

```
pangeo/base-notebook    2019.06.24    e51d49f3c1ed    7 hours ago    2.7GB
```
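If anyone wants to see where the bulk comes from, `docker history` breaks a pulled image down per layer (a quick sketch against the tag above):

```bash
# show each layer of the image with the command that created it and its size
docker history pangeo/base-notebook:2019.06.24
```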
Some related resources and discussion:

- https://jcrist.github.io/conda-docker-tips.html
- https://github.com/pangeo-data/pangeo-cloud-federation/pull/305
- jupyter/repo2docker#714
@scottyhq are you planning to work on this? I can put some time into it if you want (ironically, while waiting for my new images to be pulled 😄)
The issue tracking repo2docker speed-ups, smaller images, etc. is https://github.com/jupyter/repo2docker/pull/707.
I'm not going to be working on this myself in the near future, so any contributions are welcome ;) Just wanted to connect some dots.
Looked into this a bit last night. In terms of what pangeo-stacks can fix itself, https://github.com/pangeo-data/pangeo-stacks/pull/116 is the biggest offender I think.
I'm going a bit further up the stack now. There's a lot in the base image from repo2docker that we probably don't need. For example, we have two installs of nodejs / npm: one from apt-get in `/usr/local` and one in the conda notebook env. I'll investigate a bit more before reporting those upstream.
> In terms of what pangeo-stacks can fix itself, #116 is the biggest offender I think.
As for what is not directly related to pangeo-stacks, something we (I?) could work on is resuming the static library split on conda-forge. These are some numbers I presented at the Seattle pangeo meeting for a basic geospatial env with conda-forge:
```
1.8G GEO            # old
1.7G CONDA_GEO      # current (only a few splits, stopped at libnetcdf)
452M no-static_GEO  # removing all .a files from the env
```
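A quick way to check how much of an env the static libraries account for before deleting anything (a sketch assuming the env lives under `/opt/conda`):

```bash
# sum the sizes of all static libraries in the env
find /opt/conda/ -follow -type f -name '*.a' -print0 \
  | du -ch --files0-from=- \
  | tail -n 1
```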
Note that one can already remove all the `.a` files from the envs with something like:

```bash
find /opt/conda/ -follow -type f -name '*.a' -delete
```

If you are not doing that already, you can definitely try. The splitting in conda-forge will give you the choice to have those files when you need them, though (which is rare, but not zero).
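For the Dockerfiles here, the kind of post-install cleanup the conda-docker-tips post linked above recommends looks roughly like this (a sketch; adjust the prefix to wherever conda lives in the image):

```bash
# run once at the end of the image build, after all conda installs
conda clean -afy                                           # tarballs, index caches, and the pkgs dirs
find /opt/conda/ -follow -type f -name '*.a' -delete       # static libraries
find /opt/conda/ -follow -type f -name '*.pyc' -delete     # bytecode caches (regenerated on import)
find /opt/conda/ -follow -type f -name '*.js.map' -delete  # JS source maps (only useful for debugging)
```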
> We have two installs of nodejs / npm. One from apt-get in `/usr/local` and one in the conda notebook env.
Thanks for reviving this discussion @TomAugspurger. For some perspective on nodejs coming from two places, see this discussion on repo2docker a while back: https://github.com/jupyter/repo2docker/pull/728. It might now be possible to just get it from conda-forge if you want to re-raise the issue there.
Another way to reduce size is to remove everything related to `qt`; cloud Jupyter deployments rarely need it. To avoid pulling it in, one must substitute `jupyter_core` for `jupyter` in the first layer of this stack (repo2docker?). Here are some numbers:
```
271M JUPYTER_AND_JUPYTERLAB
192M JUPYTER-CORE_AND_JUPYTERLAB
```

~80 MB difference.
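For anyone who wants to reproduce that comparison, roughly (the env names and the `/opt/conda` prefix are just placeholders):

```bash
# build one env from the jupyter metapackage and one from jupyter_core only
conda create -y -n with-jupyter jupyter jupyterlab
conda create -y -n with-jupyter-core jupyter_core jupyterlab

# compare the resulting env sizes
du -sh /opt/conda/envs/with-jupyter /opt/conda/envs/with-jupyter-core
```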
Also, the pangeo-notebook is pulling the mkl package (`mkl-2019.5 | 205.2 MB`)! See the logs or pull the 'latest' image: https://github.com/pangeo-data/pangeo-stacks/runs/428527057?check_suite_focus=true
Is there an easy way to see which package pulls in mkl as a dependency? It's not listed explicitly in our environment.yml: https://github.com/pangeo-data/pangeo-stacks/blob/master/pangeo-notebook/binder/environment.yml
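One low-tech way might be to grep the env's metadata: every package conda installs leaves a JSON record in `conda-meta/` that lists its dependencies (a sketch assuming the notebook env under `/srv/conda`; adjust the prefix as needed):

```bash
# list the metadata files whose "depends" entries mention mkl
# (mkl's own record will also match; ignore that one)
grep -lE '"mkl[" >=]' /srv/conda/envs/notebook/conda-meta/*.json
```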
You can typically include the `nomkl` package, which should prevent it from being pulled.
> You can typically include the `nomkl` package, which should prevent it from being pulled.
I was investigating this, but it looks like both workarounds, `nomkl` and installing `blas=*=openblas`, no longer work. Not sure what is happening, but we added `mkl` to conda-forge recently and that may be the culprit.
Update on the `mkl` package problem here. Adding `nomkl` won't work because conda-forge's blas implementation drifted a little bit from defaults (the whole discussion is in our gitter channel if someone is interested). We will probably solve that with https://github.com/conda-forge/staged-recipes/pull/10922. Note that the conda-forge `nomkl` will not remove/prevent `mkl` from getting into the env here! It will only cause a conflict with the package that is pulling `mkl`. We can then debug why that package is doing it. (Like: do we have an openblas version of it? Or is the `mkl` variant getting precedence over the openblas variant due to an error? Etc.)
Our docker images are storing about 2.6 GB worth of conda packages in `/srv/conda/pkgs`. Do we actually need this? Can we clean it out and drastically reduce the size of the images?
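If that's the standard package cache, it should be safe to drop once the envs are built: conda hardlinks files from `pkgs/` into each env where it can, so the env copies survive the deletion. Something like:

```bash
du -sh /srv/conda/pkgs  # confirm what the cache is holding
conda clean -afy        # remove tarballs, index caches, and the pkgs dirs themselves
```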