pangeo-data / pangeo-docker-images

Docker Images For Pangeo Jupyter Environment
https://pangeo-docker-images.readthedocs.io
MIT License

Does onbuild work with these new images? #60

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I would like to extend the pangeo-notebook image, as we used to do in the old system. I made the following repo: https://github.com/rabernat/poseidon-bot/tree/binder with the following Dockerfile:

FROM pangeo/pangeo-notebook:9d0723d

plus an environment.yaml file. But it just ignores the environment.yaml file.

Is this "onbuild" capability no longer supported? If not, how do we recommend extending the images?

Binder: https://binder.pangeo.io/v2/gh/rabernat/poseidon-bot/binder

TomAugspurger commented 4 years ago

@rabernat ~I think only the base-image image supports an environment.yaml (@scottyhq can confirm?)~ (I think I was incorrect)

Also, dunno if it matters, but it may need to be environment.yml rather than environment.yaml.

So if you want the pangeo-notebook packages, I think you need to list pangeo-notebook in your environment.yaml.

Trying this out at https://binder.pangeo.io/v2/gh/TomAugspurger/poseidon-bot/binder / https://github.com/TomAugspurger/poseidon-bot/tree/binder

rabernat commented 4 years ago

> Also, dunno if it matters, but it may need to be environment.yml rather than environment.yaml.

If this is the reason, 🤦

Thanks for looking into it!

scottyhq commented 4 years ago

@TomAugspurger @rabernat, sorry I didn't see this issue until now b/c I wasn't 'watching' the repository! I thought that would happen by default.

> Is this "onbuild" capability no longer supported?

Correct, no longer supported.

> If not, how do we recommend extending the images?

The nearest thing to onbuild is to use the base image rather than one of the notebook images, so just change your Dockerfile to:

FROM pangeo/base-image:9d0723d

This puts the responsibility on the binder creator to add all the necessary sidecar files. You don't need a lock file, and can just modify environment.yml from here: https://github.com/pangeo-data/pangeo-docker-images/tree/master/pangeo-notebook

TomAugspurger commented 4 years ago

@scottyhq with that I see

Checking for 'postBuild'...
/srv/conda/envs/notebook/lib/python3.8/site-packages/traitlets/config/loader.py:795: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(key) is 1:
/srv/conda/envs/notebook/lib/python3.8/site-packages/traitlets/config/loader.py:804: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(key) is 1:
Enabling: nbgitpuller
- Writing config: /srv/conda/envs/notebook/etc/jupyter
    - Validating...
      nbgitpuller 0.8.0 OK
rm: cannot remove '/tmp/*': No such file or directory
Removing intermediate container c86e1dad71a0

I think if the postBuild file doesn't create any files then https://github.com/pangeo-data/pangeo-docker-images/blob/9d0723dbb375fe728a44985e4ae4ae961677890b/base-image/Dockerfile#L115 will fail. Will push a fix shortly.
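The failure mode can be reproduced outside the image. The sketch below assumes, based only on the `rm: cannot remove '/tmp/*'` line in the log above, that the cleanup step is a bare `rm` on a glob; the linked Dockerfile line is not reproduced here. The helper name `run_cleanup` and the scratch directory are illustrative, and `rm -f` is just one possible remedy, not necessarily what the pending fix does.

```python
import subprocess
import tempfile


def run_cleanup(command: str, directory: str) -> int:
    """Run `command` on every entry of `directory` via the shell.

    Returns the shell's exit code. The glob is expanded by `sh`, so an
    empty directory leaves the literal pattern `<directory>/*` in place
    and the command receives a path that does not exist.
    """
    result = subprocess.run(
        ["sh", "-c", f'{command} "$1"/*', "sh", directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


with tempfile.TemporaryDirectory() as scratch:
    # The empty scratch directory stands in for /tmp when postBuild
    # has not created any files there.
    bare_rm = run_cleanup("rm", scratch)
    forced_rm = run_cleanup("rm -f", scratch)

print("rm exit code:", bare_rm)       # non-zero: would abort a Docker RUN step
print("rm -f exit code:", forced_rm)  # zero: the unmatched pattern is ignored
```

A non-zero exit code in a `RUN` step stops the Docker build, which is consistent with the build ending right after the `rm` message.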

TomAugspurger commented 4 years ago

@rabernat seems to work as of https://github.com/TomAugspurger/poseidon-bot/tree/binder.

https://binder.pangeo.io/v2/gh/TomAugspurger/poseidon-bot/binder.

That at least builds and I can import xgcm.

https://github.com/TomAugspurger/poseidon-bot/blob/fc78c14e2fb0c87f9aa4cc411a494bd8d0f8d323/postBuild#L6 will be unneeded when https://github.com/pangeo-data/pangeo-docker-images/pull/61 is merged.

scottyhq commented 4 years ago

Also, just to clarify, if you want all the pangeo-notebook packages + others like xgcm in https://github.com/TomAugspurger/poseidon-bot/tree/binder, you'll need to

1) append your additional packages to a copy of the pangeo-notebook environment.yml in your binder environment.yml (https://github.com/TomAugspurger/poseidon-bot/blob/binder/environment.yml)

2) Add the standard jupyterlab extensions from pangeo-notebook postBuild.yml to your binder postBuild (https://github.com/TomAugspurger/poseidon-bot/blob/binder/postBuild).
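Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not tooling from the repository: the function name `extend_environment` is hypothetical, the `base` dictionary is a heavily shortened stand-in for the real pangeo-notebook environment.yml, and the environment is handled as an already-parsed dictionary so that no YAML library is required.

```python
from copy import deepcopy


def extend_environment(base_env: dict, extra_packages: list) -> dict:
    """Return a copy of `base_env` with `extra_packages` appended.

    Packages whose name already appears in the base list are skipped, so
    the base environment's own version pins are left untouched.
    """
    env = deepcopy(base_env)

    def name_of(spec: str) -> str:
        # Conda specs look like "name", "name=1.2" or "name>=1.2";
        # keep only the package name.
        for sep in ("=", "<", ">", "!", " "):
            spec = spec.split(sep, 1)[0]
        return spec.strip().lower()

    # Non-string entries (such as a nested "pip:" section) are ignored here.
    existing = {name_of(d) for d in env["dependencies"] if isinstance(d, str)}
    for spec in extra_packages:
        if name_of(spec) not in existing:
            env["dependencies"].append(spec)
            existing.add(name_of(spec))
    return env


# Hypothetical, shortened stand-in for pangeo-notebook's environment.yml.
base = {
    "name": "pangeo",
    "channels": ["conda-forge"],
    "dependencies": ["python=3.7", "xarray", "dask"],
}

binder_env = extend_environment(base, ["xgcm", "dask"])
print(binder_env["dependencies"])
# ['python=3.7', 'xarray', 'dask', 'xgcm']
```

Skipping names that are already present keeps a second, possibly conflicting, spec for the same package out of the binder environment.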

rabernat commented 4 years ago

I appreciate the quick responses and clarifications. Sorry for being slow to understand how this all fits together.

I miss the ability to extend pangeo-notebook. I thought that was a very useful and convenient way to work. I don't miss keeping track of long environment.yaml files. I hope we can find a way to bring this back somehow.

rabernat commented 4 years ago

It is also not clear to me whether I gain anything by including a Dockerfile with FROM pangeo/base-image:9d0723d. Since I have to enumerate all the packages and add a postBuild anyway, isn't it just simpler to use a normal binder?

scottyhq commented 4 years ago

> It is also not clear to me whether I gain anything by including a Dockerfile with FROM pangeo/base-image:9d0723d. Since I have to enumerate all the packages and add a postBuild anyway, isn't it just simpler to use a normal binder?

You gain a faster build and resulting image that is about 1/2 the size with 1/2 the layers. The key difference is a single conda solve+install instead of installing additional packages into an existing environment. But sure, you can drop the Dockerfile and stick with repo2docker condabuildpack if you prefer.

> I miss the ability to extend pangeo-notebook. I thought that was a very useful and convenient way to work. I don't miss keeping track of long environment.yaml files. I hope we can find a way to bring this back somehow.

This was discussed at various points over the last couple of months in https://github.com/pangeo-data/pangeo-docker-images/issues/2. The original design had that ability. But essentially there is a tradeoff between the convenience of onbuild layering and the transparency of explicitly listing what goes into an environment under one folder.

TomAugspurger commented 4 years ago

Can we get the best of both worlds by making a pangeo-kitchen-sink (name TBD) metapackage with the contents of https://github.com/pangeo-data/pangeo-docker-images/blob/master/pangeo-notebook/environment.yml, and then someone wishing to customize that with a few packages has:

# Dockerfile
FROM pangeo/base-image:tag

# environment.yml
name: pangeo
channels:
  - conda-forge
dependencies:
  - pangeo-kitchen-sink=2020.04.14
  - my-custom-package

scottyhq commented 4 years ago

@TomAugspurger - I think it could be worthwhile to do that, but it's hard to decide what goes into the kitchen sink... hopefully folks will chime in about that here https://github.com/pangeo-data/pangeo-docker-images/issues/28.

This issue makes me think we should change the label of base-image to base-image-onbuild given the best practices on naming (https://docs.docker.com/develop/develop-images/dockerfile_best-practices). Hopefully that would also help clarify that the notebook images do not have onbuild commands baked into them.

rabernat commented 4 years ago

I tried following this advice and creating a new binder based on pangeo-notebook.

It's here:

The environment differs from pangeo-notebook only by about 5 extra packages at the end: https://github.com/pangeo-gallery/cmip6/blob/abd71d5e62c7d8bde5ec22f896846277463009ad/binder/environment.yml#L55-L59

This binder will not build. The conda environment can't be solved. The build ends with about 10000 messages like this:

Package zstd conflicts for:
tiledb-py -> tiledb[version='>=1.7.7,<1.8.0a0'] -> zstd[version='1.3.2|1.3.3|>=1.3.3,<1.3.4.0a0|>=1.3.7,<1.3.8.0a0|>=1.4.0,<1.4.1.0a0|>=1.4.3,<1.4.4.0a0|>=1.4.4,<1.4.5.0a0']
rasterio -> libgdal[version='>=3.0.4,<3.1.0a0'] -> zstd[version='>=1.3.7,<1.3.8.0a0|>=1.4.4,<1.4.5.0a0']
geopandas -> pysal -> zstd
python-blosc -> blosc[version='>=1.16.3,<2.0a0'] -> zstd[version='>=1.3.7,<1.3.8.0a0']

Package wheel conflicts for:
python=3.7 -> pip -> wheel
pip=20 -> wheel

Package backports conflicts for:
numcodecs -> backports.lzma -> backports
matplotlib-base -> backports.functools_lru_cache -> backports

Package markupsafe conflicts for:
pydap -> jinja2 -> markupsafe[version='>=0.23']
intake -> jinja2 -> markupsafe[version='>=0.23']
Note that strict channel priority may have removed packages required for satisfiability.

Not sure how to move forward. Any advice from the repo2docker gurus would be appreciated.

TomAugspurger commented 4 years ago

@rabernat I think that's strictly a conda issue. If I had to guess it's the mix of

 - pangeo-notebook=2020.04.25
 - distributed=2.15.1

since pangeo-notebook pins distributed exactly (via pangeo-dask), you end up with two different exact pins.

Conda really should be able to provide a better error message, but it's apparently somewhat hard to do generally.

Bumping to pangeo-notebook=2020.04.28 and removing distributed should do the trick.
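The kind of clash described here can be spotted before handing the file to conda. The sketch below is an illustration only: `find_conflicting_pins` is a hypothetical helper, it understands only simple `name=version` specs, and the version that pangeo-notebook=2020.04.25 pins for distributed is not stated in this thread, so the value `2.14.0` is a made-up placeholder.

```python
def find_conflicting_pins(requested: list, metapackage_pins: dict) -> dict:
    """Return {package: (requested_version, metapackage_version)} for clashes.

    `requested` holds specs from environment.yml, `metapackage_pins` maps a
    package name to the exact version a metapackage already pins.
    """
    conflicts = {}
    for spec in requested:
        name, sep, version = spec.partition("=")
        if not sep:
            continue  # unpinned request, cannot clash with an exact pin
        version = version.lstrip("=")  # tolerate "name==version"
        pinned = metapackage_pins.get(name)
        if pinned is not None and pinned != version:
            conflicts[name] = (version, pinned)
    return conflicts


requested = ["pangeo-notebook=2020.04.25", "distributed=2.15.1", "xgcm"]

# ASSUMPTION: placeholder pin, not the real content of pangeo-notebook=2020.04.25.
pins_from_metapackage = {"distributed": "2.14.0"}

print(find_conflicting_pins(requested, pins_from_metapackage))
# {'distributed': ('2.15.1', '2.14.0')}
```

Dropping the explicit `distributed` line, as suggested, leaves only the pin that comes from the metapackage, so there is nothing left to clash.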

rabernat commented 4 years ago

Yes, that fixed it! Thanks @TomAugspurger! :+1:

rabernat commented 4 years ago

However, the dask dashboard is still inaccessible. I get a 404 error.

scottyhq commented 4 years ago

Just tried this with pangeo/pangeo-notebook:2020.04.28 on Rich's example notebook on AWS and I'm also seeing 404 trying to connect to the dashboard. Jupyter pod logs show:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/websocket.py", line 97, in get
    return await self.http_get(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 359, in http_get
    return await self.proxy(port, proxied_path)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 225, in proxy
    response = await client.fetch(req, raise_error=False)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/iostream.py", line 1226, in connect
    self.socket.connect(address)
OSError: [Errno 99] Cannot assign requested address
LabApp - ERROR - Uncaught exception GET /user/scottyhq-pangeodev-binder-9xnng5u1/proxy/8787/individual-plots.json?1588175640199 (192.168.59.217)
HTTPServerRequest(protocol='https', host='hub.aws-uswest2-binder.pangeo.io', method='GET', uri='/user/scottyhq-pangeodev-binder-9xnng5u1/proxy/8787/individual-plots.json?1588175640199', version='HTTP/1.1', remote_ip='192.168.59.217')
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/tcpclient.py", line 143, in on_connect_done
    stream = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/websocket.py", line 97, in get
    return await self.http_get(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 359, in http_get
    return await self.proxy(port, proxied_path)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 225, in proxy
    response = await client.fetch(req, raise_error=False)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/iostream.py", line 1226, in connect
    self.socket.connect(address)
OSError: [Errno 99] Cannot assign requested address

scottyhq commented 4 years ago

This is a helpful command/URL to see the packages changed between tags: https://github.com/pangeo-data/pangeo-docker-images/compare/2020.04.22..2020.04.28#diff-fc143938f9485967d3be2239526ec787
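The same kind of link can be built for any pair of tags. A minimal sketch, assuming the two-dot compare form used in the URL above; the function name `compare_url` is illustrative, and the `#diff-...` fragment that jumps to a specific file is left out because it depends on the file being viewed.

```python
def compare_url(old_tag: str, new_tag: str,
                repo: str = "pangeo-data/pangeo-docker-images") -> str:
    """Build a GitHub compare URL showing the changes from old_tag to new_tag."""
    return f"https://github.com/{repo}/compare/{old_tag}..{new_tag}"


print(compare_url("2020.04.22", "2020.04.28"))
# https://github.com/pangeo-data/pangeo-docker-images/compare/2020.04.22..2020.04.28
```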

@rabernat I'm guessing there is still an issue with distributed 2.15.1 since tornado and jupyter_server_proxy haven't changed. As a short-term solution can you just drop back to pangeo-notebook=2020.04.22?

jhamman commented 4 years ago

just cross referencing https://github.com/conda-forge/pangeo-notebook-feedstock/pull/15 which may be important here soon.