pangeo-data / pangeo-docker-images

Docker Images For Pangeo Jupyter Environment
https://pangeo-docker-images.readthedocs.io
MIT License

Package Discussion #28

Open TomAugspurger opened 4 years ago

TomAugspurger commented 4 years ago

What packages belong in a "default" pangeo metapackage? Currently pangeo-notebook has essentially dask + jupyterhub + jupyterlab (see https://github.com/conda-forge/pangeo-notebook-feedstock/blob/master/recipe/meta.yaml). IMO, there's value in having a minimal metapackage.

There's also value in a "useful" metapackage that includes things like

In pangeo-stacks we called this pangeo-notebook: https://github.com/pangeo-data/pangeo-stacks/blob/a8cf6aefa36800301977390a785d06edac9b915e/pangeo-notebook/binder/environment.yml.

Also, what should we call this? Perhaps just pangeo?

cc @rabernat @scottyhq @jhamman

TomAugspurger commented 4 years ago

Here's my proposed list (deliberately too small, to provoke discussion):

  # dask, jupyterlab
  - pangeo-notebook
  # core scipy packages
  - numpy
  - scipy
  - matplotlib-base
  - pandas
  - xarray
  - sparse
  - sympy
  # intake-related
  - intake
  - intake-xarray
  - intake-esm
  - fsspec
  - intake-stac
  # zarr-related
  - zarr
  - gcsfs
  - s3fs

Notably absent are

I think some of those should be included in this kitchen sink package, but I'm not sure which.

jkingslake commented 4 years ago

Hi @TomAugspurger A few questions: When you say Xarray-adjacent, do you mean the optional dependencies listed here?

What are the main advantages of keeping the list small? Is that it makes opening the binder faster? Or less likely to run into conflicts?

Personally, nc-time-axis would be useful, but I'm not sure how widely it's needed.

From my perspective as someone who is very new to pangeo (and Python, actually) and who is spreading the word to colleagues with no prior knowledge, there is a big advantage in being able to send someone a link and have the binder start up without needing to install any extra packages. This is an obvious point I guess, but I thought I would emphasize it from my novice point of view.

willirath commented 4 years ago

> Personally, nc-time-axis would be useful, but I'm not sure how widely it's needed.

+1 for this one. In the Ocean and Climate modelling world, non-standard calendars are the real standard.

TomAugspurger commented 4 years ago

> When you say Xarray-adjacent, do you mean the optional dependencies listed here?

I didn't have any specific ones in mind.

> there is a big advantage in being able to send someone a link and have the binder start up without needing to install any extra packages.

Agreed. I think we can be fairly broad with what ends up in the "kitchen sink" pangeo-notebook docker image.

TomAugspurger commented 4 years ago

As part of the ocean.pangeo.io fixup, we're looking to remove the environment build step for deployments in pangeo-cloud-federation and just use the pangeo-notebook docker image.

I went through ocean's environment.yaml. Of the packages there and not in the pangeo-notebook image, there are three main types of packages

  1. Packages added for OHW19 in https://github.com/pangeo-data/pangeo-cloud-federation/pull/373
     - compliance-checker
     - ciso
     - cc-plugin-ncei
     - ctd
     - geolinks
     - gridgeo
     - ioos-tools
     - pocean-core
     - podaccpy
     - retrying
     - unyt
     - utide
     - xlrd
  2. Packages added to help the solver (e.g. https://github.com/pangeo-data/pangeo-cloud-federation/pull/574)
     - fiona
     - ipython
     - netcdf4
     - setuptools
  3. Actually useful packages added by a user.
     - ciso
     - fastjmd95
     - nc-time-axis
     - netcdf4
     - pyarrow
     - xcape
     - xlayers
     - xmitgcm
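For anyone repeating this comparison on another hub, here is a minimal sketch of the "in the hub environment but not in the image" check. It assumes both dependency lists have already been read into plain lists of conda spec strings; the helper names `package_name` and `missing_from_image` are hypothetical, and the two input lists are illustrative samples of names from this thread, not the real contents of either environment.

```python
def package_name(spec: str) -> str:
    """Strip version pins such as 'pyarrow=0.17' or 'xarray>=0.15' to the bare name."""
    for sep in ("=", "<", ">", "!", " "):
        spec = spec.split(sep, 1)[0]
    return spec.strip().lower()


def missing_from_image(env_specs, image_specs):
    """Return names requested by a hub environment but absent from the image."""
    image_names = {package_name(s) for s in image_specs}
    return sorted({package_name(s) for s in env_specs} - image_names)


# Illustrative inputs only (assumption): a few names mentioned in this thread.
image_specs = ["pangeo-notebook", "numpy", "xarray", "zarr", "intake-esm"]
ocean_specs = ["xarray>=0.15", "nc-time-axis", "xmitgcm", "numpy", "pyarrow=0.17"]

print(missing_from_image(ocean_specs, image_specs))
# prints ['nc-time-axis', 'pyarrow', 'xmitgcm']
```

The sketch only handles string specs; a nested `pip:` section in an environment file would need separate handling.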

My proposal is to add the actually useful packages to pangeo-notebook.

There are also a few borderline ones:

Does anyone have objections to adding that list of "useful" packages to pangeo-notebook?

rabernat commented 4 years ago

> My proposal is to add the actually useful packages to pangeo-notebook.

:+1: in general... But this list seems to be incomplete. I would say that these are the important ones:

- ciso
- xgcm
- xrft
- xhistogram
- xlayers
- xcape
- git+https://github.com/xgcm/fastjmd95.git
- git+https://github.com/NCAR/intake-esm.git

These should be upstreamed to pangeo-notebook

- pyarrow
- netcdf4
- nc-time-axis

TomAugspurger commented 4 years ago

Sorry, I missed these when copy-pasting

I don't see xrft or xhistogram in ocean.pangeo.io's environment. But I can add them.

intake-esm is in the pangeo-notebook environment already, but it's installed from conda-forge rather than GitHub. If it's OK, I'd prefer to push projects to issue releases, and only install from source when necessary. Likewise for fastjmd95 (though it's not currently in pangeo-notebook).

rabernat commented 4 years ago

> I don't see xrft or xhistogram in ocean.pangeo.io's environment.

They have definitely been there in the past!

> If it's OK I'd prefer to push on projects to issue releases, and only install from source as necessary.

:+1:

> Likewise for fastjmd95 (though it's not currently in pangeo-notebook).

I'm going to release it now.

rabernat commented 4 years ago

fastjmd95 now released on pip. Is conda vs. pip important?

TomAugspurger commented 4 years ago

It'll be the only one not from conda-forge, but it doesn't matter too much for a pure-python package I think. We can move it if / when it becomes available on conda-forge.
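As a rough illustration of what "the only one not from conda-forge" looks like in practice, the sketch below renders a conda environment file in which pip-only packages sit in a nested `pip:` section under `dependencies`. The function name `render_environment`, the environment name `pangeo-example`, and the short package lists are assumptions for the example, not the actual pangeo-notebook environment.

```python
def render_environment(name, conda_packages, pip_packages=()):
    """Render a conda environment.yml as text, with pip-only packages nested under 'pip:'."""
    lines = [f"name: {name}", "channels:", "  - conda-forge", "dependencies:"]
    lines += [f"  - {pkg}" for pkg in conda_packages]
    if pip_packages:
        # pip itself should be listed as a conda dependency when a pip section is used.
        if "pip" not in conda_packages:
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


# Hypothetical example: two conda-forge packages plus one pip-only package.
text = render_environment("pangeo-example", ["xarray", "xgcm"], ["fastjmd95"])
print(text)
```

Moving a package to conda-forge later would then just mean moving its name from the second list to the first.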

scottyhq commented 4 years ago

Thanks @TomAugspurger and @rabernat for pushing this forward. I think there is a lot of value in using exactly the same image across cloud deployments, and that is the current intention with pangeo-notebook. I'm also wary of large image size and troubleshooting inevitable package conflicts as the list of desirable packages grows. For example, here is the list of packages used by request during the recent icesat2 hackweek https://github.com/ICESAT-2HackWeek/jupyter-image-2020/blob/master/environment.yml.

> These should be upstreamed to pangeo-notebook

In my opinion, the current meta-package should include just the minimum set of packages needed to launch a dask gateway cluster and connect to the labextension dashboard. We should consider whether it's worthwhile creating additional meta-packages and/or renaming things to make this more obvious (such as pangeo-notebook --> pangeo-ui or pangeo? and then separately you could have a pangeo-analysis or pangeo-ocean metapackage).
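One way to picture the layered metapackage idea is a small registry in which each metapackage lists either concrete packages or other metapackages, flattened on demand. This is only a sketch: the names pangeo-analysis and pangeo-ocean come from the suggestion above, but their contents here, and the `expand` helper, are made up for illustration.

```python
# Hypothetical layering (assumption): contents are illustrative, not a proposal.
METAPACKAGES = {
    "pangeo-notebook": ["dask", "dask-gateway", "dask-labextension",
                        "jupyterlab", "jupyterhub"],
    "pangeo-analysis": ["pangeo-notebook", "numpy", "xarray", "zarr"],
    "pangeo-ocean": ["pangeo-analysis", "xgcm", "xmitgcm", "nc-time-axis"],
}


def expand(name, registry=METAPACKAGES, _seen=None):
    """Flatten a metapackage into the set of concrete packages it pulls in."""
    seen = set() if _seen is None else _seen
    if name in seen:  # already visited; also guards against circular definitions
        return set()
    seen.add(name)
    if name not in registry:  # a concrete package, not a metapackage
        return {name}
    result = set()
    for dep in registry[name]:
        result |= expand(dep, registry, seen)
    return result


print(sorted(expand("pangeo-ocean")))
```

With this layout the minimal image stays small, and a domain image only adds its own short list on top of the shared base.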

If image building is dropped from pangeo-cloud-federation, it's also possible to include additional domain or hub-specific images (e.g. hub-aws-uswest2, hub-gcp-uscentral1b) in this repository and refactor the CI to build images independently.