pangeo-data / pangeo-eosc

Pangeo for the European Open Science cloud
https://pangeo-data.github.io/pangeo-eosc/
MIT License

Dask gateway configuration problems on pangeo-xxlarge platform #3

Closed guillaumeeb closed 1 year ago

guillaumeeb commented 1 year ago

I think this has already been said, but as I'm currently reviewing notebooks on the infrastructure, I thought I'd open issues to note the problems.

So first, the Dashboard link is not working.

Clicking on the generated Dashboard link, for instance Dashboard: [/services/dask-gateway/clusters/daskhub.e9bff8eab5134c32a5db353c5655c1f1/status](https://pangeo-xxlarge.vm.fedcloud.eu/services/dask-gateway/clusters/daskhub.e9bff8eab5134c32a5db353c5655c1f1/status) leads to a 404 error.

Connecting a client to the cluster generates a version mismatch:

/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py:1274: VersionMismatchWarning: Mismatched versions found

+---------+----------------+----------------+----------------+
| Package | client         | scheduler      | workers        |
+---------+----------------+----------------+----------------+
| lz4     | 4.0.0          | None           | None           |
| pandas  | 1.4.2          | None           | None           |
| python  | 3.9.13.final.0 | 3.10.5.final.0 | 3.10.5.final.0 |
+---------+----------------+----------------+----------------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

It's probably because the Docker image used by JupyterHub for the single-user notebook and the one used by dask-gateway for the workers are not the same.
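As a minimal sketch, the mismatch could be spotted programmatically once the version information is available as plain dictionaries. The `find_mismatches` helper is hypothetical. The flat dictionaries are an assumption made for illustration (the output of `Client.get_versions()` in `distributed` is nested differently), and the values are copied from the table above.

```python
def find_mismatches(client, scheduler, workers):
    """Return {package: (client, scheduler, workers)} for packages whose versions differ."""
    mismatches = {}
    for package in sorted(set(client) | set(scheduler) | set(workers)):
        versions = (client.get(package), scheduler.get(package), workers.get(package))
        # More than one distinct value means at least one side disagrees.
        if len(set(versions)) > 1:
            mismatches[package] = versions
    return mismatches


# Values taken from the VersionMismatchWarning table in this issue.
client_versions = {"lz4": "4.0.0", "pandas": "1.4.2", "python": "3.9.13.final.0"}
scheduler_versions = {"lz4": None, "pandas": None, "python": "3.10.5.final.0"}
worker_versions = {"lz4": None, "pandas": None, "python": "3.10.5.final.0"}

mismatches = find_mismatches(client_versions, scheduler_versions, worker_versions)
for package, versions in mismatches.items():
    print(package, versions)
```

All three packages are reported here: `lz4` and `pandas` are missing on the scheduler and workers, and the Python minor version differs.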

tinaok commented 1 year ago

Yes, it is because the Docker image used for JupyterLab and the image used for the Dask workers are not the same. But the warning did not prevent the job from running when I last tried. (And it is a good explanation we can use in the tutorial to show and understand distributed computing.)

What was problematic with this configuration when I tried it for the tutorial last week:

@j34ni or @annefou might have some update on this, but your experience with kubectl / JupyterHub might help?

j34ni commented 1 year ago

I agree that the fact that the Dask Gateway uses a password instead of JupyterHub to authenticate is an issue for the longer term. However, I find it an advantage for the workshop because we will be able to shut down clusters left running by participants (or multiple clusters opened by mistake) and hence release resources.
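A rough sketch of how such a cleanup might look with the `dask_gateway` client, not the procedure actually used for the workshop. The `connect` and `stop_all_clusters` helpers are hypothetical, and the gateway address, username and password are placeholders to be supplied. As far as I understand, a gateway only lists the clusters owned by the authenticated user, so this assumes connecting with the participant's username and the shared password.

```python
def connect(address, username, password):
    """Open a gateway connection using simple (password) authentication."""
    # Imported lazily so the helper below can be used without dask-gateway installed.
    from dask_gateway import Gateway
    from dask_gateway.auth import BasicAuth

    return Gateway(address=address, auth=BasicAuth(username=username, password=password))


def stop_all_clusters(gateway):
    """Stop every cluster visible to `gateway` and return the names that were stopped."""
    stopped = []
    for report in gateway.list_clusters():
        gateway.stop_cluster(report.name)
        stopped.append(report.name)
    return stopped
```

Usage would be along the lines of `stop_all_clusters(connect(address, username, password))`, with all three arguments depending on the deployment.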

j34ni commented 1 year ago

@guillaumeeb: The dashboard link now works with the latest versions of the setup (pangeo-foss4g, for instance) when we also install Grafana

guillaumeeb commented 1 year ago

> Yes, it is because the Docker image used for JupyterLab and the image used for the Dask workers are not the same. But the warning did not prevent the job from running when I last tried. (And it is a good explanation we can use in the tutorial to show and understand distributed computing.)

As I said in https://github.com/pangeo-data/foss4g-2022/issues/20#issuecomment-1212222842, I really think the images should at least be the same. Even if in this case the versions are sufficiently close for the Client/Cluster to work, this really is bad practice and can often cause unwanted errors.

About the dask-gateway authentication, I concur with @j34ni: this is really not an issue for the workshop, but it should be addressed in the longer term.

And if the Dashboard link now works, that's great! @j34ni should I go back to the pangeo-foss4g instance to test things?

guillaumeeb commented 1 year ago

@j34ni I finally logged in to the front VM of the pangeo-foss4g deployment. Looking at the values.yaml file produced by the following command:

sudo helm get values daskhub -n daskhub

It looks like dask-gateway is not enabled on this instance, is that correct?

dask-gateway:
  enabled: false
  gateway:
    auth:
      simple:
        password: pangeo_dask
      type: simple
dask-kubernetes:
  enabled: true
jupyterhub:
  hub:
    baseUrl: /jupyterhub/
    config:
      GenericOAuthenticator:
        allowed_groups:
        - urn:mace:egi.eu:group:vo.pangeo.eu:role=member#aai.egi.eu
        authorize_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/auth
        claim_groups_key: eduperson_entitlement
        client_id: id
        client_secret: secret
        login_service: EGI Check-In
        oauth_callback_url: https://pangeo-foss4g.vm.fedcloud.eu/jupyterhub/hub/oauth_callback
        scope:
        - openid
        - email
        - profile
        - eduperson_entitlement
        token_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/token
        userdata_params:
          state: state
        userdata_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/userinfo
        username_key: preferred_username
      JupyterHub:
        authenticator_class: generic-oauth
  ingress:
    annotations:
      kubernetes.io/ingress.class: nginx
    enabled: true
  proxy:
    secretToken: hash
  singleuser:
    cpu:
      guarantee: 2
      limit: 4
    image:
      name: pangeo/ml-notebook
      tag: latest
    memory:
      guarantee: 4G
      limit: 16G
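For reference, a small sketch of how the same check could be scripted. It assumes the values are exported as JSON (for example with `helm get values daskhub -n daskhub -o json`). The `enabled_backends` helper is hypothetical, and the sample string reproduces only the two `enabled` flags shown above.

```python
import json


def enabled_backends(values_json):
    """Report which Dask cluster backends are enabled in exported Helm values."""
    values = json.loads(values_json)
    return {
        name: bool(values.get(name, {}).get("enabled", False))
        for name in ("dask-gateway", "dask-kubernetes")
    }


# Only the two flags relevant here, copied from the values above.
sample = '{"dask-gateway": {"enabled": false}, "dask-kubernetes": {"enabled": true}}'
backends = enabled_backends(sample)
print(backends)
```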

Would it be possible to run some tests on one instance or the other, or would you prefer to keep things as they are? Currently, I don't have access to the pangeo-xxlarge platform.

j34ni commented 1 year ago

@guillaumeeb: I did not manage to get a working infrastructure with EGI Check-in, a dask-gateway and increased CPU & memory limits all at the same time. The values.yaml you produced is what was in my email from Tue 2022-08-09 10:30. Feel free to modify it and run as many tests as you want on pangeo-foss4g.

guillaumeeb commented 1 year ago

Closing this one as solved.