Creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard

tinaok commented 1 year ago

This issue is for coordinating the creation of the pangeo-eosc infrastructure based on an elastic Kubernetes Virtual Cluster using IM-Dashboard.

tinaok commented 1 year ago

Here is what I have in mind as a to-do list. Any thoughts, @guillaumeeb and @j34ni?

- [x] We need resources to test the elastic option, i.e. we need to shut down some instances to free vCPUs before we can start testing.
- [x] Feedback from the former tests during August?
- [ ] Find a way to create the infrastructure using the 'elastic' option on IM-Dashboard
- [ ] Creation (how do we set the limits, minimum and maximum? Can we change the maximum from time to time, for example when hosting tutorials?)
- [ ] Benchmarks

guillaumeeb commented 1 year ago

👍 for me! I'll let @j34ni answer the first two questions. For point 4, not sure we really need a maximum 🙂.

j34ni commented 1 year ago

@tinaok:

1. Feel free to remove all the nodes you want from the existing infrastructures (I have not managed to log in to the IM Dashboard lately).
2. The two of them worked great. The one without EGI Check-in saved the day at the workshop, since a lot of the participants had not enrolled. I have also used them afterwards and they are quite "responsive"; more than 8GB RAM would be nice for some applications, though.

guillaumeeb commented 1 year ago

@j34ni @tinaok, I think at one point the Elastic Kubernetes offer from IM Dashboard was tried? Was there any reason why it was not kept?

j34ni commented 1 year ago

@guillaumeeb I do not remember that. Maybe it was at the same time as other things that failed, and we went back a few steps to get something working?

guillaumeeb commented 1 year ago

Yeah probably. I guess we just need to make some room on our VMs and try to redeploy an elastic version on Kubernetes to host our Pangeo platform.

sebastian-luna-valero commented 1 year ago

Ok, following up from https://github.com/pangeo-data/pangeo-eosc/issues/21

I think we never tried the elastic option before since we wanted to focus on other higher priority issues.

Have you tried the elastic option now? If so, could you please provide feedback?

Happy to help with this.

guillaumeeb commented 1 year ago

> Have you tried the elastic option now? If so, could you please provide feedback?

Not yet for my part. I can put this on my to-do list for the next days/weeks if @j34ni or @tinaok don't have time to do so.

tinaok commented 1 year ago

Hi @guillaumeeb, thank you very much. Unfortunately I have had no time to try it out (and will have no time at all this week either). Your help will be super appreciated.
I think it can be named 'pangeo-eosc'.

guillaumeeb commented 1 year ago

In the process of creating a Pangeo deployment with elastic Kubernetes on IM Dashboard, following https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md, I've got some questions/remarks (noting them here as much for other people as for myself):

- https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md#step-1-dns-name says to create a new domain, but what we really need is a DNS name, right? So a new host.
- I selected Kubernetes, then added the Elastic option and Grafana. No other options. I'm planning to do all the Daskhub setup via Helm only.
- On the IM Dashboard K8S cluster creation, HW Data tab, there is 1 front node and several workers. Is the front node only used for the K8S master components, or will it also host some user pods? This is important to know whether a small VM is enough, or whether we should use the same flavor as for the workers.
- Which Kubernetes version should we use? I kept the 1.23 default (not sure if Dask gateway or Jupyterhub is up to date).
- Cloud Provider / Select Site image: I chose Ubuntu Jammy. Is that OK?
- For Elastic K8S, we can only choose the maximum number of worker nodes (not the minimum, but this might be the number of workers entered on the K8S tab).
- EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).
- I don't know how to retrieve the EGI Check-in secrets other than by searching in a previous deployment. Is there a way through https://aai.egi.eu/federation?
- My K8S deployment is marked with status "running", not "configured", but it looks like the deployment is finished? Oh no, it just took a long time to end up in the "configured" state. I'm not sure all is well though: the prompt on the front node is not the same as on other deployments.

So after step 2 from EGI.md, I created my daskhub.yaml file as below:

dask-gateway:
  enabled: true
  gateway:
    auth:
      jupyterhub:
        apiToken: <token1>
      type: jupyterhub
    extraConfig:
      dasklimits: |
        c.ClusterConfig.cluster_max_cores = 6
        c.ClusterConfig.cluster_max_memory = "24 G"
        c.ClusterConfig.cluster_max_workers = 4
        c.ClusterConfig.idle_timeout = 1800
      optionHandler: |
        from dask_gateway_server.options import Options, Integer, Float, String

        def options_handler(options):
          if ":" not in options.image:
            raise ValueError("When specifying an image you must also provide a tag")
          return {
            "worker_cores": options.worker_cores,
            "worker_memory": int(options.worker_memory * 2 ** 30),
            "image": options.image,
          }

        c.Backend.cluster_options = Options(
          Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
          Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
          String("image", default="pangeo/ml-notebook:2022.09.21", label="Image"),
          handler=options_handler,
        )
    backend:
      worker:
        cores:
          limit: 4
        memory:
          limit: 8G
        threads: 2
dask-kubernetes:
  enabled: false
jupyterhub:
  hub:
    config:
      GenericOAuthenticator:
        client_id: <client>
        client_secret: <secret>
        oauth_callback_url: https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback
        authorize_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/auth
        token_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/token
        userdata_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/userinfo
        login_service: EGI Check-In
        scope:
          - openid
          - email
          - profile
          - eduperson_entitlement
        username_key: preferred_username
        userdata_params:
          state: state
        allowed_groups:
          - urn:mace:egi.eu:group:vo.pangeo.eu:role=member#aai.egi.eu
        claim_groups_key: eduperson_entitlement
      JupyterHub:
        authenticator_class: generic-oauth
    services:
      dask-gateway:
        apiToken: <token1>
  proxy:
    secretToken: <token2>
    service:
      type: ClusterIP
  singleuser:
    cpu:
      guarantee: 1
      limit: 2
    defaultUrl: /lab
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
    image:
      name: pangeo/ml-notebook
      tag: 2022.09.21
    memory:
      guarantee: 2G
      limit: 4G
    startTimeout: 600
    storage:
      capacity: 2Gi
      type: dynamic
rbac:
  enabled: true

and just issued the helm command:

sudo helm upgrade daskhub daskhub --repo=https://helm.dask.org --install --wait --cleanup-on-fail --create-namespace --namespace daskhub --version 2022.8.2 --values daskhub.yaml

Followed by reconfiguring ingress.
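Before running the helm command, a few cross-references inside daskhub.yaml can be checked mechanically. The sketch below is only an illustration: it works on a hand-written Python dict that mirrors part of the file above (the token and client values are placeholders), and the helper name `check_values` is hypothetical. It assumes, as the file above suggests by using `<token1>` twice, that the dask-gateway API token must be identical in both places, and that the callback path is JupyterHub's usual `/hub/oauth_callback`.

```python
from urllib.parse import urlparse

# Hand-written stand-in for the parsed daskhub.yaml; secrets are placeholders.
IDP = "https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect"
values = {
    "dask-gateway": {
        "gateway": {"auth": {"type": "jupyterhub", "jupyterhub": {"apiToken": "token1"}}}
    },
    "jupyterhub": {
        "hub": {
            "config": {
                "GenericOAuthenticator": {
                    "oauth_callback_url": "https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback",
                    "authorize_url": IDP + "/auth",
                    "token_url": IDP + "/token",
                    "userdata_url": IDP + "/userinfo",
                }
            },
            "services": {"dask-gateway": {"apiToken": "token1"}},
        }
    },
}


def check_values(values, expected_host):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []

    # The same token is expected on the gateway side and on the hub side.
    gateway_token = values["dask-gateway"]["gateway"]["auth"]["jupyterhub"]["apiToken"]
    hub_token = values["jupyterhub"]["hub"]["services"]["dask-gateway"]["apiToken"]
    if gateway_token != hub_token:
        problems.append("dask-gateway apiToken differs between gateway and hub")

    # The callback URL should point at the host that serves this deployment.
    oauth = values["jupyterhub"]["hub"]["config"]["GenericOAuthenticator"]
    callback = urlparse(oauth["oauth_callback_url"])
    if callback.hostname != expected_host:
        problems.append(f"callback host is {callback.hostname}, expected {expected_host}")
    if callback.path != "/hub/oauth_callback":
        problems.append(f"unexpected callback path {callback.path}")

    # All three identity-provider endpoints should live on the same host.
    hosts = {urlparse(oauth[key]).hostname for key in ("authorize_url", "token_url", "userdata_url")}
    if len(hosts) != 1:
        problems.append(f"OIDC endpoints point at several hosts: {sorted(hosts)}")

    return problems


print(check_values(values, "pangeo-elastic.vm.fedcloud.eu"))
```

An empty list means none of these particular checks failed; it does not prove that the login will work.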

The helm command and the kubectl command for the ingress worked with no error. I can see Jupyterhub, but I get a login error, the same as the other day when trying to access IM Dashboard.

I'll stop there for tonight. Could someone test the deployment at https://pangeo-foss4g.vm.fedcloud.eu and see whether they can log in?

Edit: correct link is https://pangeo-elastic.vm.fedcloud.eu/
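To see what the option handler in the daskhub.yaml above does with a user request, its logic can be exercised outside dask-gateway. This is a minimal sketch: the function body is copied from the configuration, while `SimpleNamespace` stands in for the options object that dask-gateway would pass (an assumption made only so the snippet runs on its own).

```python
from types import SimpleNamespace

GIB = 2 ** 30  # the "Worker Memory (GiB)" field is converted to bytes


def options_handler(options):
    # Same checks and conversion as in the daskhub.yaml above.
    if ":" not in options.image:
        raise ValueError("When specifying an image you must also provide a tag")
    return {
        "worker_cores": options.worker_cores,
        "worker_memory": int(options.worker_memory * GIB),
        "image": options.image,
    }


# Stand-in for a user asking for 2 cores and 4 GiB per worker.
requested = SimpleNamespace(
    worker_cores=2,
    worker_memory=4.0,
    image="pangeo/ml-notebook:2022.09.21",
)
print(options_handler(requested))
```

An image given without a tag, such as `pangeo/ml-notebook`, is rejected with the `ValueError` above.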

tinaok commented 1 year ago

Thank you @guillaumeeb!! I just tried to log on to https://pangeo-foss4g.vm.fedcloud.eu/jupyterhub/, but maybe this one is the (old) foss4g configuration? All my history is still there, so I guess this is not the new one you are creating based on elastic Kubernetes.

> https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md#step-1-dns-name says to create a new domain, but what we really need is a DNS name right, so a new host.

In the example it was pangeo.vm.fedcloud.eu, but I guess we can have something like pangeo-eosc.vm.fedcloud.eu?

> I selected Kubernetes, then added Elastic option and Grafana. No other options. I'm planning to do all the Daskhub thing via Helm only.

👍

I checked the IP address of the cluster you made from the swift dashboard, and I could reach the Grafana login portal.

> EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).

If I remember right, the button in the IM Dashboard showing the cluster status "configured" can be clicked, and it gave information about cloudadm.

guillaumeeb commented 1 year ago

Crap, I didn't indicate the right link.

The correct one is https://pangeo-elastic.vm.fedcloud.eu. This is only a test deployment for now. There is no need to add /jupyterhub/. Jupyterhub is available at /.

tinaok commented 1 year ago

Thanks for the new link! I confirm I get the error too: 52E898AC-49E8-4532-8973-D22F13986942

guillaumeeb commented 1 year ago

@sebastian-luna-valero do you have any idea why the auth might have failed?

I'll have another look at it tonight.

sebastian-luna-valero commented 1 year ago

Hi,

Could you please double check that the Redirect URI in https://aai-dev.egi.eu/federation has the same value as oauth_callback_url in the values.yaml?
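The comparison asked for here can be done by eye, but a small helper makes a mismatch easier to spot. This is only a sketch: `redirect_uri_differences` is a hypothetical name, and the assumption is that the registered URI and the configured callback must agree on scheme, host and path.

```python
from urllib.parse import urlsplit


def redirect_uri_differences(registered, configured):
    """List the URL components on which the two URIs disagree."""
    reg = urlsplit(registered.strip())
    conf = urlsplit(configured.strip())
    differences = []
    for part in ("scheme", "netloc", "path"):
        if getattr(reg, part) != getattr(conf, part):
            differences.append(
                f"{part}: registered {getattr(reg, part)!r} != configured {getattr(conf, part)!r}"
            )
    return differences


# Value of oauth_callback_url from the daskhub.yaml in this thread.
configured = "https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback"
# Example of a registration that still points at another deployment.
registered = "https://pangeo-foss4g.vm.fedcloud.eu/hub/oauth_callback"

for line in redirect_uri_differences(registered, configured):
    print(line)
```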

guillaumeeb commented 1 year ago

> Could you please double check that the Redirect URI in https://aai-dev.egi.eu/federation has the same value as oauth_callback_url in the values.yaml?

I think I get the problem: I simply didn't go through the registration of a new service in https://aai-dev.egi.eu/federation, as I thought I could reuse the old OpenID credentials. Could we use the same credentials by adding another Redirect URI to the form you're showing? How can we get access to the management of the already existing service?

sebastian-luna-valero commented 1 year ago

Trying to answer some questions:

> On IM Dashboard K8S cluster creation, HW Data tab, there are 1 front node, and several workers. Is the front node only used for K8S master component, or will it also host some user pods? This is important to know if a small VM is enough, or if we should use the same flavor as for workers.

I believe user pods end up running on worker nodes. However, for large clusters, I like to have a big flavor for the front-end too (with the master role).

> Which Kubernetes version should we use? I kept the 1.23 default (not sure if Dask gateway or Jupyterhub is up to date).

I normally take a note of the version I choose, trying to be reproducible. Last time I deployed 1.23.11 and it looked good.

> Cloud Provider/ Select Site image: I chose Ubuntu Jammy. Is that OK?

It should be ok, but I still prefer to stay with 20.04.

> For Elastic K8S, we can only choose the maximum of Worker nodes (not the minimum, but this might be the number of workers entered on K8S tab).

This is something that I wanted to explore myself, since I haven't tried it yet. I believe the minimum number of workers can also be configured from the CLI; maybe we need to ask for this option to be exposed on the IM Dashboard as well.

> EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).

Correct, would you open a PR with this and other suggestions? Happy to review.

> I don't know how to retrieve EGI Check-in secrets other than by searching in previous deployment. Is there a way through https://aai.egi.eu/federation?

Please make sure you use https://aai-dev.egi.eu/federation since you can self-approve your request to add this new service. The secrets are available on that form after the service is approved.

> My K8S deployment is marked in status "running", not "configured", but it looks like the deployment is terminated? Oh no, it took a long time to end up in "configured" state. I'm not sure all is well though, the prompt on the front node is not the same as on other deployments.

If you only get $ after ssh'ing into the front-end, that's expected after deployment. If other deployments had a different prompt, maybe it is because someone reconfigured the default option.

sebastian-luna-valero commented 1 year ago

> I think I get the problem: I simply didn't go through the registration of a new Service in https://aai-dev.egi.eu/federation as I thought I could reuse the old Open ID credentials. Could we use the same credentials by adding another Redirect URI to the form you're showing? How can we have access to the management of the already existing service?

Each service (i.e. each distinct URI) has its own credentials. These credentials are only available to the "service owner", the one who added the config to https://aai-dev.egi.eu/federation, I am afraid.

guillaumeeb commented 1 year ago

> Please make sure you use https://aai-dev.egi.eu/federation since you can self-approve your request to add this new service. The secrets are available on that form after the service is approved.

> Each service (different URIs) have each own credentials. These credentials are only available to the "service owner", the one adding the config to https://aai-dev.egi.eu/federation, I am afraid.

Okay, trying this now!

> Correct, would you open a PR with this and other suggestions? Happy to review.

Yes, I'm planning to do this once everything works fine.

Thanks for all the answers, that's really helpful!

guillaumeeb commented 1 year ago

@sebastian-luna-valero Thanks to your input, I created a service in https://aai-dev.egi.eu/federation/egi/services and self-approved it. I think I was able to give you access to this service.

However, it is still pending ("Deployment in progress" status). Do you know how long it can take (it has already been about 30 min)?

In the meantime, I'm trying NativeAuthenticator while waiting for the OIDC credentials to be ready.

guillaumeeb commented 1 year ago

So the platform seems to be working (Jupyterhub and Dask-gateway); however, I don't see any scaling up when I ask for more pods.

I've launched a Dask-gateway cluster and scaled it. The default platform has only one worker node with 8 CPUs and 32GB.
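For readers who want to reproduce this from a notebook, launching and scaling a cluster with the dask-gateway client typically looks like the sketch below. It is an illustration, not the exact code used here: the function name `start_scaled_cluster` is hypothetical, the option names `worker_cores` and `worker_memory` are the ones declared in the daskhub.yaml above, and `Gateway()` without arguments assumes the notebook environment already provides the gateway address and authentication, as it normally does inside a DaskHub deployment.

```python
def start_scaled_cluster(n_workers=4, worker_cores=1, worker_memory=2):
    """Create a dask-gateway cluster and ask for n_workers workers."""
    # Imported lazily so this file can be loaded where dask-gateway is absent.
    from dask_gateway import Gateway

    gateway = Gateway()  # address and auth are taken from the environment

    # Fill in the options exposed by the server-side option handler.
    options = gateway.cluster_options()
    options.worker_cores = worker_cores
    options.worker_memory = worker_memory  # in GiB, per the option label

    cluster = gateway.new_cluster(options)
    cluster.scale(n_workers)  # each worker becomes one pod on the K8S side
    return cluster
```

Every worker requested this way turns into a pod that the Kubernetes scheduler has to place, which is where an elastic cluster would be expected to add nodes.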

Here are my pods waiting:

$ sudo kubectl get pods -n daskhub
NAME                                                 READY   STATUS    RESTARTS   AGE
api-daskhub-dask-gateway-547c8f684-vmxlw             1/1     Running   0          17m
continuous-image-puller-9kf6h                        1/1     Running   0          23h
controller-daskhub-dask-gateway-6d988656cf-66kht     1/1     Running   0          23h
dask-scheduler-04b0b67bffc840cc9f7bb0dc24c7c350      1/1     Running   0          19m
dask-scheduler-449348c96b1f4f76b530b86a5286bb4a      1/1     Running   0          15m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-2ptp2   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-5wnvc   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-crjkk   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-lh25q   1/1     Running   0          19m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-7w6tv   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-9d79d   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-cgs8t   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-k54br   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-mhrln   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-nq5z9   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-qp52k   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-rk28n   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-rm7vt   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-wwvt2   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-zb55p   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-zv2m5   0/1     Pending   0          4m43s
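A listing like this is easier to follow when summarised per status. The sketch below counts pods by their STATUS column; `count_pod_statuses` is a hypothetical helper and the sample is a four-line excerpt of the output above. Applied to the full listing, it would report 9 Running and 12 Pending pods.

```python
from collections import Counter

# Excerpt of the `kubectl get pods -n daskhub` output shown above.
SAMPLE = """\
NAME                                                 READY   STATUS    RESTARTS   AGE
api-daskhub-dask-gateway-547c8f684-vmxlw             1/1     Running   0          17m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-2ptp2   1/1     Running   0          19m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-7w6tv   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-cgs8t   0/1     Pending   0          15m
"""


def count_pod_statuses(kubectl_output):
    """Count pods per value of the STATUS column."""
    lines = kubectl_output.strip().splitlines()
    # Locate the STATUS column from the header instead of hard-coding it.
    status_column = lines[0].split().index("STATUS")
    return Counter(line.split()[status_column] for line in lines[1:] if line.strip())


print(count_pod_statuses(SAMPLE))
```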

And if I check the details of one Pending pod:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  16m                default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  14m (x1 over 15m)  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

But I still have two nodes:

$ sudo kubectl get nodes
NAME                     STATUS   ROLES                  AGE   VERSION
kubeserver.localdomain   Ready    control-plane,master   24h   v1.23.11
vnode-1.localdomain      Ready    <none>                 24h   v1.23.11

Not sure where to look to see where the problem could be.
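The FailedScheduling message above already says why each node was rejected: the single worker node has no CPU left, and the master node carries a taint that the pods do not tolerate. The sketch below splits such a message into per-reason node counts. It is a hypothetical helper written against the message format seen in this thread only; other Kubernetes versions may word the message differently.

```python
import re

# Message copied from the pod events above.
MESSAGE = (
    "0/2 nodes are available: 1 Insufficient cpu, "
    "1 node(s) had taint {node-role.kubernetes.io/master: }, "
    "that the pod didn't tolerate."
)


def parse_failed_scheduling(message):
    """Return (available, total, {reason: node_count})."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if head is None:
        raise ValueError("not a FailedScheduling message")
    available, total = int(head.group(1)), int(head.group(2))

    reasons = {}
    # Each reason starts with a node count, so split just before ", <digits> ".
    for chunk in re.split(r", (?=\d+ )", head.group(3).rstrip(".")):
        count, reason = chunk.split(" ", 1)
        reasons[reason] = int(count)
    return available, total, reasons


available, total, reasons = parse_failed_scheduling(MESSAGE)
print(f"{available} of {total} nodes can host the pod")
for reason, nodes in reasons.items():
    print(f"  {nodes} node(s): {reason}")
```

With both nodes ruled out, the pods stay Pending until capacity is added, which is exactly what the elastic option is supposed to take care of.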

guillaumeeb commented 1 year ago

I also see

44 updates can be applied immediately.
To see these additional updates run: apt list --upgradable

*** System restart required ***

when connecting to the front node. Should I do something about that?

guillaumeeb commented 1 year ago

> Correct, would you open a PR with this and other suggestions? Happy to review.

> Yes I'm planning to do this once everything works fine.

PR opened at #28.

Still waiting for https://aai-dev.egi.eu/federation/egi/services: my service is still in "Deployment in Progress" status. Maybe I've done something wrong. @sebastian-luna-valero, do you have any hints?

I also tested manual scaling using IM Dashboard on this new deployment, and it worked well.

sebastian-luna-valero commented 1 year ago

> However, it is still pending (Deployment in progress status). Do you know how much time it can take (it's already been about 30min)?

My bad, we should use https://aai.egi.eu/federation (instead of https://aai-dev.egi.eu/federation). Then select Development in the Integration Environment option. This was actually correct in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md

Please try again, and this should solve the issue (i.e. after a few seconds, the service will be automatically Deployed).

> So the platform seems to be working (Jupyterhub and Dask-gateway), however, I don't see any scaling up when I ask for more pods.

I am double checking why that would be the case in: https://github.com/grycap/clues/issues/114#issuecomment-1263492314

> System restart required

Ideally we should update and restart all the VMs in the cluster before the initial deployment, and then periodically to apply updates to the underlying operating system. This implies downtime, so I would do it immediately before the workshop and immediately afterwards.

> I also tested manual scaling using IM Dashboard on this new deployment, this worked well.

Great!

guillaumeeb commented 1 year ago

> Please try again, and this should solve the issue (i.e. after a few seconds, the service will be automatically Deployed).

This worked! Thanks a lot.

So that leaves us with the Elastic functionality not working. I just tried again by scaling a Dask cluster, but I still have pods that are not able to be scheduled due to insufficient resources, and no new nodes incoming.

@sebastian-luna-valero maybe we should open a new issue in https://github.com/grycap/clues rather than adding a comment in an existing issue? cc @micafer.

Everyone should be able to log in to the https://pangeo-elastic.vm.fedcloud.eu deployment. I've just noticed a display error with Dask-gateway when displaying the cluster object in the notebook, but it's probably a minor bug.

micafer commented 1 year ago

> So that leaves us with the Elastic functionality not working. I just tried again by scaling a Dask cluster, but I still have pods that are not able to be scheduled due to insufficient resources, and no new nodes incoming.

> @sebastian-luna-valero maybe we should open a new issue in https://github.com/grycap/clues rather than adding a comment in an existing issue? cc @micafer.

Please send me an email with a detailed description of the problem and we can try to debug the issue.

tinaok commented 1 year ago

@guillaumeeb

> Everyone should be able to login to the https://pangeo-elastic.vm.fedcloud.eu deployment. I've just noted a display error using Dask-gateway when displaying cluster object in the notebook, but it's probably some minor bug.

I logged in and it looks great! Thank you @guillaumeeb!! I haven't tested Dask yet, but I do not see the cloud bucket. Is it the same Pangeo notebook Docker image as the one we used for the https://pangeo-foss4g.vm.fedcloud.eu/ infrastructure?

guillaumeeb commented 1 year ago

You mean the S3 browser on the left side bar? Yes, not sure why it is not there.

I used pangeo/ml-notebook in the latest available version. See the YAML file in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md. I cannot check right now, but we might have used the pangeo-notebook image in the other deployment. I thought that ml-notebook was more complete, but I may well be wrong. This can be easily changed!

Other than that, it's the exact same deployment of Daskhub, so there won't be more functionality. As the elastic part of Kubernetes is not working currently, there is no benefit in using this deployment instead of the other one.

sebastian-luna-valero commented 1 year ago

I think https://github.com/IBM/jupyterlab-s3-browser needs to be added explicitly (and we could update https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md accordingly)

We are currently trying to solve the elastic k8s option, and we will report back here.

In the meantime, manual scaling up and down is the best option.

tinaok commented 1 year ago

Hi, I think the foss4g configuration was based on the pangeo-notebook Docker image and not on the ml-notebook one. The purpose was not to use too many resources. @j34ni or @annefou, can you please confirm?

@sebastian-luna-valero

Once the automatic scaling up works, does the existing Daskhub need to be destroyed and re-created to benefit from it?

j34ni commented 1 year ago

Yes, it used the pangeo-notebook:latest

sebastian-luna-valero commented 1 year ago

Yes, we would need to redeploy to get the elasticity.

guillaumeeb commented 1 year ago

I just redeployed the Daskhub with the pangeo/pangeo-notebook Docker image, and the S3 Object Storage Browser tab is back!

So now we're in the same setup, we just need to wait and see if we manage to get elasticity working.

Also, I still encounter an error when starting a dask-gateway cluster, but this does not prevent using it. I'll open an issue on the pangeo-docker repo to get some feedback. This is probably due to the latest version of the image. We can pin a previous one if needed.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/IPython/core/formatters.py:921, in IPythonDisplayFormatter.__call__(self, obj)
    919 method = get_real_method(obj, self.print_method)
    920 if method is not None:
--> 921     method()
    922     return True

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:1225, in GatewayCluster._ipython_display_(self, **kwargs)
   1223 widget = self._widget()
   1224 if widget is not None:
-> 1225     return widget._ipython_display_(**kwargs)
   1226 else:
   1227     from IPython.display import display

AttributeError: 'VBox' object has no attribute '_ipython_display_'
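The traceback shows the dask-gateway client calling `_ipython_display_` on a `VBox` widget that does not have this method. My assumption, not confirmed in this thread, is that the widget library shipped in the newer image exposes `_repr_mimebundle_` instead. The sketch below illustrates the failure mode and a defensive pattern with stub classes; `LegacyWidget`, `NewerWidget` and `display_widget` are hypothetical names and not part of dask-gateway or ipywidgets.

```python
class LegacyWidget:
    """Stub for a widget that still has the old display hook."""

    def _ipython_display_(self, **kwargs):
        return "legacy display hook"


class NewerWidget:
    """Stub for a widget that only offers a MIME bundle."""

    def _repr_mimebundle_(self, **kwargs):
        return {"text/plain": "newer widget"}


def display_widget(widget, **kwargs):
    """Use whichever display hook the widget actually provides."""
    hook = getattr(widget, "_ipython_display_", None)
    if hook is not None:
        return hook(**kwargs)
    bundle = getattr(widget, "_repr_mimebundle_", None)
    if bundle is not None:
        return bundle(**kwargs)
    return repr(widget)  # last resort: plain text


print(display_widget(LegacyWidget()))
print(display_widget(NewerWidget()))
```

Calling `NewerWidget()._ipython_display_()` directly raises the same kind of `AttributeError` as above, which is why pinning a previous image is a reasonable stopgap.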

guillaumeeb commented 1 year ago

And I confirm that ml-notebook image does not have s3-browser installed, see https://github.com/pangeo-data/pangeo-docker-images/issues/383.

sebastian-luna-valero commented 1 year ago

Do we want to keep pangeo/ml-notebook in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md or should we replace it with pangeo/pangeo-notebook?

guillaumeeb commented 1 year ago

We should replace it for now, feel free to do it!

sebastian-luna-valero commented 1 year ago

Sure! https://github.com/pangeo-data/pangeo-eosc/pull/29

guillaumeeb commented 1 year ago

Just for information, I'm going to delete the pangeo-elastic infrastructure and create a new one using the operational IM Dashboard instance, after discussing with @micafer.

guillaumeeb commented 1 year ago

I just redeployed the pangeo-elastic infrastructure. I see an improvement, but there are still things to debug before elasticity works.

Creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard #22