Optimise resource - Githubissues

tinaok commented 1 year ago

We have limited VCPUs resource allocated. When we create clusters, even if we do not use it, we are consuming the resource.
As long as we are using 'non elastic' Kubernets cluster, we need to manually control this.

I just logged on openstack dashboard and we have no tutorial sessions going on but we use 240VCPUS now.

I tried to shutdown from IM Dashboard but I do not find any cluster in my interface.

How can I shut them down ?

tinaok commented 1 year ago

@sebastian-luna-valero I think it would be a good idea for IM-dashboard, that all EGI VO admins have access to clusters created automatically.

guillaumeeb commented 1 year ago

This is clearly something we should optimize! We need to defines how many resources we want now and in the following weeks/months. How many VMs do we need without event?

And then we need to clarify how to do it, and who can. We can probably do something through Openstack UI too.

Of course, ideally we should advance on elastic kubernetes set-up!

sebastian-luna-valero commented 1 year ago

Under the Change User bullet point you can find how to share access to virtual infrastructure via IM Dashboard: https://docs.egi.eu/users/compute/orchestration/im/dashboard/#list-of-actions

I believe it is better to do this explicitly (i.e. you choose who to share with) rather than automatically for security reasons.

If you plan to use the elastic cluster, I suggest to do as many tests as possible before the upcoming CLIVAR workshop in October. The main aspect we should consider is to disconnect DaskHub from EGI Check-In, allow the native authentication mechanism, and perform stress tests with fake users. Happy to participate and contribute.

Also, as discussed via email, we are happy to offer more computational resources to the vo.pangeo.eu VO, preferably on a different cloud provider to balance the load across the EGI federation. This would also allow to have two deployments up and running for workshops with overlapping occurrence. If we want to go down this route, we would need the required amount of vCPUs, RAM and storage to negotiate access with a new cloud provider as soon as possible.

guillaumeeb commented 1 year ago

The main aspect we should consider is to disconnect DaskHub from EGI Check-In, allow the native authentication mechanism, and perform stress tests with fake users. Happy to participate and contribute.

You propose to do this in order to check that the "Elastic" functionnality work? If so, we can also verify this using Dask Gateway.

With Elastic Kubernetes, can we chose to increase the minimum number of VMs before a workshop, through IM Dashboard or by another mean?

cc @annefou for the more computational resources on a different provider. Not sure how workshops are overlapping in resources need. This might get tricky to handle if we have to Jupyterhub URLs, and we need to copy datasets in two places.

tinaok commented 1 year ago

@sebastian-luna-valero @guillaumeeb I had a meeting with organiser of CLIVER workshop this morning. It is not yet clear the size of data set. we'll need about 30 Jupiter lab instances. But I do not have example of codes which will run, or computation estimate, so I can not estimate the shape of dask worker, neither the number of them we need.

tinaok commented 1 year ago

Still I think it is a good idea to test other infrastructure, Rough estimate I had for disk space is only 10To From former experience with some applications, so lets say

shape of dask workers as each have as 16GB of memory with 4 threads(to 8?),
jupyterlab instance should have at least 16GB & 4 to 8 threads.

at least 2-4 dask workers for each student and 1 jupyterlab ?

sebastian-luna-valero commented 1 year ago

Hi,

You propose to do this in order to check that the "Elastic" functionnality work? If so, we can also verify this using Dask Gateway.

Great, much easier then!

With Elastic Kubernetes, ca we chose to increase the minimum number of VMs before a workshop, through IM Dashboard or by another mean?

I need to investigate this, and will report back. More importantly, we need to plan and test it before the workshop.

Still I think it is a good idea to test other infrastructure,

Great. I will start looking for an alternative provider.

we'll need about 30 Jupiter lab instances. But I do not have example of codes which will run, or computation estimate, so I can not estimate the shape of dask worker, neither the number of them we need.

According to:

vCPUs: (n. of users x n. of JupyterLabs x n. of threads) + (n. of users x n. of dask workers x n. of threads)
RAM: (n. of of users x n. of JupyterLabs x 16 GB) + (n. of users x n. of dask workers x 16 GB)
Storage: 10 TB

For 30 users, 2 dask workers per user, and 4 vCPUs per user/worker we get:

480 vCPUs
1440 GB RAM

Object storage is: 10 TB

I will check what's possible and get back to you.

Best regards, Sebastian

guillaumeeb commented 1 year ago

Getting back to the optimizing current resources part, I have several questions:

Now that we can see our deployment through IM Dashboard, I just tried to delete 2 VMS (8 and 9) from there, but get the error : ERROR Error making terminate op on VM 9: Error Removing resources: Error removing resources: ['No auth data has been specified to OpenStack.']. Is this a known issue? Should I delete those VMs through Openstack? Will it cause any trouble (I don't thin so knowing Kubernetes a bit)?
I was planning to leaving 6 VMs to the current working deployment, has that leaves enough room on our Openstack quota to test Elastic Kubernetes, is that OK for everyone?
I believe we have another Kubernetes cluster / Pangeo deployment without EGI check-in, from the look of Openstack UI, it uses 3 VMs. Should we keep it? Can somebody give me access to it through IM Dashboard?

micafer commented 1 year ago

Hi @guillaumeeb

Now that we can see our deployment through IM Dashboard, I just tried to delete 2 VMS (8 and 9) from there, but get the error : ERROR Error making terminate op on VM 9: Error Removing resources: Error removing resources: ['No auth data has been specified to OpenStack.']. Is this a known issue? Should I delete those VMs through Openstack? Will it cause any trouble (I don't thin so knowing Kubernetes a bit)?

This error means that you do not have correctly defined the credentials to access the cloud site. Could you check that you have the site defined in your credentials section?

I believe we have another Kubernetes cluster / Pangeo deployment without EGI check-in, from the look of Openstack UI, it uses 3 VMs. Should we keep it? Can somebody give me access to it through IM Dashboard?

If you want to deploy a K8s cluster on an OpenStack site (using OpenStack credentials) you only have to add it in the credentials section of the IM-Dashboard setting the needed authentication data.

sebastian-luna-valero commented 1 year ago

Thanks @micafer !

This error means that you do not have correctly defined the credentials to access the cloud site. Could you check that you have the site defined in your credentials section?

@guillaumeeb FYI: https://docs.egi.eu/users/compute/orchestration/im/dashboard/#cloud-credentials

I believe we have another Kubernetes cluster / Pangeo deployment without EGI check-in, from the look of Openstack UI, it uses 3 VMs. Should we keep it? Can somebody give me access to it through IM Dashboard?

I think it was created by @j34ni so he should be able to share the cluster with you.

I was planning to leaving 6 VMs to the current working deployment, has that leaves enough room on our Openstack quota to test Elastic Kubernetes, is that OK for everyone?

Great!

j34ni commented 1 year ago

@guillaumeeb Sorry for not answering earlier, I am tied up with something else I created this other infrastructure with the basic Jupyter sign-in for participants at the FOSS4G workshop who were not able to use the EGI Check-in. It is now reduced in size and we thought about keeping it but if you need more resources and nobody else needs it then you can delete it entirely Send me your token so that I can give you access

guillaumeeb commented 1 year ago

Thanks @micafer

This error means that you do not have correctly defined the credentials to access the cloud site. Could you check that you have the site defined in your credentials section?

@guillaumeeb FYI: https://docs.egi.eu/users/compute/orchestration/im/dashboard/#cloud-credentials

So I have a Cloud cedential for CESNET: Host: https://identity.cloud.muni.cz VO: vo.pangeo.eu

I used it to deploy my own Kubernetes cluster.

However, I didn't have it configure when @j34ni gave me access to the other infrastructure (74ab3cc8-1e2d-11ed-8c48-0ee20d64cb6e ). This infrastructure has a status 'unknown' whereas the other I deployed is 'configured'. I should probably try to achieve a correct status before trying to manage it from IM Dashboard, but I don't know what to do. Could that comes from the fact I used a different ID for CESNET cloud provider than @j34ni used? @j34ni, do you see the foss4g infrastructure in a 'configured' status?

I created this other infrastructure with the basic Jupyter sign-in for participants at the FOSS4G workshop who were not able to use the EGI Check-in. It is now reduced in size and we thought about keeping it but if you need more resources and nobody else needs it then you can delete it entirely

@j34ni I don't need more resources currently, it's perfectly fine for me to keep it. I'll sent you my credentials by email, but it is not mandatory that I see this cluster.

guillaumeeb commented 1 year ago

Hi everyone,

for @annefou @tinaok @j34ni.

So @j34ni deleted the instance without EGI checkin. We currently have the pangeo-foss4g instance running with a lot of resources, and a new pangeo-elastic instance that I deployed to test Elastic Kubernetes. The elastic functionality is not working right now.

What are the plan for the upcoming workshop?

Do you want to:

Continue using pangeo-foss4g as it suits your need and is stable?
Deploy a new pangeo-eosc platform based on the pangeo-elastic one but with more resources?

We could also use the pangeo-elastic directly by adding new nodes, but I intended to keep it for testing purpose.

It should be pretty easy to deploy a new instance with more resources (and users limitations on dask-gateway side: Dask cluster size limits)) if you want.

sebastian-luna-valero commented 1 year ago

I had a meeting with organiser of CLIVER workshop this morning. It is not yet clear the size of data set. we'll need about 30 Jupiter lab instances. But I do not have example of codes which will run, or computation estimate, so I can not estimate the shape of dask worker, neither the number of them we need.

Now that we are closer to the workshop, do we have a better estimate of the required capacity?

I will check what's possible and get back to you.

I am currently struggling to find a new provider on time, so in the end we may simply ask CESNET to increase the available capacity for the CLIVAR workshop, if that's ok. However, the discussion to get a new provider is still on the table and if it's not available for the CLIVAR workshop we will try to have it for the following one. Therefore, the deployment for the CLIVAR workshop can stay at full capacity for longer, even if there is an overlap with the following workshop in November.

Please let me know your thoughts.

annefou commented 1 year ago

Yes I think it is OK to stick to CESNET.

For the infrastructure, maybe we could just "rename" foss4g to pangeo-eosc or similar and add more resources. I think it takes time to add resources and the bootcamp starts next week.

On my side, I am slightly worried because we know little about the datasets and they will most likely download many during the workshop. Do we have storage like for foss4g? They also have a minio instance in Denmark but it may not be very efficient for reading large amount of data.

sebastian-luna-valero commented 1 year ago

Yes, we have 10TB object storage, but write-access is only allowed to Pangeo Admins at the moment. Trainees only have read-only access, is that ok?

Regarding adding resources, should we try to match the amount of resources requested in https://github.com/pangeo-data/pangeo-eosc/issues/21#issuecomment-1261033915 or can you confirm more accurate numbers?

annefou commented 1 year ago

Ok. Read access should be OK. For writing results and other data, they can use their minio.

Yes I think the amount of resources requests in https://github.com/pangeo-data/pangeo-eosc/issues/21#issuecomment-1261033915 is OK for this course. Thanks a lot.

guillaumeeb commented 1 year ago

For the infrastructure, maybe we could just "rename" foss4g to pangeo-eosc or similar and add more resources. I think it takes time to add resources and the bootcamp starts next week.

Renaming can be tricky, and won't be much faster than rebuilding a fresh infrastructure.

If needed, I can deploy a new pangeo-eosc platform tonight or tomorrow. Adding resources don't seem to be a problem, on my testing instance it was pretty fast to add a node (something like a few minutes, less than 10). However, it will need to be validated before we delete the pangeo-foss4g one.

On my side, I cannot add resources on pangeo-foss4g deployment, I don't know if @j34ni can?

tinaok commented 1 year ago

Ok. Read access should be OK. For writing results and other data, they can use their minio.

During the workshop, when users use dask, and probably we will need to work with temporally Zarr files.
Thus users need to read/write Zarr file efficiently.
So user need a private disk space in CESNET.

j34ni commented 1 year ago

I can see the pangeo-elastic infrastructure as red, is it normal ?

I have added 2 hpc.16core-64ram-ssd-ephem nodes (16 CPUs, 64.0 GB of RAM , 80.0 GB of HD) to pangeo-foss4g and they show as "running" (orange), so I guess that it is in the process of working and will hopefully soon be green

If it all turns green, how many more nodes should I add?

As for the name, we have a "pangeo-egi" ready and should in principle be able to switch easily without disturbing "pangeo-eosc"

j34ni commented 1 year ago

@sebastian-luna-valero

Is it possible to increase the size of the disk on an existing infrastructure (currently 931GiB)?

Also, on Openstack I can see a lot of 80GB volumes which are apparently not in use (they must have been a left over from previous infrastructures and/or VMs!?), can/should we delete them?

guillaumeeb commented 1 year ago

Thus users need to read/write Zarr file efficiently. So user need a private disk space in CESNET.

I'm afraid we don't know how to do that currently with CESNET Object storage. We should try to advance https://github.com/pangeo-data/pangeo-eosc/issues/17, but I'm not sure we ca easily answer this need. If I understand correctly, even with https://github.com/pangeo-data/pangeo-eosc/pull/23, users need to have an account on EGI Checkin operational service to have write access using either Swift or S3 interfaces. Is that correct @sebastian-luna-valero?

We could generate either a Swift token or S3 credentials and share them, but this won't be very secure has this means users would be able to delete every bucket we have.

I can see the pangeo-elastic infrastructure as red, is it normal ?

Not really, I have some exchanges with Miguel to try to make Elastic Kubernetes work, maybe the manipulations done had a side effect, I also see the infrastructure as Red, but it's working.

I also still see pangeo-foss4g as unknown on my side.

If it all turns green, how many more nodes should I add? As for the name, we have a "pangeo-egi" ready and should in principle be able to switch easily without disturbing "pangeo-eosc"

Let's wait for @annefou or @tinaok, I guess if we chose to keep the pangeo-foss4g infrastructure, we'll want to add as many nodes as we can, just leaving a few resources for testing.

Is it possible to increase the size of the disk on an existing infrastructure (currently 931GiB)?

I'm not sure of what we are talking about here, are these local volumes attached to VMs? How would you want to use those volumes? I don't feel that using local VM storage (if this is what we are talking about) will solve the temporary storage problem. I'm under the impression we need a storage accessible from every VMs, a shared file system or object storage.

j34ni commented 1 year ago

@guillaumeeb

I was talking about the shared file system

tinaok commented 1 year ago

If it all turns green, how many more nodes should I add? As for the name, we have a "pangeo-egi" ready and should in principle be able to switch easily without disturbing "pangeo-eosc"

Let's wait for @annefou or @tinaok, I guess if we chose to keep the pangeo-foss4g infrastructure, we'll want to add as many nodes as we can, just leaving a few resources for testing.

As long as we can't make the cluster elastic, I think we better keep the pangeo-foss4g infra. But I prefer we 'rename' it. (also by renaming it the data I have in NFS server of pangeo-foss4g will be there!! so it is good ;-) I do not mind if it is pangeo-egi or pangeo-eosc. Please take the one which is most convenient for you.

If it all turns green, how many more nodes should I add?

Please put as much as possible but keep some resource for @guillaumeeb and so on to work on elastic version and may be binder tests.

tinaok commented 1 year ago

https://github.com/pangeo-data/pangeo-eosc/issues/21#issuecomment-1261033915

I have update for number of students /mentors. We'll have 22 student, 14 mentors (including anne and myself) Now I got 20 students and 6 mentors enrolled.
They are directed to test their login and xarray with https://pangeo-foss4g.vm.fedcloud.eu So once @j34ni (?) rename the infrastructure, I'll ask them to change the link.

@guillaumeeb I was talking about the shared file system

I agree for increasing NFS disk space if possible, For some users, they would try creating Zarr in local dask cluster. It is not 'optimal ' parallel computing, but until we fix the possibilities for creating Zarr store in object storage, I think it is good to have this solution.

guillaumeeb commented 1 year ago

Also, on Openstack I can see a lot of 80GB volumes which are apparently not in use (they must have been a left over from previous infrastructures and/or VMs!?), can/should we delete them?

I was talking about the shared file system

@j34ni, what I can tell from Openstack dashboard is that every VM (13 at the moments) uses a 80GB local disk space.

On the two Kubernetes front nodes, the Tosca template also mounts another volume which according to the documentation on IM Dashboard is used to store Kubernetes Persistent Volumes. Those persistent volumes are disk space that can be requested by pods, and are used for example by Jupyterhub to get a persistent volume on the Jupyter notebook pod of each user, mounted on /home/jovyan. This is the volume of 931GiB you're talking about.

However, this space is not mounted and so not visible on Dask worker pods created by dask-gateway. I'm not sure if this is feasible. So this is a space that is shared between users, but not shared in the sense of distributed computing, more a space that is kept between Jupyter sessions. As @tinaok said, we'll only be able to use this space with Dask LocalClusters, and I'm not sure of the performances we can get if many users work on it at the same time.

But I prefer we 'rename' it. (also by renaming it the data I have in NFS server of pangeo-foss4g will be there!! so it is good ;-) I do not mind if it is pangeo-egi or pangeo-eosc. Please take the one which is most convenient for you.

So I guess we'll use pangeo-egi that @j34ni has already booked. I'll let you do the change @j34ni, it looks you know how to do it properly!

Please put as much as possible but keep some resource for @guillaumeeb and so on to work on elastic version and may be binder tests.

For my tests, I'd say 32 cores and 128 GiB is enough! And I already use 24 cores.

tinaok commented 1 year ago

Thank you @guillaumeeb

As @tinaok said, we'll only be able to use this space with Dask LocalClusters, and I'm not sure of the performances we can get if many users work on it at the same time.

I totally agree about performance issue.

I'm trying to run some notebook from https://github.com/pangeo-gallery/cmip6 at our future pangeo-egu , I would probably want to investigate how users can 'add' their own data on cloud so that it can be loaded from dask workers.
I guess I should follow the instruction from #23 ?? (https://github.com/sebastian-luna-valero/pangeo-eosc/blob/egi/EGI-CLI-Swift-S3.md)

tinaok commented 1 year ago

Students will be formed as working groups. (each have 3-4 student, i.e. about 6-7 working groups) They should be sharing same data sets to work. So I guess we can create public cloud bucket for each working group, and we need to give them full access (read/write)?

tinaok commented 1 year ago

(or may be we should create a new issue for organising s3 disk space for cliver workshop?)

annefou commented 1 year ago

@tinaok I will also add some information on how to access CMIP6 data. I can also reuse some information from https://nordicesmhub.github.io/forces-2021/learning/data/CMIP6_data.html

guillaumeeb commented 1 year ago

I would probably want to investigate how users can 'add' their own data on cloud so that it can be loaded from dask workers. I guess I should follow the instruction from https://github.com/pangeo-data/pangeo-eosc/pull/23 ??

Currently, only us can add data on the CESNET object store. #23 only propose to create buckets in another Openstack project, so that in the future we can investigate how to give finer control to other users, but we haven't a solution yet. If you want to add data to the cloud storage, and make some bucket accessible with write access by other users at one point (when we figured how to do so), I think you should follow #23.

So I guess we can create public cloud bucket for each working group, and we need to give them full access (read/write)?

Public is not mandatory if we can give them full access. Maybe we should discuss this in #17? Or a new issue as you propose.

annefou commented 1 year ago

Maybe we should separate input data from data generated by students (that is usually what I do with my courses). With students it can become quickly very messy if they all have write access to "input" data. Mentors are usually in charge of organizing data for their respective groups so maybe mentors could have write access and students read access only do "input" data.

sebastian-luna-valero commented 1 year ago

Hi

Is it possible to increase the size of the disk on an existing infrastructure (currently 931GiB)?

This is the volume mounted on the k8s front-end node and exported as NFS to the k8s worker nodes. I believe this is used as the storage class for jupyterhub.hub.singleuser.storage in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md

Increasing that might be possible but not easy, and for sure it will imply service disruption. Have you run out of space in there?

Also, on Openstack I can see a lot of 80GB volumes which are apparently not in use (they must have been a left over from previous infrastructures and/or VMs!?), can/should we delete them?

Yes, please delete those with status Available. I think these are coming from k8s nodes deleted manually when shrinking the cluster.

I'm afraid we don't know how to do that currently with CESNET Object storage. We should try to advance https://github.com/pangeo-data/pangeo-eosc/issues/17, but I'm not sure we ca easily answer this need. If I understand correctly, even with https://github.com/pangeo-data/pangeo-eosc/pull/23, users need to have an account on EGI Checkin operational service to have write access using either Swift or S3 interfaces. Is that correct @sebastian-luna-valero?

Correct. As explained in https://github.com/pangeo-data/pangeo-eosc/issues/17#issuecomment-1259099227 Pangeo users need to enroll the vo.pangeo.eu in aai.egi.eu. So far they are enrolling into the aai-dev.egi.eu instance instead.

We could generate either a Swift token or S3 credentials and share them, but this won't be very secure has this means users would be able to delete every bucket we have.

Correct. This is what we are trying to solve before merging https://github.com/pangeo-data/pangeo-eosc/pull/23. Given the tight timeline I see two options: 1) users have read-only access to object storage at CESNET (i.e. we stay as we are now); 2) we allow write access to users, bearing in mind that they will be able to modify everyone's buckets.

sebastian-luna-valero commented 1 year ago

@guillaumeeb @j34ni

Please remember to use the hpc.16core-64ram-ssd-ephem nodes (16 CPUs, 64.0 GB of RAM , 80.0 GB of HD) flavors for k8s nodes.

j34ni commented 1 year ago

I can see no change since yesterday on the IM dashboard (everything is still red), however on openstack it seems that the project is now allocated 1736 vCPUs

Is it a bug?

It would be great if it was true and if we could actually use them though...

tinaok commented 1 year ago

@guillaumeeb

I just benched https://pangeo-foss4g.vm.fedcloud.eu/ and https://pangeo-elastic.vm.fedcloud.eu

with same cmip6 notebook https://github.com/pangeo-gallery/cmip6/blob/master/ECS_Gregory_method.ipynb

elastic has 4GBRAM on each dask worker and dask distribute version 2022.09.01
foss4g has 2GB RAM and dask version 2022.07

I used 4 dask worker (32GB) with elastic, 26 dask worker (52GB( with foss4g,

elastic could handle the work, but foss4g, failed.

According to @keewis dask had big update recently, and probably that plays, and I think dask worker with just 2GB RAM is small for heavy duty.

The foss4g cluster can be updated to recent Pangeo-notebook docker image but keeping all the all data on it? (I mean data on NFS)

sebastian-luna-valero commented 1 year ago

elastic could handle the work, but foss4g, failed.

Interesting, I was just discussing with CESNET about the quotas, and in https://github.com/pangeo-data/pangeo-eosc/issues/21#issuecomment-1260698573 it was asked 2GB RAM per vCPU. Let's see if they can provide 4GB RAM per vCPU instead.

It would be great if it was true and if we could actually use them though...

CESNET may need some additional time to do the checks.

sebastian-luna-valero commented 1 year ago

@tinaok sorry but I would like to clarify what configuration worked for you:

how many dask workers per user? how many CPUs per dask worker? how much RAM per dask worker?

Could you please confirm this info?

tinaok commented 1 year ago

Thank you @sebastian-luna-valero

elastic has ~~4GB~~ 8GB RAM with 2 threads on each dask worker
foss4g has 2GB RAM with 1 thread on each dask worker

The bench shows the dask-hub configuration pb. foss4g have older dask version + small dask worker configuration, thus cluster for cliver needs to be re-created or foss4g needs update...

@guillaumeeb @j34ni is recreating or updating possible for you before the workshop starts?

tinaok commented 1 year ago

@sebastian-luna-valero

how many dask workers per user?

I benched using just one Jupyter lab. one user using

elastic 4 dask worker (4 8GB, 4 2 threads)
foss4g 26 dask worker (26 2GB, 26 1 threads)

guillaumeeb commented 1 year ago

Please remember to use the hpc.16core-64ram-ssd-ephem nodes (16 CPUs, 64.0 GB of RAM , 80.0 GB of HD) flavors for k8s nodes.

@sebastian-luna-valero I used smaller instances for my test of Elastic Kubernetes, I guess this is OK in this case?

elastic has 8GB RAM with 2 threads on each dask worker and dask distribute version 2022.09.01 foss4g has 2GB RAM with 1 thread on each dask worker and dask version 2022.07

@tinaok Okay, this was not intended. We added those lines at one point on the Helm chart values, but I though this was hard limits, and that by default, the values in the configuration options would be used. But it seems that once backend.worker section is set-up in the Yaml file, options values are ignored, even if changing them in Python code.

I will just remove these changes in pangeo-elastic so that we entirely rely on the options part. This way, the two deployments will have a default of 2GiB, but you can pass options and go up to 8GiB in both (and two threads). This is done as follow:

cluster = gateway.new_cluster(worker_memory=8, worker_cores=2)

We can also put a bigger default, like 4GiB RAM per worker.

According to @keewis dask had big update recently, and probably that plays, and I think dask worker with just 2GB RAM is small for heavy duty.

You can already try the above code on pangeo-foss4g instance, it should work and give you the same workers as on pangeo-elastic. This way you'll be able to know if only the memory limits played a role in the failure.

The foss4g cluster can be updated to recent Pangeo-notebook docker image but keeping all the all data on it? (I mean data on NFS) @guillaumeeb @j34ni is recreating or updating possible for you before the workshop starts?

Yes and yes. Updating default Docker image and changing default memory/threads per worker (and limits) can be done with no disruption in a few minutes. I just need to be sure @j34ni does not try to change deployment name and host name at the same time. We also need to agree on correct values, in

        c.Backend.cluster_options = Options(
          Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
          Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
          String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
          handler=options_handler,

What should be default, min, max for worker_cores and worker_memory?

guillaumeeb commented 1 year ago

Just took the chance, right now I changed the settings on both infrastructure.

On pangeo-elastic, I removed the backend.worker part and set:

        c.Backend.cluster_options = Options(
          Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
          Float("worker_memory", default=4, min=2, max=12, label="Worker Memory (GiB)"),
          String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
          handler=options_handler,
        )

This solve the problem of not being able to specify a custom value.

On pangeo-foss4g:

        c.Backend.cluster_options = Options( 
          Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
          Float("worker_memory", default=4, min=2, max=16, label="Worker Memory (GiB)"),
          String("image", default="pangeo/pangeo-notebook:2022.08.24", label="Image"),
          handler=options_handler,
        )

I used a bit older pangeo-notebook image version to avoid the dask-gateway display widget bug. And slightly higher max value for per worker memory limit.

tinaok commented 1 year ago

@guillaumeeb Thanks a lot for the update, the min max value of pangeo-foss4g is exactly what I would like to have ;-) I'll run the notebook and gets back to you

tinaok commented 1 year ago

I have problem with dask cluster output on both foss4g and elastic.
dashboard link is working for both foss4g and elastic

guillaumeeb commented 1 year ago

I guess the pangeo-notebook image I tagged is not old enough. We can try with an older one if you want.

j34ni commented 1 year ago

@annefou @tinaok @guillaumeeb

The issue with the IM dashboard have not been resolved yet, so I cannot make any change to the pangeo-foss4g infrastructure (i.e., adding resources) However it shows as "running" (orange) and nodes 0-9 are "green" (configured) It also seems that it still works and is accessible, and I can see that @guillaumeeb has already updated the values

Is it now OK to change name from pangeo-foss4g to pangeo-clivar?

j34ni commented 1 year ago

@tinaok @annefou @guillaumeeb

I guess the pangeo-notebook image I tagged is not old enough. We can try with an older one if you want.

I changed for (pangeo-foss4g only, since I do not have access to the other infrastructures that @guillaumeeb has set up) to pangeo/pangeo-notebook tag: 2022.08.19 (which is very likely the version we had at FOSS4G) and the error disappeared

j34ni commented 1 year ago

@tinaok @annefou @guillaumeeb

I also did a bit of manual cleaning among the pods left running, there was quite a few of them and we ought to be careful about the resources available

As a reminder, on the machine this requires to find the name of the dask-scheduler(s) and then issue a delete command:


sudo kubectl get pods -n daskhub
sudo kubectl -n daskhub delete pod dask-scheduler-553417911a284304a2df5d9789b56f2c

That will also delete the related dask-worker(s)

If the pod indefinitely remains in the terminating state add --grace-period=0 --force

guillaumeeb commented 1 year ago

Is it now OK to change name from pangeo-foss4g to pangeo-clivar?

@j34ni didn't we talk about pangeo-egi? Appart that the name, it's OK on my side.

I changed to pangeo/pangeo-notebook tag: 2022.08.19 (which is very likely the version we had at FOSS4G) and the error disappeared

:+1:

tinaok commented 1 year ago

Thank you guillaume and jean, I checked at https://us-central1-b.gcp.pangeo.io/ and I have same err FF6CD031-0BBB-48AB-BD2F-9DDA0A1B555E

I’ll come back with the full test with same cmip6 notebook https://github.com/pangeo-gallery/cmip6/blob/master/ECS_Gregory_method.ipynb on elastic and ex-foss4g( @j34ni @guillaumeeb i confirm , we agreed the name as pangeo-egi) to decide if we keep the esthetic err but go for dask 2022.09 for performance or we stay as what we have in ex-foss4g

pangeo-data / pangeo-eosc

Optimise resource #21