pangeo-data / pangeo-eosc

Pangeo for the European Open Science cloud
https://pangeo-data.github.io/pangeo-eosc/
MIT License

Optimise resource #21

Closed tinaok closed 1 year ago

tinaok commented 1 year ago

We have a limited allocation of vCPUs. When we create clusters, we consume that allocation even when we are not actually using them.
As long as we are using a 'non-elastic' Kubernetes cluster, we need to control this manually.

I just logged on to the OpenStack dashboard: we have no tutorial sessions going on, yet we are using 240 vCPUs right now.

I tried to shut them down from the IM Dashboard, but I cannot find any cluster in my interface.

How can I shut them down?
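Since the IM Dashboard does not show the clusters, one way to see what is holding the quota is to query OpenStack directly. The sketch below uses the third-party openstacksdk library, which is an assumption (the thread does not say which tooling is available), and the `vcpus_in_use` helper is hypothetical:

```python
# Hypothetical helper using openstacksdk to sum the vCPUs held by the
# project's servers, before deciding what to delete.

def vcpus_in_use(conn):
    """Sum vCPUs over all servers visible to this OpenStack project."""
    total = 0
    for server in conn.compute.servers():
        # older compute API microversions return the flavor as {"id": ...}
        flavor = conn.compute.get_flavor(server.flavor["id"])
        total += flavor.vcpus
    return total

# Usage (needs openstacksdk and a clouds.yaml entry; "egi" is a made-up name):
#   import openstack
#   conn = openstack.connect(cloud="egi")
#   print(vcpus_in_use(conn))
```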

j34ni commented 1 year ago

@guillaumeeb @annefou @tinaok

The switch from pangeo-foss4g to pangeo-clivar is done.

j34ni commented 1 year ago
[Screenshot from 2022-10-08 at 10:07:19]
j34ni commented 1 year ago

> we agreed the name as pangeo-egi

I was under the impression that pangeo-eosc would eventually become the "permanent" name and that for this CLIVAR workshop something like pangeo-clivar was more suited than pangeo-egi (and hence took the liberty to rename it)

However, we can always change the name to whatever you want

annefou commented 1 year ago

Should we give the new address https://pangeo-clivar.vm.fedcloud.eu/jupyterhub/hub/home to the workshop attendees & mentors? or do you plan to make additional changes?

j34ni commented 1 year ago

Yes, I think that you can communicate this address

The only "change" that could be made (in terms of infrastructure) will be to add nodes as soon as the blocked IP address problem has been resolved, but this is not something we can fix ourselves so it may not happen before the workshop

sebastian-luna-valero commented 1 year ago

> @sebastian-luna-valero I used smaller instances for my test of Elastic Kubernetes, I guess this is OK in this case?

It's ok for tests, but as I mentioned over email we need to think carefully about the VM flavor if we want the workshop to go smoothly. Please see this spreadsheet and maybe let's discuss on a separate issue.

This way, the two deployments will have a default of 2 GiB, but you can pass options and go up to 8 GiB in both (and two threads).

I would very much be in favor of fixing the amount of vCPUs/RAM per dask worker so that we have a predictable amount of capacity and avoid capacity problems.
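The appeal of a fixed worker size is that capacity becomes a simple division. The quota and flavor numbers below are illustrative assumptions, apart from the 240 vCPUs and the 8 GiB / 2-thread ceiling mentioned in this thread:

```python
def max_workers(total_vcpus, total_ram_gib, worker_cores, worker_ram_gib):
    """How many fixed-size dask workers fit inside the project quota."""
    by_cpu = total_vcpus // worker_cores
    by_ram = total_ram_gib // worker_ram_gib
    return min(by_cpu, by_ram)

# e.g. the 240 vCPU quota from this thread, an assumed 480 GiB of RAM,
# and 2-core / 8 GiB workers:
print(max_workers(240, 480, 2, 8))  # -> 60, RAM being the binding limit
```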

tinaok commented 1 year ago

@sebastian-luna-valero

> This way, the two deployments will have a default of 2 GiB, but you can pass options and go up to 8 GiB in both (and two threads). I would be very much in favor of fixing the amount of vCPUs/RAM per dask worker so we have a predictable amount of capacity, to avoid capacity problems.

Some computations need more memory per dask worker than a fixed memory-to-thread ratio would allow. I think it is preferable to keep this kind of flexibility for optimising resources?
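One way to keep that flexibility while still respecting the 8 GiB / 2-thread ceiling from the previous comment is to derive the gateway options from the memory each thread needs. This helper is purely hypothetical, a sketch of the trade-off rather than anything deployed:

```python
def worker_options(mem_per_thread_gib, max_memory_gib=8, max_cores=2):
    """Pick dask-gateway worker options for a memory-per-thread requirement."""
    cores = max_cores
    # drop threads until each remaining thread has enough memory
    while cores > 1 and cores * mem_per_thread_gib > max_memory_gib:
        cores -= 1
    memory = min(max_memory_gib, cores * mem_per_thread_gib)
    return {"worker_cores": cores, "worker_memory": memory}

print(worker_options(4))  # -> {'worker_cores': 2, 'worker_memory': 8}
print(worker_options(6))  # -> {'worker_cores': 1, 'worker_memory': 6}
```

A memory-hungry computation trades threads for headroom instead of being rejected outright, which is the flexibility argued for above.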

sebastian-luna-valero commented 1 year ago

> Some computations need more memory per dask worker than a fixed memory-to-thread ratio would allow. I think it is preferable to keep this kind of flexibility for optimising resources?

I understand, thanks!

As long as we allocate enough capacity for the maximum amount of requested resources, we should be ok.

Let's continue the discussion in https://github.com/pangeo-data/pangeo-eosc/issues/34

tinaok commented 1 year ago

@j34ni @guillaumeeb

I'll come back with a full test of the same CMIP6 notebook https://github.com/pangeo-gallery/cmip6/blob/master/ECS_Gregory_method.ipynb on both elastic and ex-foss4g (@j34ni @guillaumeeb I confirm, we agreed the name as pangeo-egi), to decide whether we accept the aesthetic error but go for dask 2022.09 for performance, or stay with what we have in ex-foss4g.

We can stay with the current version of clivar (ex-foss4g). I used 4 workers, each created with `cluster = gateway.new_cluster(worker_memory=8, worker_cores=2)`, and with the clivar configuration the benchmark no longer fails. So I conclude the problem was that the worker memory size (2 GiB) was too small for this computation.

I'll continue running the notebooks at clivar-2022/tutorial/examples/notebooks/ on the clivar infrastructure, and if there are any other anomalies I'll get back to you.
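For completeness, a sketch of the dask-gateway lifecycle implied by the benchmark above, including the shutdown step that releases the vCPUs this issue is about. It assumes an already-configured `dask_gateway.Gateway`; only the `worker_memory=8, worker_cores=2` options come from the thread, the function names are made up:

```python
def cluster_kwargs(worker_memory_gib=8, worker_cores=2):
    """Worker options reported above as working for the CMIP6 benchmark."""
    return {"worker_memory": worker_memory_gib, "worker_cores": worker_cores}

def run_benchmark(gateway, n_workers=4):
    """gateway: a dask_gateway.Gateway connected to the deployment (assumed)."""
    cluster = gateway.new_cluster(**cluster_kwargs())
    cluster.scale(n_workers)
    try:
        client = cluster.get_client()
        # ... run the notebook computation with `client` here ...
    finally:
        # On a non-elastic Kubernetes cluster the nodes stay allocated,
        # so shutting the dask cluster down is what frees the quota.
        cluster.shutdown()
```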

tinaok commented 1 year ago

Hi @sebastian-luna-valero @j34ni @guillaumeeb

The tutorial sessions at the CLIVAR bootcamp are all finished!! Thank you very much for all your work!! We will start using the infrastructure for the working groups from tomorrow.

I would like to check: if we need to add more nodes, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu and create a new one?

I am starting to get feedback from attendees about the size of the data we need, and about CMIP6 data that is missing from Google Cloud. I was wondering whether some of the missing data is already on some EOSC cloud (if possible on CESNET?). Here is the hackmd: https://hackmd.io/@pangeo/clivar-2022 (I propose a new thread about the 'configuration of the clivar-bootcamp working group infrastructure on the EOSC cloud').

sebastian-luna-valero commented 1 year ago

Closing as obsolete.