pangeo-data / pangeo-eosc

Pangeo for the European Open Science cloud
https://pangeo-data.github.io/pangeo-eosc/
MIT License

CLIVAR bootcamp configuration for working group EOSC cloud #38

Closed tinaok closed 1 year ago

tinaok commented 1 year ago

Following #21: now that the tutorial session at the CLIVAR bootcamp has finished, I would like to create a separate issue on the EOSC configuration for the CLIVAR bootcamp working sessions.

We are now forming about 6-7 working groups, and everyone will run JupyterLab with Dask workers (we are about 30 users, including students and mentors). We have https://hackmd.io/@pangeo/clivar-2022 so that each working group can declare the dataset and size they need, and we will try to estimate the RAM and cores needed to run their computations.
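For reference, here is a back-of-envelope sizing sketch we can use when translating the hackmd entries into RAM and core counts; the grid and variable below are purely illustrative, not an actual working-group request.

```python
# Rough sizing helper: estimate the in-memory footprint of one variable.
# Monthly data, 165 years, 1-degree grid, float32 -- illustrative numbers only.
import numpy as np

nt, ny, nx = 165 * 12, 180, 360
size_gb = nt * ny * nx * np.dtype("float32").itemsize / 1e9
print(f"~{size_gb:.2f} GB per 2D monthly variable")  # ~0.51 GB

# With elixir.16core-64ram nodes (~4 GB of RAM per core), a handful of such
# variables per group fits comfortably; multi-model ensembles scale linearly.
```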

Q: To add more cores and RAM, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu/ deployment?

Q: The attendees have asked us how long they can use this infrastructure. @sebastian-luna-valero?

Q: If I understood correctly, @sebastian-luna-valero, you prefer that we use nodes without SSD so that we can have a bigger infrastructure?

Q: We are missing some packages, like xmip. @guillaumeeb, is it possible to add them?

Thank you everyone!

guillaumeeb commented 1 year ago

To add more cores and RAM, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu/ deployment?

I don't think we need to, especially with the newly available flavor. And I don't think we want to migrate to another infrastructure now.

We are missing some packages, like xmip. @guillaumeeb, is it possible to add them?

If you want the new packages to be there by default (for the Jupyter notebook and on every worker), we need to build a new Docker image extending pangeo-notebook, as in https://github.com/guillaumeeb/pangeo-docker, and then deploy it on the infrastructure. This will take a bit of time for building and testing, and we need to decide where to push the new image (which Dockerhub org or profile). If it is just a few packages, the easiest option is to pip install them and use the Dask PipInstall plugin.
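For the latter route, a minimal sketch is below; it assumes a dask.distributed Client named client that is already connected to the Gateway cluster, and the package list is only an example.

```python
# Sketch: install extra packages on every Dask worker at startup, instead of
# rebuilding the notebook image. Assumes `client` is an already-connected
# dask.distributed Client obtained from the Dask Gateway cluster.
from distributed.diagnostics.plugin import PipInstall

plugin = PipInstall(packages=["xmip"], restart=True)
client.register_worker_plugin(plugin)

# Note: this only installs the packages on the workers; the notebook
# environment itself still needs a regular `pip install` for client-side use.
```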

sebastian-luna-valero commented 1 year ago

Q: To add more cores and RAM, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu/ deployment?

I agree with @guillaumeeb, I think we can simply grow the existing cluster.

Q: The attendees have asked us how long they can use this infrastructure. @sebastian-luna-valero?

I will check with CESNET. @tinaok, would you need the cluster to stay as big as possible even after the training has finished, or can we shrink it after the 21st Oct?

Q: If I understood correctly, @sebastian-luna-valero, you prefer that we use nodes without SSD so that we can have a bigger infrastructure?

Yes, we need to stick to the elixir.16core-64ram flavor. Please add as many nodes as you can to serve the training activities.

Again, according to what I see in OpenStack, it looks like this can easily be done by adding new nodes to the cluster via the IM Dashboard.

sebastian-luna-valero commented 1 year ago

From #21 I see this additional question:

I am starting to get information back from attendees about the size of the data we need, and about CMIP6 data which are missing from Google Cloud. I was wondering if some of the missing data are already in some EOSC cloud (if possible at CESNET?).

I would like to ping my colleague @bschumac from https://c-scale.eu/.

@bschumac could you please check if the data requirements in https://hackmd.io/@pangeo/clivar-2022 can be provided by C-SCALE?

More context: I was planning to invite you to submit an application via https://c-scale.eu/call-for-use-cases/ to complement the resources offered by EGI-ACE. However, this month I wanted to focus on the CLIVAR workshop. Should we schedule a meeting by the end of October to discuss further?

tinaok commented 1 year ago

Q: To add more cores and RAM, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu/ deployment?

I agree with @guillaumeeb, I think we can simply grow the existing cluster.

Thank you @sebastian-luna-valero and @guillaumeeb, it is clear to me! So, let's slowly increase it?

tinaok commented 1 year ago

Q: The attendees have asked us how long they can use this infrastructure. @sebastian-luna-valero?

I will check with CESNET. @tinaok, would you need the cluster to stay as big as possible even after the training has finished, or can we shrink it after the 21st Oct?

Yes, we can shrink it, but some working groups will continue to work on it until they have enough results.
Some might take longer than expected if the CMIP data they are looking for are not accessible from the cloud. For those groups, when they need to run large-scale analyses, we might ask you to extend the cluster for the computation.

tinaok commented 1 year ago

From #21 I see this additional question:

I am starting to get information back from attendees about the size of the data we need, and about CMIP6 data which are missing from Google Cloud. I was wondering if some of the missing data are already in some EOSC cloud (if possible at CESNET?).

I would like to ping my colleague @bschumac from https://c-scale.eu/.

@bschumac could you please check if the data requirements in https://hackmd.io/@pangeo/clivar-2022 can be provided by C-SCALE?

This is exciting! I just looked at https://c-scale.eu. Where can I find the catalogue of data available on this infrastructure?

More context: I was planning to invite you to submit an application via https://c-scale.eu/call-for-use-cases/ to complement the resources offered by EGI-ACE. However, this month I wanted to focus on the CLIVAR workshop. Should we schedule a meeting by the end of October to discuss further?

There is Pangeo-Forge, which is designed for cloud data stores: it can store CMIP6 data and publish a catalogue that users can easily access from their workflows (please refer to this notebook). Let's schedule a meeting!
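For context, accessing such a published catalogue from a workflow looks roughly like the sketch below, using the public Pangeo CMIP6 catalogue on Google Cloud as an illustration; the search keys are only an example, and argument names vary slightly between intake-esm versions.

```python
# Illustrative only: query the public Pangeo CMIP6 catalogue on Google Cloud
# and lazily open the matching Zarr stores (requires intake-esm and gcsfs).
import intake

cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
subset = cat.search(
    experiment_id="historical", table_id="Omon",
    variable_id="tos", source_id="IPSL-CM6A-LR",
)
dsets = subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
```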

guillaumeeb commented 1 year ago

So, let's slowly increase it?

@j34ni can probably try to expand the cluster more!

j34ni commented 1 year ago

I have already added 6 elixir.16core-64ram VMs to pangeo-clivar (one of them had been running for 2 days; I had not noticed that it was not an hpc.16core-64ram-ssd-ephem), and I am trying to add a few more. It is getting much slower now, with lots of timeouts, so we may have reached the limit of what is actually available to us?

sebastian-luna-valero commented 1 year ago

Yes, we can shrink it, but some working groups will continue to work on it until they have enough results.

Ok, I am checking with CESNET what's possible.

For those groups, when they need to run large-scale analyses, we might ask you to extend the cluster for the computation.

@guillaumeeb is troubleshooting the elastic Kubernetes deployment that will help in this case.

we may have reached the limit of what is actually available to us?

I think you would get the error "No valid host was found" in that case. What error message do you get?

This is exciting! I just looked at https://c-scale.eu/. Let's schedule a meeting!

On second thought, another (faster) option might be for you to submit the expression of interest directly at: https://c-scale.eu/call-for-use-cases/

That will trigger the internal procedure to onboard you as a C-SCALE use case, which implies having a kick-off meeting anyway after the use case is approved. So if you submit your expression of interest directly, we can just do one meeting instead of two!

j34ni commented 1 year ago

@sebastian-luna-valero

What error message do you get?

There is no error message: the VMs were created and are shown as running on OpenStack, and the IM Dashboard is all green, but none of these new VMs has been added to the cluster.

Maybe there is some hardcoded maximum number of nodes an infrastructure can have?

sebastian-luna-valero commented 1 year ago

@micafer could you please look at https://github.com/pangeo-data/pangeo-eosc/issues/38#issuecomment-1277436720 and https://github.com/pangeo-data/pangeo-eosc/issues/38#issuecomment-1277803296 and suggest ideas of what could be the problem?

sebastian-luna-valero commented 1 year ago

I am starting to get information back from attendees about the size of the data we need, and about CMIP6 data which are missing from Google Cloud. I was wondering if some of the missing data are already in some EOSC cloud (if possible at CESNET?).

I am sorry but the requested data is not available at CESNET or C-SCALE.

guillaumeeb commented 1 year ago

For those groups, when they need to run large-scale analyses, we might ask you to extend the cluster for the computation.

@guillaumeeb is troubleshooting the elastic Kubernetes deployment that will help in this case.

But if we want to go elastic, we'll need to redeploy the whole cluster, and I don't think this is something we can do while users are working on it. It would also mean losing everything stored in their home directories.

tinaok commented 1 year ago

I am sorry but the requested data is not available at CESNET or C-SCALE.

Thank you @sebastian-luna-valero for checking. If we want the data in EOSC, this means we'll need to do as you suggested:

submit the expression of interest directly at: https://c-scale.eu/call-for-use-cases/

Would they support downloading and transforming the dataset? The best option would be for us to use Pangeo-Forge, a tool that downloads the data and transforms it to ARCO format.
(And maybe we should open another issue to discuss the dataset-hosting infrastructure...)
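To make the request concrete, the transformation itself boils down to something like the sketch below; the source URL, chunk sizes and target bucket are hypothetical placeholders, and Pangeo-Forge automates and scales exactly this pattern.

```python
# Minimal sketch of the "download and transform to ARCO" step.
# All paths are placeholders; Pangeo-Forge orchestrates this at scale.
import fsspec
import xarray as xr

src = "https://example.org/cmip6/tos_Omon_1950-2014.nc"  # hypothetical source
dst = "s3://example-bucket/cmip6/tos_Omon.zarr"          # hypothetical target

# download (with local caching) and open with Dask-friendly chunks
local_path = fsspec.open_local(f"simplecache::{src}")
ds = xr.open_dataset(local_path, chunks={"time": 120})

# write an analysis-ready, cloud-optimised Zarr store to object storage
ds.to_zarr(fsspec.get_mapper(dst), mode="w", consolidated=True)
```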

tinaok commented 1 year ago

But if we want to go elastic, we'll need to redeploy the whole cluster, and I don't think this is something we can do while users are working on it. It would also mean losing everything stored in their home directories.

I agree with @guillaumeeb about that.
If we move to elastic, we would need two clusters (the current one and the elastic one) running at the same time, and I would need to tell users to push their code to GitHub, save their data to S3, and then move to the new cluster (and we would probably need to keep the old one running with a minimal configuration, for example by limiting Dask Gateway, for a while until users finish migrating).

Right now usage is only just starting, and I am at the bootcamp until Saturday morning (the bootcamp itself finishes next week), so it might be easy to move now...

From Sunday there will be lots of working sessions, so the cluster will be more active.

sebastian-luna-valero commented 1 year ago

Would they support downloading and transforming the dataset? The best option would be for us to use Pangeo-Forge, a tool that downloads the data and transforms it to ARCO format. (And maybe we should open another issue to discuss the dataset-hosting infrastructure...)

C-SCALE would support that; we just need the specific steps on how to do it. These steps/requirements should also be included in the expression of interest submitted via https://c-scale.eu/call-for-use-cases/

micafer commented 1 year ago

@sebastian-luna-valero

What error message do you get?

There is no error message: the VMs were created and are shown as running on OpenStack, and the IM Dashboard is all green, but none of these new VMs has been added to the cluster.

Maybe there is some hardcoded maximum number of nodes an infrastructure can have?

It is strange that the VMs are green in the Dashboard but have not been added to the cluster. We need to debug this.

marionalberty commented 1 year ago

We are missing some packages, like xmip. @guillaumeeb, is it possible to add them?

If you want the new packages to be there by default (for the Jupyter notebook and on every worker), we need to build a new Docker image extending pangeo-notebook, as in https://github.com/guillaumeeb/pangeo-docker, and then deploy it on the infrastructure. This will take a bit of time for building and testing, and we need to decide where to push the new image (which Dockerhub org or profile). If it is just a few packages, the easiest option is to pip install them and use the Dask PipInstall plugin.

Thanks for the response on this, @guillaumeeb. We just need a few packages (just xMIP and cmocean). Do you have more specific instructions for the Dask PipInstall plugin? I'm not familiar with this tool.

tinaok commented 1 year ago

@marionalberty You can try:

```python
# Install the extra packages on every worker; `client` is your existing
# dask.distributed Client connected to the Gateway cluster.
from distributed.diagnostics.plugin import PipInstall

extra_packages = ["xmip", "cmocean"]
plugin = PipInstall(extra_packages, restart=True)
client.register_worker_plugin(plugin)
```

You will find an example here.
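Note that with restart=True the workers are restarted once the packages are installed, so it is best to register the plugin before launching any long-running computation.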

sebastian-luna-valero commented 1 year ago

Closing as obsolete.