Closed — annefou closed this 1 year ago
> Clivar is ending very soon and we have another course coming.
Great to know, thanks!
> do we keep the CLIVAR Jupyterhub and reuse it for this eScience course?

I discussed with CESNET and after 21st Oct we need to scale down the CLIVAR JupyterHub, removing first the worker nodes with the `hpc.16core-64ram-ssd-ephem` flavor. Specifically, CESNET would like to have back 15 out of the 20 `hpc.16core-64ram-ssd-ephem` nodes that we are currently using, as they are being requested by other research groups. If @j34ni and @guillaumeeb struggle to identify which worker nodes to remove from the cluster, please contact me.
> will we have a new (elastic) infra?
Tests are still ongoing between @guillaumeeb, Miguel and myself. I would say that manual scaling is still preferred.
> Very similar needs e.g. 17 students + 10 mentors

According to what I see in Grafana, we should be covered after removing 15 VMs of the `hpc.16core-64ram-ssd-ephem` flavor for this upcoming workshop. We could add more `elixir.16core-64ram` nodes instead, but since we will have fewer users, we will check as we go.
Additionally, please remember to submit an application to https://c-scale.eu/call-for-use-cases/ to gain access to resources in addition to what you have in EGI-ACE, so we could host multiple JupyterHub/DaskHub instances at the same time. Actually, I would be grateful if you could spread the link to the C-SCALE call through your networks (i.e. the European Pangeo community) so others can also benefit from these resources.
> do we keep the CLIVAR Jupyterhub and reuse it for this eScience course?

> I discussed with CESNET and after 21st Oct we need to scale down the CLIVAR JupyterHub, removing first the worker nodes with the hpc.16core-64ram-ssd-ephem flavors. Specifically, CESNET would like to have back 15 out of the 20 hpc.16core-64ram-ssd-ephem nodes that we are currently using, as they are being requested by other research groups. If @j34ni and @guillaumeeb struggle to identify which worker nodes to remove from the cluster, please contact me.

I would suggest deploying a fresh `pangeo-eosc` or `pangeo-egi` infrastructure, even if elastic scaling is still not completely working. I assume we can stop the clues2 service to avoid auto-scaling issues if needed? But continuing with the other infrastructure is fine too.
Any volunteers for teaching (online) Dask and/or kerchunk? (I understood the dates are flexible, but it should probably happen on 1st or 2nd November.)
Sorry, I won't be available for that.
I started to look into building a new infrastructure, similar to pangeo-clivar but using only the `elixir.16core-64ram` flavor. That works, but I found the deployment process much slower than with `hpc.16core-64ram-ssd-ephem`, so there are performance differences.
@sebastian-luna-valero: I would have liked to also try the `hpc.30core-64ram` flavor, but the only possible values for the Number of CPUs in the IM Dashboard are 2, 4, 8, 16, 32 and 64: is it possible to include 30 cores?
@tinaok @annefou: the possible Size of the disk to be attached is limited to 2TB; is that sufficient for you?
Thanks for starting the creation of the new jupyterhub for the eScience course. On my side, I have created (duplicated from clivar workshop) https://github.com/pangeo-data/escience-2022
I just asked the course organisers about the storage. I see that for the clivar bootcamp they had 1TB and it is not full, so I guess 2TB is OK.
They may also be able to set up a MinIO server for reading some data from their own infrastructure (let's see if that works out).
Thanks.
> Any volunteers for teaching (online) Dask and/or kerchunk? (I understood they are flexible but should probably happen 1st or 2nd November).

Sorry, I won't be available for that either.
> do we keep the CLIVAR Jupyterhub and reuse it for this eScience course or will we have a new (elastic) infra?

Is it possible to separate the object storage from the CLIVAR usage?

The CLIVAR bootcamp has finished, but the working groups are continuing their work, and they are saving zarr/netcdf files in the Pangeo storage at CESNET. If the training is a one-shot event and its data can be deleted afterwards, it may be better to use a one-shot MinIO disk storage for this eScience course? Then it is safe, and the CLIVAR users who continue working do not risk having their files deleted.

For computing resources, maybe I can set up a meeting with each working group to understand when they will work intensively, so that we can adjust the resources manually in advance? (And thank you Jean for creating a separate instance!)
You are right, we need to make the storage separate. For the eScience course, they will work until the end of November. I guess we also want to onboard them more "permanently", i.e. we would like to provide a more long-term solution, but that is probably only viable once the elastic part is in place.
> @sebastian-luna-valero: I would have liked to try also the hpc.30core-64ram flavor but the only possible values for the Number of CPUs in the IM Dashboard are 2, 4, 8, 16, 32 and 64: is it possible to include 30 cores?

We could request the 30 vCPU option to be added, but the `hpc.30core-64ram` flavor has ~2 GB RAM per vCPU core, instead of 4 GB per vCPU core in `elixir.16core-64ram`. I think we agreed previously that the 4:1 ratio is better than 2:1.
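To make the ratio explicit, here is a quick back-of-the-envelope check (pure Python; the core and RAM numbers are simply read off the flavor names):

```python
# RAM per vCPU core for the two flavors discussed above
# (cores and total RAM in GB are read off the flavor names).
flavors = {
    "hpc.30core-64ram": (30, 64),
    "elixir.16core-64ram": (16, 64),
}

for name, (cores, ram_gb) in flavors.items():
    print(f"{name}: {ram_gb / cores:.1f} GB RAM per vCPU core")
```

`elixir.16core-64ram` gives 4.0 GB per core while `hpc.30core-64ram` gives only about 2.1 GB, so memory-hungry Dask workers would hit their limits sooner on the 30-core flavor.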
> That works, but I found the deployment process really much slower than with the hpc.16core-64ram-ssd-ephem, so there are performance differences

For me the keyword here is *deployment*. Have you also run notebooks inside the new cluster to check the performance of these nodes after deployment? This would need to be compared with the performance of the same notebook in the clivar deployment.
> Is it possible to separate the object storage with the usage of CLIVAR?
Object storage is detached from JupyterHub deployments. NFS storage is attached to JupyterHub deployments. As long as users are using object storage, we should be fine.
> we would like to provide a more long-term solution but it is probably only viable once the elastic part is in place.
Remember that until we get automatic elasticity in place, manually scaling the cluster up and down is possible.
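For reference, manual scaling can be done from the IM Dashboard or, at a lower level, with the plain OpenStack CLI. A sketch (the server name is a placeholder; note that deleting VMs behind the IM Dashboard's back can confuse it, so the Dashboard route is safer):

```shell
# List worker VMs of the flavor CESNET wants back
openstack server list --flavor hpc.16core-64ram-ssd-ephem

# After draining a worker, delete it by name or ID (placeholder name)
openstack server delete pangeo-clivar-wn-12
```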
@sebastian-luna-valero
> Is it possible to separate the object storage with the usage of CLIVAR?

> Object storage is detached from JupyterHub deployments. NFS storage is attached to JupyterHub deployments. As long as users are using object storage, we should be fine.

Sorry, I was not clear enough. I was hoping that, for now (until #17 is resolved), we do not give the eScience course students access to vo.pangeo.eu in aai.egi.eu for S3 storage, because they could then write into any of the vo.pangeo.eu-swift disk space. There could be some unfortunate deletes/overwrites by eScience course students (or vice versa). That is why I thought it may be better to separate the S3 access (if Anne plans to give write access to S3 storage), e.g. by using an external MinIO server or similar (which does not stay around for long).
I see, thanks!
If other object storage is not available, and until #17 is solved, we could also look into deploying our own MinIO. However, there is the extra effort required to deploy it and keep it operational, and I am not sure whether I will have the time. What about others?
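For the scale of one course, a throwaway single-node MinIO is quick to stand up; a minimal Docker sketch (credentials and host path are placeholders):

```shell
# Throwaway single-node MinIO for the course (placeholder credentials/path)
docker run -d --name escience-minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=escience-admin \
  -e MINIO_ROOT_PASSWORD=change-me-please \
  -v /data/minio:/data \
  quay.io/minio/minio server /data --console-address ":9001"
```

The S3 API is then on port 9000 and the web console on 9001; the operational burden mentioned above (TLS, backups, user management) still applies.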
@sebastian-luna-valero @tinaok @annefou
I did a quick test with `dask_introduction.ipynb` and the computation times are comparable between pangeo-eosc and pangeo-clivar, although I am not certain that the latter used `hpc.16core-64ram-ssd-ephem` and not the same `elixir.16core-64ram` (which would explain the similarity)...

Downloads were a lot slower however, but that could be related to the network?!

If there are no other dramatic losses in performance, I guess that it should be fine for the eScience course?
This eosc infrastructure now has 16 worker nodes (WNs) and the same values as clivar, except for the amount of memory, which is increased.
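For a more controlled comparison than eyeballing notebook runs, the same cell can be timed on both clusters with a small harness (pure Python sketch; the lambda is a toy stand-in for the actual notebook computation):

```python
import time

def bench(fn, repeats=3):
    """Run fn several times; return the best wall-clock time and the last result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - t0)
    return best, result

# Toy stand-in for one notebook cell's computation; run the same cell
# on pangeo-eosc and pangeo-clivar and compare the printed times.
elapsed, total = bench(lambda: sum(i * i for i in range(10**6)))
print(f"best of 3 runs: {elapsed:.3f}s (result {total})")
```

Taking the best of several repeats reduces noise from caching and scheduler warm-up, which matters when the two clusters differ mainly in disk type.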
> the computation times are comparable between pangeo-eosc and pangeo-clivar

great!

> Downloads were a lot slower however, but that could be related to the network?!

Indeed, I would say so.

> This eosc infrastructure now has 16 WNs and the same values as clivar except for the amount of memory which is increased
Looking at OpenStack I see:

- 31 VMs for pangeo-clivar, out of which 20 are `hpc.16core-64ram-ssd-ephem`. So the worker nodes deleted from pangeo-clivar have all been of the flavor `elixir.16core-64ram`. Please remember that instead we need to delete worker nodes with the flavor `hpc.16core-64ram-ssd-ephem` and replace them with `elixir.16core-64ram`.
- 25 VMs for pangeo-eosc, all of them with the flavor `elixir.16core-64ram`. Great!

@j34ni please replace `hpc.16core-64ram-ssd-ephem` with `elixir.16core-64ram`. Happy to help if you need me!
> I just ask the course organisers for the storage. I see that for the clivar bootcamp, they had 1TB and it is not full. I guess 2TB is OK.

As this storage is only used for home directories, I don't think we need that much. Are you planning to put huge scientific datasets there?
> If other object storage is not available, and until https://github.com/pangeo-data/pangeo-eosc/issues/17 is solved, we could also look into deploying our own MinIO. However, there is the extra effort required to deploy and maintain this operational, and I am not sure whether I will have the time. What about others?

I won't have the time either. Would it be feasible to create yet another OpenStack project for the eScience workshop to host data containers there? We could reduce the object store quotas on both projects if needed.
> I started to look into building a new infrastructure,

@j34ni, I see that the jupyterhub is available at https://pangeo-eosc.vm.fedcloud.eu/jupyterhub/, so I guess you did not use the latest configuration provided here: https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md? This is not crucial, but the setup is a bit simplified in that documentation.
@sebastian-luna-valero @tinaok @annefou
Should we fiddle with pangeo-clivar now, as it is being used, or should we wait before starting to remove the `hpc.16core-64ram-ssd-ephem` VMs (and then replace them with `elixir.16core-64ram`, or not)?
@guillaumeeb Sorry, I missed that and simply reused the current clivar_values.yaml for pangeo-eosc.
> Should we fiddle with pangeo-clivar now, as it is being used, or should we wait before starting to remove the hpc.16core-64ram-ssd-ephem VMs (and then replace them by elixir.16core-64ram or not)?

The pangeo-clivar cluster went down from 49 VMs to 31 VMs already; have you noticed any disruption?

We agreed with CESNET to keep the `hpc.16core-64ram-ssd-ephem` nodes until today EOB, but we would need to give them back as soon as possible.

Again, I am here to help if needed.
> I won't have the time either. Would it be feasible to create yet another OpenStack project for the eScience workshop to host data containers there? We could reduce the object store quotas on both projects if needed.
Currently the quotas are 10TB for each project. Please confirm the new quota value and I will double check with CESNET.
@sebastian-luna-valero
> The pangeo-clivar cluster went down from 49 VMs to 31 VMs already, have you noticed any disruption?

I am not sure these 49 − 31 = 18 VMs were configured/used at all in the infrastructure; they never showed up in the list of VMs in the IM Dashboard anyway, so I deleted them manually in OpenStack.

If I start to remove VMs from the infrastructure while they are in use, the affected users will not be very happy.

@tinaok When would be a good time to do that?
> I won't have the time either. Would it be feasible to create yet another OpenStack project for the eScience workshop to host data containers there? We could reduce the object store quotas on both projects if needed.

> Currently the quotas are 10TB for each project. Please confirm the new quota value and I will double check with CESNET.

The quota was just a proposition to see if we could create another OpenStack project with an object storage space that has a different access policy. Imagine a pangeo-escience OpenStack project; maybe we would also need to create another user group on Check-in?
> If I start to remove VMs from the infrastructure while they are in use, the affected users will not be very happy

According to Grafana, the cluster is quiet now, and it's Friday afternoon; I would say it's a good time to reconfigure the cluster.
> The quota was just a proposition to see if we could create another OpenStack project with an object storage space that has a different access policy. Imagine a pangeo-escience OpenStack project; maybe we would also need to create another user group on Check-in?
Sure, we can create a new group in Check-in dedicated to the new object store. Please note that this implies that every time a new user requests to enroll in the VO for the eScience Course, the VO managers would have to manually add them to the new group. If that's not a problem for you, we can do it.
Anyway, we would need to update `OS_PROJECT_ID` in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI-CLI-Swift-S3.md, so we could also create a dedicated page for the eScience Course with the new project ID, and users will simply use that.
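For reference, pointing S3 clients at a new project mostly means regenerating EC2-style credentials under the new project scope. A sketch, where the project ID placeholder and the endpoint URL are assumptions to be checked against the EGI-CLI-Swift-S3 page:

```shell
# Scope the OpenStack CLI to the new project (placeholder ID)
export OS_PROJECT_ID="<pangeo-escience-project-id>"

# Create EC2-style (access key / secret key) credentials for S3 clients
openstack ec2 credentials create

# Sanity check against the object store endpoint (assumed CESNET URL)
aws s3 ls --endpoint-url https://object-store.cloud.muni.cz
```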
@sebastian-luna-valero
> According to Grafana, the cluster is quiet now, and it's Friday afternoon; I would say it's a good time to reconfigure the cluster.
OK, I'll give it a go
@sebastian-luna-valero
Now I see a lot more VMs for pangeo-clivar on the IM Dashboard than when I refreshed a few minutes ago, and these 18 "ghosts" are suddenly back. What happened?
I am sorry, but I can't check since I don't have access to the pangeo-{clivar,eosc} clusters from my IM Dashboard profile. As a last resort, I can send you my details so you can add me as an owner to check further.
@sebastian-luna-valero
Please do send me your credentials. I started by deleting the oldest HPC VMs and the IM Dashboard does not like it.
Will do. By the way:
@sebastian-luna-valero
they are both on the dev
> The quota was just a proposition to see if we could create another OpenStack project with an object storage space that has a different access policy. Imagine a pangeo-escience OpenStack project; maybe we would also need to create another user group on Check-in?

CESNET is happy to create another OpenStack project. There are two options.

Option 1: grant read/write access on the new OpenStack project only to members of the vo.pangeo.eu in aai.egi.eu/escience group:

| Virtual Organisation | Create/destroy VMs | "vo.pangeo.eu" public bucket | "vo.pangeo.eu" private bucket | "vo.pangeo.eu-swift" public bucket | "vo.pangeo.eu-swift" private bucket | "vo.pangeo.eu-escience" public bucket | "vo.pangeo.eu-escience" private bucket |
|---|---|---|---|---|---|---|---|
| member of vo.pangeo.eu in aai.egi.eu/pangeo.admins | yes | read/write | read/write | read/write | read/write | read-only | no access |
| member of vo.pangeo.eu in aai.egi.eu | no | read-only | no access | read/write | read/write | read-only | no access |
| member of vo.pangeo.eu in aai.egi.eu/escience | no | read-only | no access | read-only | no access | read/write | read/write |
| member of vo.pangeo.eu in aai-dev.egi.eu | no | read-only | no access | read-only | no access | read-only | no access |
| None | no | read-only | no access | read-only | no access | read-only | no access |

Option 2: grant read/write access on the new OpenStack project to all members of vo.pangeo.eu:

| Virtual Organisation | Create/destroy VMs | "vo.pangeo.eu" public bucket | "vo.pangeo.eu" private bucket | "vo.pangeo.eu-swift" public bucket | "vo.pangeo.eu-swift" private bucket | "vo.pangeo.eu-escience" public bucket | "vo.pangeo.eu-escience" private bucket |
|---|---|---|---|---|---|---|---|
| member of vo.pangeo.eu in aai.egi.eu/pangeo.admins | yes | read/write | read/write | read/write | read/write | read/write | read/write |
| member of vo.pangeo.eu in aai.egi.eu | no | read-only | no access | read/write | read/write | read/write | read/write |
| member of vo.pangeo.eu in aai-dev.egi.eu | no | read-only | no access | read-only | no access | read-only | no access |
| None | no | read-only | no access | read-only | no access | read-only | no access |
Please let me know your thoughts.
I think option 1 is best. Thank you!
Thanks, please review https://github.com/pangeo-data/pangeo-eosc/pull/44
Closing as obsolete.
Clivar is ending very soon and we have another course coming. Very similar needs e.g. 17 students + 10 mentors (see https://www.aces.su.se/research/projects/escience-tools-in-climate-science-linking-observations-with-modelling/). I will create a repo (same as for Clivar); the organisers would also be happy if someone could deliver some of the training (mostly focusing on Dask + kerchunk, because the rest is covered in-house).