judithberner / climpred_CESM1_S2S

MIT License

The final workflow #3

Closed aaronspring closed 2 years ago

aaronspring commented 3 years ago

I would like us to state our vision of how the workflow with S2S / SubX data in the cloud should look in the end, i.e. by the summer.

Once SubX data is in the cloud, how will users work with them? This is about data access and compute.

Ryan outlines 4 ways of doing this for CMIP6 data here: https://medium.com/pangeo/cmip6-in-the-cloud-five-ways-96b177abe396

Way 2 uses cloud data, but the calculation happens somewhere else, e.g. on a personal laptop or a supercomputer, and both have to download the data from the cloud first. That takes time for large data and may incur costs.

Way 3 computes on cloud data in the cloud. No downloading needed. This would be the gold standard for the summer school IMO. It would be really interactive, and adding a new variable to the analysis would be easy and quick. Do you think you can make this possible, @judithberner (costs and top-level organisation)? https://2i2c.org/ is a new nonprofit enabling this, I guess (Ryan and Pangeo are involved).

Once way 3 works, getting to way 4 with dask is easy and already implemented in climpred.
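To put a rough number on the download step in way 2, here is a back-of-the-envelope sketch. The 200 GB size is the per-variable, per-model figure quoted later in this thread; the two link speeds are purely illustrative assumptions, and the function name `download_hours` is made up for this example.

```python
def download_hours(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Hours needed to transfer size_gb gigabytes at bandwidth_mbit_s megabits per second."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return size_megabits / bandwidth_mbit_s / 3600  # seconds -> hours


# Size per variable and model as quoted later in this thread.
SIZE_GB = 200.0

# Assumed link speeds, for illustration only.
for label, mbit_s in [("home connection", 50.0), ("campus link", 1000.0)]:
    print(f"{label}: {download_hours(SIZE_GB, mbit_s):.1f} h")
```

The estimate ignores protocol overhead and any egress charges, so real transfers would likely take longer than this.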

aaronspring commented 3 years ago

My example notebook showing what data access in the cloud could look like: https://nbviewer.jupyter.org/gist/aaronspring/65fdd121bc7353740063818c528e82eb

judithberner commented 3 years ago

That's a good question and the answer is: "I don't know". Most scientists in academia would use way 2, at least for now. This would also be the backup for the summer, plus we could have intermediate files (e.g. anomalies) on disk. For the summer school we could consider going down route 3 and have resources to ask to use e.g. https://2i2c.org/. However, we need to keep in mind this is not an HPC summer school. I'm not sure what top-level organization is needed. Can you summarize what way 4 is, please? Does this influence the workflow? If so, we should ask our IRI collaborators.

judithberner commented 3 years ago

I read up on 4. Just to be clear - we use dask on the supercomputer.

aaronspring commented 3 years ago

Would you get in touch with 2i2c?

If anyone wants to compute skill over the whole globe, downloading to a different machine would take ages...

Way 4 would use more resources: more than one node on a supercomputer, or at least several CPUs on one node in parallelised mode.
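The chunk-and-combine pattern behind way 4 can be sketched with the Python standard library alone. This is not dask and not how climpred does it; `concurrent.futures` is used here as a plain stand-in, and the function names and the made-up `field` values are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum_and_count(chunk):
    """Partial result for one chunk: (sum of values, number of values)."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size, workers=4):
    """Mean of `values`, computed chunk by chunk on a pool of workers."""
    # Split the data into consecutive chunks of at most chunk_size values.
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Hand each chunk to a worker; results come back in chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sum_and_count, chunks))
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Made-up values standing in for one variable on a grid.
field = [float(i % 10) for i in range(1000)]
print(parallel_mean(field, chunk_size=100))
```

Threads in CPython do not speed up pure-Python arithmetic, so this only shows the structure: split the data, compute partial results independently, then combine them. A scheduler such as dask applies the same idea across cores or nodes.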

judithberner commented 3 years ago

Frankly, I think 10 min of telecon would be helpful. We are in touch with 2i2c, but would need to better understand the cost (since we are using several dask servers). There are advantages to using the NCAR machines, e.g. we could precompute the anomaly fields and weekly averages, so the students need less pre-processing. Is this a separate issue, or will this influence the zarr workflow? If the latter, we should ask the wider community. For now, I would prefer an approach that allows for both.
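As a minimal sketch of the kind of pre-processing meant here, the following plain-Python example computes anomalies and weekly averages for a single time series. It is a simplified stand-in: the real fields would be gridded and depend on lead time, and in practice this would more likely be done with xarray. All names and values below are made up for illustration.

```python
from statistics import fmean


def anomalies(series_by_year):
    """Subtract the multi-year mean for each day from every year's series."""
    n_days = len(series_by_year[0])
    # Climatology: mean over all years, separately for each day.
    climatology = [fmean(year[d] for year in series_by_year) for d in range(n_days)]
    return [[year[d] - climatology[d] for d in range(n_days)] for year in series_by_year]


def weekly_means(daily):
    """Average consecutive, non-overlapping 7-day blocks; drop an incomplete last block."""
    n_weeks = len(daily) // 7
    return [fmean(daily[w * 7:(w + 1) * 7]) for w in range(n_weeks)]


# Made-up daily values for two years of a 14-day season.
data = [
    [1.0] * 7 + [3.0] * 7,
    [3.0] * 7 + [5.0] * 7,
]
anom = anomalies(data)
print([weekly_means(year) for year in anom])
```

Storing such intermediate results on disk would mean students start from small, analysis-ready files rather than the raw daily output.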

aaronspring commented 3 years ago

Yep, a telecon would work for me now or tomorrow.

aaronspring commented 3 years ago

Will we have all SubX models on the NCAR machines? Do we have that much space?

If so, that would be easiest but then we wouldn't actually need to put SubX into the cloud for the ASP.

As a general rule: we are good if data and compute are in the same place, either both in the cloud or both on the NCAR machines.

judithberner commented 3 years ago

I am tagging @aneeshcs here, who is my co-organizer for the summer colloquium. My suggestion is that we have, let's say, TS, precip and u850 on disk for some models. For the tutorial that should be plenty. And if people want to do something across models, they can move to the cloud data, which is there partly for the tutorial, but really also for writing papers afterwards and for the general S2S community. My guess is that the general community will "download" the data 80% of the time, as we are not yet that familiar with cloud computing. But you could, e.g., give a talk at the workshop demonstrating cloud computing using 2i2c; that would educate the next generation and help 2i2c. I think the basic question we have to answer here is whether the workflow would be different. I do understand your point, but the goal here is education and not efficiency.

aaronspring commented 3 years ago

"partly for the tutorial, but really also for writing papers afterwards and for the general S2S community."

Sounds good.

"you could give a talk at the workshop demonstrating cloud computing using 2i2c"

No problem, I will.

"I think the basic question we have to answer here is if the workflow would be different. I do understand your point, but the goal here is education and not efficiency."

If someone wants to compute skill on a full global map and has to download 200 GB per variable and model, that's a lot and takes time. And basically this can be achieved now with no effort at IRI (just slower). I wouldn't want to do this personally. The workflow is not different. The Jupyter notebooks will be nearly identical, except for the line where you load the data. Both in the cloud and locally, a new user has to set up a conda environment and install climpred. conda is very confusing in the beginning. On the cloud/2i2c that can be pre-configured, which I see as an advantage.
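To illustrate the point that only the loading line differs, here is a small sketch. The paths and the names `DATA_LOCATIONS`, `load_data` and `describe` are hypothetical and do not refer to real datasets. In an actual notebook the opener would be something like `xarray.open_zarr` (assuming the data is stored as zarr); a stand-in is used here so the example runs without extra packages.

```python
# Hypothetical locations; neither the path nor the URL refers to a real dataset.
DATA_LOCATIONS = {
    "local": "/path/to/subx/model.zarr",
    "cloud": "https://example.org/subx/model.zarr",
}


def load_data(where, opener):
    """Open the dataset at the chosen location with the given opener function."""
    return opener(DATA_LOCATIONS[where])


def describe(path):
    """Stand-in opener that only reports what would be opened."""
    return f"would open {path}"


# Switching between local and cloud data changes only this one argument.
print(load_data("local", describe))
print(load_data("cloud", describe))
```

Everything after the loading call (anomalies, skill computation, plotting) would stay the same in both variants.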

Hi @aneeshcs, I think we briefly met at Ocean Sciences, when you spoke to Riley.

aneeshcs commented 3 years ago

Hi @aaronspring , Yes, I remember meeting you at Ocean Sciences! Nice to meet you again (this time virtually!).

Thanks, @judithberner , for adding me on here. I'll try and catch up on the conversations and will comment if I have anything to add.

aaronspring commented 3 years ago

Side note on data access: because searching for and downloading climate data is annoying, I catalogued a few climate datasets, including NCEP 6h, with intake here: https://github.com/aaronspring/remote_climate_data

judithberner commented 3 years ago

I think having a 2i2c account would also be good, e.g. for when the computer is down. @judithberner Do we have to wonder about intellectual property or can we just develop in the cloud?

aaronspring commented 3 years ago

What kind of intellectual property do you mean? Being in the cloud does not necessarily mean being open to the public. Pangeo cloud doesn't have such issues, I think.

judithberner commented 3 years ago

Just signed up for https://2i2c.org/. It seems they are not yet providing services.