leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

Add published ClimSim Dataset via pangeo forge #28

Closed: jbusecke closed this issue 1 year ago

jbusecke commented 1 year ago

Thanks to a lot of work by @sungdukyu @jerrylin96 and many others, the E3SM-MMF(ne4) aka ClimSim dataset is now available on huggingface.

I would like to reingest it via pangeo-forge to make this workflow entirely reproducible!

shoyer commented 1 year ago

We (at Google) would be excited to be able to play around with the ARCO version of ClimSim. I think this could make for an interesting training dataset for some of our ML weather models :)

sungdukyu commented 1 year ago

@shoyer (Sorry for the dumb question.) What is "ARCO"?

rabernat commented 1 year ago

"Analysis-Ready, Cloud Optimized". See this paper for a longer definition.

So basically:

That's the goal here.

sungdukyu commented 1 year ago

@rabernat Thanks for the clarification. One challenge would be the gigantic size of the high-res dataset, ~41TB. (BTW, we already have the 'Zarr'-ed real-geo low-res dataset in LEAP cloud storage.)

rabernat commented 1 year ago

> One challenge would be the gigantic size of the high-res dataset

This is not a problem in any way. On the contrary, this is exactly the type of dataset that benefits most from the ARCO approach. 🚀

For comparison, Google's ARCO ERA5 is hundreds of TB.

sungdukyu commented 1 year ago

Yes, absolutely, Zarr's the way to go for these huge datasets. However, what I meant was the cost of converting the NetCDF files (>400k files totaling 41TB) into a single Zarr store. That took quite a long time even for the low-res dataset (same number of files, but much smaller at several hundred GB). Maybe there's a more efficient way of doing this, I guess.
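
For scale, here is a back-of-the-envelope sketch using the two figures quoted in this thread (more than 400k files, about 41TB). The per-worker throughput and the worker count are purely illustrative assumptions, not measurements, and the estimate ignores decoding, rechunking, and scheduling overhead.

```python
# Rough sizing from the figures quoted in this thread (decimal units: 1 TB = 1e12 bytes).
N_FILES = 400_000      # ">400k files", treated here as a lower bound
TOTAL_BYTES = 41e12    # "~41TB"

# Hypothetical assumptions, for illustration only:
THROUGHPUT_BYTES_PER_S = 100e6  # 100 MB/s sustained per worker
N_WORKERS = 100                 # number of parallel workers


def average_file_size_mb(total_bytes: float, n_files: int) -> float:
    """Mean size of one input file in MB."""
    return total_bytes / n_files / 1e6


def wall_time_hours(total_bytes: float, throughput: float, n_workers: int = 1) -> float:
    """Idealized time to move all bytes once, assuming perfect parallel scaling."""
    return total_bytes / (throughput * n_workers) / 3600


if __name__ == "__main__":
    print(f"average file size: {average_file_size_mb(TOTAL_BYTES, N_FILES):.1f} MB")
    print(f"one worker:        {wall_time_hours(TOTAL_BYTES, THROUGHPUT_BYTES_PER_S):.1f} h")
    print(
        f"{N_WORKERS} workers:       "
        f"{wall_time_hours(TOTAL_BYTES, THROUGHPUT_BYTES_PER_S, N_WORKERS):.2f} h"
    )
```

Under these assumptions the files average roughly 100 MB each, so the per-file work is small and the total time is dominated by how many files can be processed at once.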

rabernat commented 1 year ago

> Maybe there's a more efficient way of doing this, I guess.

@sungdukyu 🤝 Meet Pangeo Forge!
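
The general idea that lets a many-files-to-one-store conversion run in parallel is that each input file's position along the concatenation dimension can be computed up front, so every file can be written to its own disjoint region in any order. The toy sketch below illustrates only that idea in plain Python: a list stands in for the target array, and `plan_regions` and `write_region` are hypothetical helper names, not Pangeo Forge's API or implementation.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # half-open [start, stop) along the concat dimension


def plan_regions(items_per_file: Sequence[int]) -> List[Region]:
    """Assign each input file a disjoint slot along the concat dimension."""
    regions: List[Region] = []
    start = 0
    for n_items in items_per_file:
        regions.append((start, start + n_items))
        start += n_items
    return regions


def write_region(target: list, region: Region, values: Sequence) -> None:
    """Write one file's values into its pre-assigned slot."""
    start, stop = region
    if len(values) != stop - start:
        raise ValueError("values do not match the planned region length")
    target[start:stop] = list(values)


# Three hypothetical input files holding 2, 3, and 1 time steps.
files = [["a0", "a1"], ["b0", "b1", "b2"], ["c0"]]
regions = plan_regions([len(f) for f in files])

# Pre-allocate the target, then write the files in reverse order to show
# that the result does not depend on the order in which writes arrive.
target = [None] * regions[-1][1]
for region, values in reversed(list(zip(regions, files))):
    write_region(target, region, values)
```

Because the regions never overlap, the writes need no coordination with each other, which is what allows the work to be spread over many workers.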

sungdukyu commented 1 year ago

@rabernat Thanks! Seems like my bakery was suboptimal. (https://pangeo-forge.org/pangeo-forge-diagram.png)

jbusecke commented 1 year ago

Since we have the right audience here, would it make sense to explore uploading a Zarr store directly to huggingface?

rabernat commented 1 year ago

Let's not try to mix in huggingface-related development work. Let's just try to ship the ClimSim dataset as fast as possible by whatever means necessary.

jbusecke commented 1 year ago

I can try to spend some time on this next week! In case somebody wants to start before that, starting a recipe PR in our LEAP feedstock would be the way to go.

cisaacstern commented 1 year ago

@jbusecke could you push the work we began together on this as a PR?

I have some time I can devote to this and would love to pick up where we left off.

jbusecke commented 1 year ago

See #33, sorry for the delay.