leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

New Dataset CHIRPS #164

Open ks905383 opened 1 month ago

ks905383 commented 1 month ago

Dataset Name

Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)

Dataset URL

http://data.chc.ucsb.edu/products/CHIRPS-2.0/

Description

CHIRPS is a near-global gridded rainfall observations product that blends station data with satellite data, and is considered one of the most accurate gridded rainfall products for many parts of the world, especially large parts of Sub-Saharan Africa. CHIRPS is widely used - its Nature Scientific Data article has ~5000 citations on Google Scholar. Its uses include climate model evaluation, understanding climate dynamics, and supporting climate services efforts.

Size

~1 GB per file for the high-resolution daily global data, so ~50 GB in total

License

Creative Commons ("To the extent possible under the law, Pete Peterson has waived all copyright and related or neighboring rights to CHIRPS. CHIRPS data is in the public domain as registered with Creative Commons." at https://chc.ucsb.edu/data/chirps)

Data Format

NetCDF

Data Format (other)

No response

Access protocol

FTP

Source File Organization

Data is stored in a single directory per resolution / geographic scale, with one NetCDF file per year.

Example URLs

http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/
http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/chirps-v2.0.1981.days_p05.nc
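For context, a minimal sketch of how the yearly files could be enumerated for a recipe, assuming a pangeo-forge style `FilePattern`; the naming convention follows the example URL above, and the 2024 end year is a guess:

```python
# Sketch: enumerate the yearly CHIRPS global daily p05 NetCDF files.
# Assumes pangeo_forge_recipes and coverage from 1981 through 2024
# (the end year is an assumption, not taken from the source).
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

BASE = "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05"
urls = [f"{BASE}/chirps-v2.0.{year}.days_p05.nc" for year in range(1981, 2025)]

# One file per year, concatenated along the time dimension.
pattern = pattern_from_file_sequence(urls, concat_dim="time")
```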

Authorization

None

Transformation / Processing

My personal preference generally is for files to follow CMIP standard variable / dimension names (lon / lat / pr instead of longitude/latitude/precip in the CHIRPS files), but I'm not sure if that's in line with LEAP's philosophy on how to store / standardize files.

Otherwise, the main thing would just be to collate across the by-year files.
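For reference, a rough sketch of what that preprocessing could look like with xarray; the variable names to rename come from the CHIRPS files as described above, and the file glob is just illustrative:

```python
import xarray as xr

# Open the by-year files lazily and concatenate along time,
# then rename to CMIP-style dimension / variable names.
ds = xr.open_mfdataset(
    "chirps-v2.0.*.days_p05.nc",  # the yearly files described above
    combine="by_coords",
    parallel=True,
)
ds = ds.rename({"longitude": "lon", "latitude": "lat", "precip": "pr"})
```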

Target Format

Zarr

Comments

There are a few different sets of data, stored by resolution and geographic extent. The global_* files are the only ones needed for the Zarr store, but perhaps multiple spatiotemporal resolutions would be helpful to store as well (global_monthly and global_daily for sure, and maybe both spatial resolutions available for each).

norlandrhagen commented 1 month ago

This looks like a great candidate @ks905383! I'll make sure to bring it up at tomorrow's @leap-stc/data-and-compute meeting.

One question for you: do you think the default chunking structure of this dataset would work for your analysis? If so, we could consider creating a virtualizarr virtual dataset. This would behave just like a Zarr, but we don't copy any of the NetCDFs. Just an option, because we could also make a Zarr store for a dataset of this size. Anyhow, something to consider.
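Roughly, the virtual option would look something like the sketch below. The VirtualiZarr calls follow its current docs and are an assumption, not necessarily what the feedstock would use; the two URLs just follow the naming pattern from the issue:

```python
# Sketch: build references to the original NetCDFs instead of copying bytes.
import xarray as xr
from virtualizarr import open_virtual_dataset

urls = [
    "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/chirps-v2.0.1981.days_p05.nc",
    "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/chirps-v2.0.1982.days_p05.nc",
]

# One virtual dataset per yearly file, concatenated along time.
virtual_datasets = [open_virtual_dataset(url) for url in urls]
vds = xr.concat(virtual_datasets, dim="time", coords="minimal", compat="override")

# Persist the references; readers then open this like a normal Zarr store.
vds.virtualize.to_kerchunk("chirps_combined.json", format="json")
```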

I'll probably have some more questions for you in the coming week 🎊

jbusecke commented 1 month ago

Just tried to download the sample file, and this server is slooooooow

(screenshot of the slow download speed)

This makes me think that the virtualized data might perform poorly. I would probably vote for a copy (especially since this data is not that big?). Happy to chat more.

norlandrhagen commented 1 month ago

Good check @jbusecke!

jbusecke commented 1 month ago

I started a feedstock

jbusecke commented 1 month ago

I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?

ks905383 commented 1 month ago

Sounds good! This is my first ingestion request, so I'm not quite sure yet what's useful and what's not, but I have downloading/preprocessing code for this dataset that I'm of course happy to share if it is useful.

norlandrhagen commented 1 month ago

I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?

I'm getting like 20MB/s downloading that NetCDF on rando coffee shop internet.

jbusecke commented 1 month ago

@ks905383 yes, that would definitely be useful. I will prototype the 'bare' dataset ingestion, but if you have steps that enhance the usability (renaming etc.), we can absolutely accommodate those in a preprocessing stage.

jbusecke commented 1 month ago

Ideally, if you can apply them to the test dataset, that would help a lot in adding them to the recipe.

jbusecke commented 4 weeks ago

@ks905383 should we ingest the full data as is for now? Or do you think you'll have time to work on the preprocessing with us this week? No problem if not, we can always redo this (the beauty of a fully reproducible pipeline!).

ks905383 commented 4 weeks ago

Yes - sorry for not responding earlier. Maybe best to ingest for now.

I think there's still a question on whether to implement the preprocessing code... There's nothing wrong with the original data per se, it's just that I like the data I use to follow CMIP standards (lat, lon, pr instead of latitude, longitude, precip in this case), and that's pretty much all that my preprocessing code for this does - there aren't big data issues that need to be fixed. So, on the one hand, it would make it easier for me personally, but perhaps more confusing for others who are already using CHIRPS in other contexts? What do you think?

jbusecke commented 4 weeks ago

That sounds good, and from what you describe these fixes can be applied in a lightweight way after loading the data (renaming only?).

jbusecke commented 4 weeks ago

I'll merge and run the whole thing. Will ping this issue when the dataset is in the catalog.

ks905383 commented 4 weeks ago

That sounds good, and from what you describe these fixes can be applied in a lightweight way after loading the data (renaming only?).

Yeah pretty much!

jbusecke commented 3 weeks ago

As soon as the server is back up we will ingest the full thing and close this.