leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

New Dataset CHIRPS #164

Open ks905383 opened 3 days ago

ks905383 commented 3 days ago

Dataset Name

Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)

Dataset URL

http://data.chc.ucsb.edu/products/CHIRPS-2.0/

Description

CHIRPS is a near-global gridded rainfall observation product that blends station and satellite data, and is considered one of the most accurate gridded rainfall products for many parts of the world, especially large parts of Sub-Saharan Africa. CHIRPS is widely used; its Nature Scientific Data article has ~5,000 citations on Google Scholar. Its uses include climate model evaluation, understanding climate dynamics, and supporting climate services efforts.

Size

~1GB per file for the high-resolution global daily data (one file per year), so ~50GB in total

License

Creative Commons ("To the extent possible under the law, Pete Peterson has waived all copyright and related or neighboring rights to CHIRPS. CHIRPS data is in the public domain as registered with Creative Commons." at https://chc.ucsb.edu/data/chirps)

Data Format

NetCDF

Data Format (other)

No response

Access protocol

FTP

Source File Organization

Data is stored in a single directory per resolution / geographic scale, with one NetCDF file per year.

Example URLs

http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/
http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/chirps-v2.0.1981.days_p05.nc
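
For reference, the yearly file names follow a simple pattern and can be generated programmatically. A minimal sketch; the 1981–2024 year range is an assumption based on the directory listing:

```python
# Sketch: build the list of yearly CHIRPS p05 global-daily NetCDF URLs.
# The year range is an assumption; check the directory listing for the actual span.
BASE = "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05"

urls = [f"{BASE}/chirps-v2.0.{year}.days_p05.nc" for year in range(1981, 2025)]
print(urls[0])  # .../chirps-v2.0.1981.days_p05.nc
```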

Authorization

None

Transformation / Processing

My personal preference generally is for files to follow CMIP standard variable / dimension names (lon / lat / pr instead of longitude/latitude/precip in the CHIRPS files), but I'm not sure if that's in line with LEAP's philosophy on how to store / standardize files.

Otherwise, the main thing would just be to collate across the by-year files, as in the sketch below.
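
For illustration, a minimal xarray sketch of that preprocessing, assuming local copies of the yearly files; the source names (`longitude`/`latitude`/`precip`) are as described above, and the targets follow CMIP conventions:

```python
import xarray as xr

# Open the yearly files lazily, concatenating along time,
# then rename to CMIP-style dimension / variable names.
ds = xr.open_mfdataset("chirps-v2.0.*.days_p05.nc", combine="by_coords")
ds = ds.rename({"longitude": "lon", "latitude": "lat", "precip": "pr"})
```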

Target Format

Zarr

Comments

There are a few different datasets, organized by resolution and geographic extent. The global_* files are the only ones needed for the Zarr store, but it could be helpful to store multiple spatiotemporal resolutions as well (global_monthly and global_daily for sure, and maybe both spatial resolutions for each).

norlandrhagen commented 3 days ago

This looks like a great candidate @ks905383! I'll make sure to bring it up at tomorrow's @leap-stc/data-and-compute meeting.

One question for you: do you think the default chunking structure of this dataset would work for your analysis? If so, we could consider creating a VirtualiZarr virtual dataset. This would behave just like a Zarr store, but we wouldn't copy any of the NetCDFs. Just an option, since we could also make a regular Zarr store for a dataset of this size. Anyhow, something to consider.
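
For context, a rough sketch of what the virtual-dataset route could look like; the VirtualiZarr calls, year range, and reference-file name below are illustrative, not the actual plan:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Build a virtual (reference-only) dataset per yearly NetCDF, concatenate along
# time, and write kerchunk-style references instead of copying any data.
urls = [
    "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05/"
    f"chirps-v2.0.{year}.days_p05.nc"
    for year in range(1981, 2025)  # year range is an assumption
]
vds = xr.concat(
    [open_virtual_dataset(url) for url in urls],
    dim="time",
    coords="minimal",
    compat="override",
)
vds.virtualize.to_kerchunk("chirps-global-daily-p05-refs.json", format="json")
```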

I'll probably have some more questions for you in the coming week 🎊

jbusecke commented 3 days ago

Just tried to download the sample file, and this server is slooooooow

[screenshot: download speed for the sample file]

This makes me think that the virtualized data might perform poorly. I would probably vote for a copy (especially since this data is not that big?). Happy to chat more.

norlandrhagen commented 3 days ago

Good check @jbusecke!

jbusecke commented 3 days ago

I started a feedstock
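
For anyone following along, a rough sketch of what a Beam-based pangeo-forge-recipes recipe for the p05 global-daily files might look like (not the actual feedstock contents; the year range, chunking, and store name are placeholders):

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

def make_url(time: int) -> str:
    # One NetCDF file per year in the p05 global-daily directory.
    return (
        "http://data.chc.ucsb.edu/products/CHIRPS-2.0/"
        f"global_daily/netcdf/p05/chirps-v2.0.{time}.days_p05.nc"
    )

# Year range is an assumption; adjust to whatever the server actually holds.
pattern = FilePattern(make_url, ConcatDim("time", list(range(1981, 2025))))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="chirps-global-daily-p05.zarr",  # placeholder
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 365},                # placeholder chunking
    )
)
```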

jbusecke commented 3 days ago

I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?

ks905383 commented 3 days ago

Sounds good! This is my first ingestion request, so I'm not quite sure yet what's useful and what's not, but I have downloading/preprocessing code for this dataset that I'm of course happy to share if it is useful.

norlandrhagen commented 3 days ago

> I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?

I'm getting like 20MB/s downloading that NetCDF on rando coffee shop internet.

jbusecke commented 3 days ago

@ks905383 yes, that would definitely be useful. I will prototype the 'bare' dataset ingestion, but if you have steps that enhance the usability (renaming etc.), we can absolutely accommodate those in a preprocessing stage.

jbusecke commented 3 days ago

Ideally, if you can apply them to the test dataset, that would help a lot in applying them to the recipe.