Open ks905383 opened 3 days ago
This looks like a great candidate @ks905383! I'll make sure to bring it up at tomorrow's @leap-stc/data-and-compute meeting.
One question for you: do you think the default chunking structure of this dataset would work for your analysis? If so, we could consider creating a virtualizarr virtual dataset. This would behave just like a Zarr store, but we wouldn't copy any of the NetCDFs. Just an option, because we could also make a Zarr store for a dataset of this size. Anyhow, something to consider.
I'll probably have some more questions for you in the coming week 🎊
Just tried to download the sample file, and this server is slooooooow
This makes me think that the virtualized data might perform poorly. I would probably vote for a copy (especially since this data is not that big?). Happy to chat more.
Good check @jbusecke!
I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?
Sounds good! This is my first ingestion request, so I'm not quite sure yet what's useful and what's not, but I have downloading/preprocessing code for this dataset that I'm of course happy to share if it is useful.
I'm getting like 20MB/s downloading that NetCDF on rando coffee shop internet.
@ks905383 yes, that would definitely be useful. I will prototype the 'bare' dataset ingestion, but if you have steps that enhance the usability (renaming etc.), we can absolutely accommodate those in a preprocessing stage.
Ideally if you can apply them to the test dataset, that would help a lot in applying this to the recipe.
Dataset Name
Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)
Dataset URL
http://data.chc.ucsb.edu/products/CHIRPS-2.0/
Description
CHIRPS is a near-global gridded rainfall observations product that blends stations with satellite data, and is considered one of the most accurate gridded rainfall products for many parts of the world, especially for large parts of Sub-Saharan Africa. CHIRPS is widely used - its Nature Scientific Data article has ~5000 citations on google scholar. Its uses include climate model evaluation, understanding climate dynamics, and helping support climate services efforts.
Size
~1GB files for high-resolution daily, global data, so ~50GB in total
License
Creative Commons ("To the extent possible under the law, Pete Peterson has waived all copyright and related or neighboring rights to CHIRPS. CHIRPS data is in the public domain as registered with Creative Commons." at https://chc.ucsb.edu/data/chirps)
Data Format
NetCDF
Data Format (other)
No response
Access protocol
FTP
Source File Organization
Data is stored in a single directory per resolution / geographic scale, with one NetCDF file per year.
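As a rough illustration of the one-file-per-year layout described above, the list of source files for a recipe could be generated along these lines. This is only a sketch: the base URL and the `global_daily` directory name come from this issue, but `FILENAME_TEMPLATE` and the year range are hypothetical placeholders and would need to be replaced with the actual naming used on the server.

```python
from urllib.parse import urljoin

# Base URL as given in this issue.
BASE_URL = "http://data.chc.ucsb.edu/products/CHIRPS-2.0/"

# HYPOTHETICAL file-name pattern (placeholder only); the real pattern
# must be taken from the server's directory listing.
FILENAME_TEMPLATE = "example-file-{year}.nc"


def yearly_urls(subset: str, first_year: int, last_year: int) -> list[str]:
    """Return one URL per year (inclusive) for the given subset directory."""
    directory = urljoin(BASE_URL, subset.strip("/") + "/")
    return [
        urljoin(directory, FILENAME_TEMPLATE.format(year=year))
        for year in range(first_year, last_year + 1)
    ]


# The year range here is arbitrary and only for demonstration.
urls = yearly_urls("global_daily", 1981, 1983)
for url in urls:
    print(url)
```

Keeping the file-name pattern in a single template makes it easy to swap in the real naming once it is confirmed.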
Example URLs
Authorization
None
Transformation / Processing
My personal preference generally is for files to follow CMIP standard variable / dimension names (`lon`/`lat`/`pr` instead of `longitude`/`latitude`/`precip` in the CHIRPS files), but I'm not sure if that's in line with LEAP's philosophy on how to store / standardize files. Otherwise, the main thing would be just to collate across the by-year files.
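For reference, the renaming and by-year collation described above might look roughly like this with xarray. This is a minimal sketch, not the actual preprocessing code: `make_fake_year` is a hypothetical helper that builds tiny synthetic stand-ins for the yearly files, and the `time` dimension name is an assumption about the source files.

```python
import numpy as np
import xarray as xr

# CHIRPS names mentioned above, mapped to the CMIP-style names.
CMIP_NAMES = {"longitude": "lon", "latitude": "lat", "precip": "pr"}


def to_cmip_names(ds: xr.Dataset) -> xr.Dataset:
    """Rename only those CHIRPS names that are present in the dataset."""
    present = {old: new for old, new in CMIP_NAMES.items() if old in ds.variables}
    return ds.rename(present)


def make_fake_year(year: int) -> xr.Dataset:
    """Hypothetical helper: a tiny synthetic stand-in for one by-year file."""
    time = np.array([f"{year}-01-01", f"{year}-01-02"], dtype="datetime64[ns]")
    return xr.Dataset(
        {"precip": (("time", "latitude", "longitude"), np.zeros((2, 3, 4)))},
        coords={
            "time": time,
            "latitude": np.linspace(-1.0, 1.0, 3),
            "longitude": np.linspace(0.0, 3.0, 4),
        },
    )


# Rename each yearly dataset, then collate along the (assumed) time dimension.
yearly = [to_cmip_names(make_fake_year(year)) for year in (1981, 1982)]
combined = xr.concat(yearly, dim="time")
print(combined)
```

With real files, the synthetic datasets would be replaced by the opened NetCDFs, and the same `to_cmip_names` function could serve as the preprocessing step.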
Target Format
Zarr
Comments
There are a few different sets of data, stored by resolution and geographic extent. The `global_*` files are the only ones necessary in the zarr contexts, but perhaps multiple spatiotemporal resolution versions could be helpful to store as well. (`global_monthly` and `global_daily` for sure, and maybe both spatial resolutions available for each as well).