ks905383 opened this issue 1 month ago (Open)
This looks like a great candidate @ks905383! I'll make sure to bring it up at tomorrow's @leap-stc/data-and-compute meeting.
One question for you: do you think the default chunking structure of this dataset would work for your analysis? If so, we could consider creating a virtualizarr virtual dataset. This would behave just like a Zarr store, but we don't copy any of the NetCDFs. Just an option, because we could also make a Zarr store for a dataset of this size. Anyhow, something to consider.
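For context, building the virtual dataset would look roughly like this (a minimal sketch with virtualizarr; the filenames are placeholders and the exact kwargs may differ by version):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# open each yearly NetCDF as a set of chunk references (no data is copied)
yearly_files = ["chirps-1981.nc", "chirps-1982.nc"]  # placeholder filenames
virtual_dsets = [open_virtual_dataset(f) for f in yearly_files]

# concatenate the references along time and write them out as a kerchunk JSON
combined = xr.concat(virtual_dsets, dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("chirps_virtual.json", format="json")
```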
I'll probably have some more questions for you in the coming week 🎊
Just tried to download the sample file, and this server is slooooooow
This makes me think that the virtualized data might perform poorly. I would probably vote for a copy (especially since this data is not that big?). Happy to chat more.
Good check @jbusecke!
I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?
Sounds good! This is my first ingestion request, so I'm not quite sure yet what's useful and what's not, but I have downloading/preprocessing code for this dataset that I'm of course happy to share if it is useful.
> I think our internet here is generally shite...maybe confirm the above finding on your end @norlandrhagen ?
I'm getting like 20MB/s downloading that NetCDF on rando coffee shop internet.
@ks905383 yes, that would definitely be useful. I will prototype the 'bare' dataset ingestion, but if you have steps that enhance the usability (renaming etc.), we can absolutely accommodate those in a preprocessing stage.
Ideally, if you can apply them to the test dataset, that would help a lot in applying this to the recipe.
@ks905383 should we ingest the full data as is for now? Or do you think you'll have time to work on the preprocessing with us this week? No problem if not, we can always redo this (the beauty of a fully reproducible pipeline!).
Yes - sorry for not responding earlier. Maybe best to ingest for now.
I think there's still a question on whether to implement the preprocessing code... There's nothing wrong with the original data per se, it's just that I like the data I use to follow CMIP standards (`lat`, `lon`, `pr` instead of `latitude`, `longitude`, `precip` in this case), and that's pretty much all that my preprocessing code for this does - there aren't big data issues that need to be fixed. So, on the one hand, it would make it easier for me personally, but perhaps more confusing for others who are already using CHIRPS in other contexts? What do you think?
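For reference, the preprocessing really is just a rename, something like this minimal xarray sketch (the filename is a placeholder):

```python
import xarray as xr

# rename CHIRPS dimension / variable names to CMIP-style names
def rename_to_cmip(ds: xr.Dataset) -> xr.Dataset:
    return ds.rename({"latitude": "lat", "longitude": "lon", "precip": "pr"})

ds = rename_to_cmip(xr.open_dataset("chirps-sample.nc"))  # placeholder filename
```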
That sounds good, and from what you describe these fixes can be applied in a lightweight way after loading the data (renaming only?).
I'll merge and run the whole thing. Will ping here when the dataset is in the catalog.
> That sounds good, and from what you describe these fixes can be applied in a lightweight way after loading the data (renaming only?).
Yeah pretty much!
As soon as the server is back up we will ingest the full thing and close this.
Dataset Name
Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)
Dataset URL
http://data.chc.ucsb.edu/products/CHIRPS-2.0/
Description
CHIRPS is a near-global gridded rainfall observations product that blends station observations with satellite data, and is considered one of the most accurate gridded rainfall products for many parts of the world, especially for large parts of Sub-Saharan Africa. CHIRPS is widely used - its Nature Scientific Data article has ~5,000 citations on Google Scholar. Its uses include climate model evaluation, understanding climate dynamics, and helping support climate services efforts.
Size
~1 GB per file for the high-resolution daily global data, so ~50 GB in total
License
Creative Commons ("To the extent possible under the law, Pete Peterson has waived all copyright and related or neighboring rights to CHIRPS. CHIRPS data is in the public domain as registered with Creative Commons." at https://chc.ucsb.edu/data/chirps)
Data Format
NetCDF
Data Format (other)
No response
Access protocol
FTP
Source File Organization
Data is stored in a single directory per resolution / geographic scale, with one NetCDF file per year.
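For example, the per-year file list could be built like this (the sub-path and filename pattern are my guess for the global daily 0.05° product; please verify against the actual directory listing):

```python
# assumed layout: <root>/global_daily/netcdf/p05/chirps-v2.0.<year>.days_p05.nc
base = "http://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/p05"
urls = [f"{base}/chirps-v2.0.{year}.days_p05.nc" for year in range(1981, 2024)]
```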
Example URLs
Authorization
None
Transformation / Processing
My personal preference generally is for files to follow CMIP standard variable / dimension names (`lon`/`lat`/`pr` instead of `longitude`/`latitude`/`precip` in the CHIRPS files), but I'm not sure if that's in line with LEAP's philosophy on how to store / standardize files. Otherwise, the main thing would just be to collate across the by-year files.
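Concretely, the collation could look something like this (a rough sketch with xarray/dask; the glob, chunking, and output path are placeholders):

```python
import xarray as xr

# lazily open all by-year NetCDFs and concatenate along time
ds = xr.open_mfdataset(
    "chirps-v2.0.*.days_p05.nc",  # placeholder glob for the yearly files
    combine="by_coords",
    chunks={"time": 365},
)

# optional CMIP-style renaming, as noted above
ds = ds.rename({"latitude": "lat", "longitude": "lon", "precip": "pr"})

# write out a single consolidated Zarr store
ds.to_zarr("chirps_global_daily_p05.zarr", mode="w", consolidated=True)
```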
Target Format
Zarr
Comments
There are a few different sets of data, stored by resolution and geographic extent. The `global_*` files are the only ones necessary in the Zarr context, but perhaps multiple spatiotemporal resolution versions could be helpful to store as well (`global_monthly` and `global_daily` for sure, and maybe both spatial resolutions available for each as well).