mpiannucci / gribberish

Read GRIB files with Rust
MIT License

Lazily reading GRIB from cloud object storage? #52

Open · JackKelly opened this issue 1 month ago

JackKelly commented 1 month ago

Hi! gribberish looks great! Please may I ask a naive question: Is it possible to lazily read a GRIB directly from cloud object storage? For example, if I only want to read a single message from a GRIB file that contains many messages? (If that question even makes sense! I'm quite new to the inner workings of GRIB!)

mpiannucci commented 1 month ago

I can give two answers here, outlining how it currently works and how it could be improved, in the hope that this sparks some ideas; you can then describe whether this fits your use case.

Currently, the core API is that you feed in a byte array, and the Rust `Message` struct scans and gathers the GRIB sections and then uses those sections to do things. So the way I use it with cloud object storage is that I store the offsets and metadata provided by the IDX sidecars; then, when I need the data, I pull down only the byte range I need for a given message, parse the single message, and extract the data.
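Roughly, that workflow looks like this (the URLs are placeholders, and the gribberish entry points named here are assumptions about the Python bindings — check the package for the exact names):

```python
import requests
import gribberish

GRIB_URL = "https://example.com/gefs.t00z.pgrb2a.0p50.f000"  # placeholder
IDX_URL = GRIB_URL + ".idx"

# Each .idx line looks like "<msg#>:<byte offset>:d=<date>:<var>:<level>:...",
# so the second colon-separated field is the message's byte offset.
offsets = [
    int(line.split(":")[1])
    for line in requests.get(IDX_URL).text.strip().splitlines()
]

# The byte range for message i runs from its offset up to the next offset.
i = 4  # e.g. the fifth message in the file
start = offsets[i]
end = offsets[i + 1] - 1 if i + 1 < len(offsets) else ""  # "" = to end of file

# Pull down only that message with an HTTP Range request...
msg_bytes = requests.get(GRIB_URL, headers={"Range": f"bytes={start}-{end}"}).content

# ...then parse the single message and extract its data.
message = gribberish.parse_grib_message(msg_bytes, 0)  # name/signature assumed
data = message.data()                                  # accessor name assumed
```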

This is not truly lazy because it expects all of the bytes of the message up front and does not implement a reader interface like it should, but it keeps everything straightforward for now. I can imagine an API where you stream data and the sections are loaded on demand, or where only the sections relevant to the data need to be downloaded; a rough sketch follows.
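Purely hypothetical (this is not the current API), the on-demand idea might look something like this, where a streaming message parser calls `read()` once per section header instead of needing the whole message:

```python
import requests

class LazyMessageReader:
    """Fetch byte ranges of one remote GRIB message only when asked."""

    def __init__(self, url: str, message_offset: int):
        self.url = url
        self.message_offset = message_offset

    def read(self, start: int, length: int) -> bytes:
        # Translate a message-relative range into an absolute HTTP Range request.
        lo = self.message_offset + start
        hi = lo + length - 1
        resp = requests.get(self.url, headers={"Range": f"bytes={lo}-{hi}"})
        resp.raise_for_status()
        return resp.content
```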

All of that is to say that, in Python, this package currently works with xarray and kerchunk, and will work fine if only the bytes for a single message in a given GRIB file are provided.
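For example, the usual kerchunk-reference pattern for opening such a dataset looks something like this (the reference file name and S3 options are placeholders, assuming a reference set built with gribberish's codec):

```python
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "gefs_references.json",  # kerchunk reference set (placeholder)
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```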

If you want to explain your use case, and whether you think this could be improved, that would be useful. Thanks for reaching out!

JackKelly commented 1 month ago

Thanks loads for the quick & detailed reply!

> store the offsets and metadata provided by the IDX sidecars; then, when I need the data, I pull down only the byte range I need for a given message, parse the single message, and extract the data.

That sounds good to me! (for now, at least)

My use-case

To zoom all the way out: for the last 5 years, we've been trying to train large ML models on NWP and satellite data at the non-profit that I co-founded, Open Climate Fix. To train our ML models, we need to feed thousands of ML training examples into the model per second. Each example is typically a fixed-size crop of NWP and satellite data: for example, a crop with shape x=64 * y=64 * 48 hours * 16 NWP variables. Each example will typically have a random start time and a random geospatial location. And we need to sustain a throughput of a few gigabytes per second to each GPU (to be specific: we need a few gigabytes per second of the 64x64x48x16 crops, so we'll need a higher throughput across the network, because we'll throw away a significant portion of the data we read: the "chunk size" of the GRIBs doesn't match the chunk size of our ML training examples). A concrete sketch of one example is below.
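As a sketch of what one training example looks like (all dimension names and sizes here are assumptions from the shapes above, given a lazily opened dataset `ds`):

```python
import numpy as np

# Pick a random start time and spatial origin, then cut a fixed-size crop.
t0 = np.random.randint(0, ds.sizes["time"] - 48)
y0 = np.random.randint(0, ds.sizes["y"] - 64)
x0 = np.random.randint(0, ds.sizes["x"] - 64)

example = ds.isel(
    time=slice(t0, t0 + 48),  # 48 hours
    y=slice(y0, y0 + 64),     # 64 grid cells
    x=slice(x0, x0 + 64),
).load()  # this triggers the actual byte-range reads
```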

My "dream" is to be able to train ML models directly from NWPs already on cloud object storage. And to do so as efficiently as possible (in terms of memory & CPU & network utilisation), so we have enough CPU cycles left over to do some simple transforms of the data on-the-fly.

As a data user, I'd like to be able to lazily open the entire NODD GEFS dataset with `xr.open_dataset(URL)`, load thousands of random crops per second, and have it "just work" :slightly_smiling_face:. I'm happy to rent a VM with a 200 Gbps network interface card, if necessary.

Are existing tools (kerchunk + gribberish + xarray) already capable of this? (I must admit that I haven't tried yet!) If not, I'm excited to help make this a reality, and I'd love advice on how best to help (I'm comfortable in Rust & Python).

(BTW, here's a draft blog post which goes into more detail. But this blog post will almost certainly change! I'm still learning about recent developments in this field!)

mpiannucci commented 1 month ago

So here is an example notebook showing how to use gribberish to create a kerchunked GEFS dataset that you can lazily load with xarray and gribberish. I think this hits the basics of what you are looking for. Notably, this does not operate on the IDX files; David's work on those workflows should be mostly compatible with gribberish, though.

https://github.com/mpiannucci/gribberish/blob/main/python/examples/kerchunk_gefs_wave.ipynb

Let me know if you have any questions!

JackKelly commented 1 month ago

Awesome, thank you! Do you have a feel for how "near-optimal" this existing solution is? Does this solution achieve a throughput that's close to the hardware's theoretical max throughput? (No worries if not! I will do some experiments myself!)

mpiannucci commented 1 month ago

In my experience, the worst part of this whole process is dealing with xarray not having async controls, but if you have an optimized dask pipeline you can overcome that. Zarr 3 will be a big deal for this.
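Something like this, as an illustration only (the scheduler choice and worker count are arbitrary, not recommendations, and `ds` is a dask-backed dataset as opened earlier):

```python
import dask

# Read the chunks backing one crop in parallel threads rather than serially.
with dask.config.set(scheduler="threads", num_workers=32):
    crop = ds.isel(time=slice(0, 48), y=slice(0, 64), x=slice(0, 64)).compute()
```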

Beyond that I'm not totally sure; I mostly deal with building web services, so a lot of my performance knowledge is biased toward building those systems.