nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Opening virtual datasets with NASA dmrpp files #605

Open ayushnag opened 3 months ago

ayushnag commented 3 months ago

The idea is to speed up opening netCDF4/HDF5 datasets with a NASA-specific optimization: load data as you would with xr.open_mfdataset, but at kerchunk/zarr speeds, by translating existing dmr++ metadata files to zarr metadata on the fly. Much more context and discussion here.
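To make the translation step concrete, here is a rough sketch of how chunk byte ranges recorded in a dmr++ file could be mapped to kerchunk-style references of the form `[url, offset, length]`. It is not the virtualizarr parser: the XML fragment is a simplified, hand-written example, and the variable name `sst`, the URL, the byte offsets, and the helper `chunk_refs` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written dmr++-style fragment (all values are hypothetical).
DMRPP = """<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#"
         xmlns:dmrpp="http://xml.opendap.org/dap/dmrpp/1.0.0#">
  <Float32 name="sst">
    <dmrpp:chunks>
      <dmrpp:chunkDimensionSizes>2 4</dmrpp:chunkDimensionSizes>
      <dmrpp:chunk offset="4096" nBytes="32" chunkPositionInArray="[0,0]"/>
      <dmrpp:chunk offset="4128" nBytes="32" chunkPositionInArray="[0,4]"/>
      <dmrpp:chunk offset="4160" nBytes="32" chunkPositionInArray="[2,0]"/>
    </dmrpp:chunks>
  </Float32>
</Dataset>"""

NS = {
    "dap": "http://xml.opendap.org/ns/DAP/4.0#",
    "dmrpp": "http://xml.opendap.org/dap/dmrpp/1.0.0#",
}


def chunk_refs(dmrpp_xml, data_url):
    """Map each chunk to a zarr-style key -> [url, offset, length] reference."""
    root = ET.fromstring(dmrpp_xml)
    refs = {}
    # Only Float32 variables are handled here to keep the sketch short.
    for var in root.findall("dap:Float32", NS):
        name = var.get("name")
        chunks = var.find("dmrpp:chunks", NS)
        sizes = [
            int(s)
            for s in chunks.find("dmrpp:chunkDimensionSizes", NS).text.split()
        ]
        for chunk in chunks.findall("dmrpp:chunk", NS):
            # chunkPositionInArray is an element position; dividing by the
            # chunk shape gives the chunk index used in the zarr key.
            pos = [
                int(p)
                for p in chunk.get("chunkPositionInArray").strip("[]").split(",")
            ]
            key = ".".join(str(p // s) for p, s in zip(pos, sizes))
            refs[f"{name}/{key}"] = [
                data_url,
                int(chunk.get("offset")),
                int(chunk.get("nBytes")),
            ]
    return refs


refs = chunk_refs(DMRPP, "s3://example-bucket/granule.nc")
```

No array data is read at any point; only the small XML sidecar is parsed, which is where the speedup is expected to come from.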

virtualizarr PR for the parser here

earthaccess PR here

earthaccess additions:

  1. Open a virtual dataset (a view of the data that contains dimensions, variables, attrs, and chunks, but no actual data)
  2. Concatenate virtual xr.Datasets
    1. Use xarray's concatenation logic to create virtual views of netCDF files (more details in the virtualizarr documentation)
    2. Save as JSON, Parquet, or an in-memory dict
  3. Read netcdf/hdf5 data
    1. Use the zarr engine in xarray to load a dataset (with indexes)
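As a toy illustration of the concatenate-and-save step, the sketch below stacks per-granule chunk references along a new leading axis and serializes the result as JSON. It is not how virtualizarr or xarray implement concatenation; `concat_refs`, the granule dicts, and the URLs are hypothetical, and the `.zarray`/`.zattrs` metadata entries that a complete reference set would need are left out.

```python
import json


def concat_refs(per_granule_refs, var):
    """Stack per-granule chunk references along a new leading axis."""
    prefix = f"{var}/"
    combined = {}
    for i, refs in enumerate(per_granule_refs):
        for key, ref in refs.items():
            if key.startswith(prefix):
                # Prepend the granule index to the existing chunk index,
                # e.g. "sst/0.0" from granule 1 becomes "sst/1.0.0".
                combined[f"{prefix}{i}.{key[len(prefix):]}"] = ref
    return combined


# Hypothetical single-chunk references for two granules.
granule_a = {"sst/0.0": ["s3://example-bucket/a.nc", 4096, 32]}
granule_b = {"sst/0.0": ["s3://example-bucket/b.nc", 8192, 32]}

combined = concat_refs([granule_a, granule_b], "sst")

# Save as JSON; the same dict could also be kept in memory.
as_json = json.dumps({"version": 1, "refs": combined}, indent=2)
```

The combined mapping still points at byte ranges in the original files, so nothing is copied or rewritten.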

Questions/Suggestions:

Changes to the API?

NASA datasets you want me to test?

Take a look at the virtualizarr parser PR and leave suggestions

Mikejmnez commented 3 months ago

This looks great, @ayushnag! I will play with this in the next couple of days.