microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

Xarray Dataset #1486

Open nilsleh opened 1 year ago

nilsleh commented 1 year ago

Summary

I am working with different climate data sources that come in the form of .netcdf files and xarrays. Although I am not an expert in that domain, it seems that this is the go-to data format that is frequently used. Since there are lots of features in TorchGeo that I would like to use with this data without having to reformat to tiff files, for example, I think it could be quite powerful to add support for Xarray datasets, even though it would add another couple of dependencies to TorchGeo.

Rationale

This could quite possibly extend the horizon of users to other communities that work with Xarray data and would benefit from all the tools TorchGeo already provides. In the majority of cases climate data also comes in the form of time series, so this would go hand in hand with the planned support for TimeSeries models and data loading in TorchGeo.

Implementation

Both in #412 and #509 there was some discussion about this, but nothing was finalized. I am definitely willing to start on this but don't have a detailed plan yet as I first wanted to gather opinions on this.

Alternatives

No response

Additional information

No response

adamjstewart commented 1 year ago

Completely agree with adding support for this, even if it means more deps. @RitwikGupta and @cjrd are our climate experts and may also have thoughts on the best way to do this. I'm not sure if/how we could support a 4th (z) dimension that frequently comes with climate data. But let's first focus on how to best handle xarray, especially when it comes to reprojection and geospatial indexing. Making a new subclass of GeoDataset that works similarly to RasterDataset will already be a big enough endeavor.

Can't remember if @isaaccorley ever worked on this before.

isaaccorley commented 1 year ago

I've used it to load some netcdf files. It has good support for climate datasets and seems like it's widely used by the community.

calebrob6 commented 1 year ago

@weiji14 is another expert here (driver of https://github.com/microsoft/torchgeo/pull/509) and is independently supporting stuff like this in zen3geo https://github.com/weiji14/zen3geo

noahgolmant commented 8 months ago

I would like to continue discussing how to best implement this @nilsleh! I am new to torchgeo, but one challenge here is the current structure of the base GeoDataset abstraction. It assumes a list of file paths to load, in our case likely via rioxarray.open_rasterio. This makes it awkward for a user to supply an arbitrary xarray dataset to __init__. I am not sure how to best support integration with other datasets that may not directly live in a filesystem, like STAC catalogs (via stackstac) or EarthEngine (via xee). We could just ignore the paths variable for now?

I also think for simplicity, it would be easiest to support a single xarray.Dataset or DataArray object with one resolution. This pushes the complexity of merging separate DataArray files back to the user, which doesn't feel too demanding. I don't think this class should do the heavy lifting of reprojecting potentially large datasets. This way, you come in with a nice datacube. In the future, it would be nice to support multiple datasets, or multi-resolution data sources, but there are other changes that would need to happen for that to work. For example, one reviewer mentioned maintaining the image / mask pattern for the sample output dictionary. Since you can't stack multiple resolutions into one image entry, I think we'd have to deviate from that.

adamjstewart commented 8 months ago

one challenge here is the current structure of the base GeoDataset abstraction. It assumes a list of file paths to load

This is only true for RasterDataset, not GeoDataset. The only requirement is that we need to create an R-tree index of bounding boxes; how that is done isn't important.
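To illustrate that requirement, here is a minimal sketch of a bounding-box index in plain Python. It is only an illustration under stated assumptions: a linear scan stands in for the R-tree, and `BBox`, `SimpleIndex`, and the payload values are hypothetical names, not TorchGeo's API. The point is that the indexed payload can be anything (a file path or an in-memory object), as long as it has a bounding box.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class BBox:
    """Hypothetical spatiotemporal bounding box (not TorchGeo's BoundingBox)."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap only if they overlap along every axis.
        return (
            self.minx <= other.maxx and self.maxx >= other.minx
            and self.miny <= other.maxy and self.maxy >= other.miny
            and self.mint <= other.maxt and self.maxt >= other.mint
        )


class SimpleIndex:
    """Linear-scan stand-in for an R-tree: stores (bbox, payload) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[BBox, Any]] = []

    def insert(self, bbox: BBox, payload: Any) -> None:
        self._items.append((bbox, payload))

    def query(self, bbox: BBox) -> list[Any]:
        # Return every payload whose box overlaps the query box.
        return [p for b, p in self._items if b.intersects(bbox)]


index = SimpleIndex()
index.insert(BBox(0, 10, 0, 10, 0, 1), "file_a.tif")                 # a path
index.insert(BBox(0, 10, 0, 10, 0, 1), {"variable": "temperature"})  # an in-memory object
index.insert(BBox(20, 30, 20, 30, 0, 1), "file_b.tif")

hits = index.query(BBox(5, 6, 5, 6, 0, 1))
```

A real implementation would swap the linear scan for a proper spatial index, but the contract (insert a box with a payload, query by box) stays the same.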

nilsleh commented 8 months ago

I will try to summarize the discussion and pain points encountered so far when we first started looking at this. Generally, one could consider a sort of similar distinction for these Grid based datasets as we have for Raster datasets.

NonGeoGridDataset

This would be something along the lines of the datacube that @noahgolmant is describing, where you have a fixed xarray cube of Time x Height x Width for several data variables (you can specify which variables should be inputs to your model and which ones should be targets), and sampling would consist of retrieving different patches from that data cube, which is just pure indexing. There exists a library, xpatcher or xbatcher, that could serve as the basis for sampling from these datasets.

Pros:

Cons:

Examples:

GeoGridDataset

These datasets would require building an R-tree type index (no restrictions how to achieve that) with the overarching goal of being able to combine GeoGridDatasets with all the other GeoDatasets that could include Raster, Vector data etc.

Pros:

Cons:

Examples:

What we tried in the linked PR so far would fall under a GeoGridDataset, which has a lot more subtleties, edge cases, etc., but in principle, once solved, it would also cover all NonGeoGridDatasets. There are certainly more things to cover here, but maybe this is a starting point for a layout and possible plan of attack, so feel free to criticize or extend any of these points. And as a caveat, I am also not an expert in xarray, so there could be things that I am over- or under-complicating.

noahgolmant commented 8 months ago

Thanks for the clarifications and explanations @adamjstewart and @nilsleh! @adamjstewart, here I am referring to the paths field and files property in the base GeoDataset class, which in this instance would be left unused by the subclass assuming we take in a loaded object.

@nilsleh I think it would be helpful to constrain the xarray dataset class to be a GeoDataset rather than NonGeoDataset because spatial metadata and coordinates are already required for rioxarray operations like clip and reproject. rioxarray has conventions like x/y named coordinates and a .rio.crs attribute, and it can compute the transforms/resolution from this quickly.
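As a rough illustration of why named x/y coordinates are enough to recover resolution and bounds, the arithmetic can be sketched in plain Python. This is not how rioxarray implements it; `grid_geometry` is a hypothetical helper, and it assumes regularly spaced cell-centre coordinates with x ascending and y descending (north-up).

```python
def grid_geometry(x, y):
    """Derive (xres, yres) and (minx, miny, maxx, maxy) from 1-D coordinates.

    Assumes x and y hold regularly spaced cell centres, with x ascending
    and y descending (a north-up grid). Hypothetical helper for illustration.
    """
    xres = x[1] - x[0]
    yres = y[0] - y[1]
    # Coordinates mark cell centres, so the outer edge is half a cell further out.
    minx = x[0] - xres / 2
    maxx = x[-1] + xres / 2
    maxy = y[0] + yres / 2
    miny = y[-1] - yres / 2
    return (xres, yres), (minx, miny, maxx, maxy)


x = [0.5, 1.5, 2.5, 3.5]  # four columns, one unit wide
y = [2.5, 1.5, 0.5]       # three rows, one unit tall
res, bounds = grid_geometry(x, y)
print(res, bounds)
```

With the resolution and outer bounds in hand, a bounding box for the whole cube can be inserted into the dataset's spatial index without reading any pixel data.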

I think it's helpful to defer the work of setting metadata and merging arrays to either (1) the user loading datasets from disk or (2) a subclass operating on fixed paths, like the other RasterDataset subclasses in this package. For example, in your original PR, combining xarray DataArrays into a single Dataset object with multiple variables might resolve some of the complexity? I can give it a try. This seems to match more closely the amount of work that RasterDataset does today; I think the rasters are assumed to have the same set of bands, for example.

The GeoGridDataset class does seem like it'd be very powerful! I'd love to see that functionality supported. It would be interesting to see the challenges that come up scaling that to larger spatiotemporal scales as well.

noahgolmant commented 8 months ago

@nilsleh here is a draft, not tested yet, curious to hear your thoughts! https://github.com/microsoft/torchgeo/compare/main...noahgolmant:torchgeo:noah/xarray?expand=1

nilsleh commented 7 months ago

@nilsleh here is a draft, not tested yet, curious to hear your thoughts! https://github.com/microsoft/torchgeo/compare/main...noahgolmant:torchgeo:noah/xarray?expand=1

Cool, so of course difficult to say without tests, but from first glance it looks like it could work.

Generally, NonGeoDatasets can also have spatial metadata etc, the distinction between NonGeo and Geo is mainly how does one draw samples from a dataset. So one could approach an xarray dataset as a fixed datacube that can just be indexed through array indexing without any geo information. For example, the xpatcher library simply defines a list of index locations that one can loop through and patches are returned from the xarray datacube via array indexing.
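The "list of index locations" idea can be sketched without any geo information at all. The following is a plain-Python illustration, not the API of the patching library mentioned above: `patch_slices` and `take` are hypothetical helpers, and a nested list stands in for a Time x Height x Width datacube.

```python
from itertools import product


def patch_slices(shape, patch, stride):
    """List the slice-tuples that tile an array of `shape` with `patch`-sized windows."""
    starts = [range(0, s - p + 1, st) for s, p, st in zip(shape, patch, stride)]
    return [
        tuple(slice(i, i + p) for i, p in zip(origin, patch))
        for origin in product(*starts)
    ]


def take(cube, location):
    """Index a nested-list (time, height, width) cube with a tuple of three slices."""
    t, h, w = location
    return [[row[w] for row in frame[h]] for frame in cube[t]]


# Toy 2 x 4 x 4 cube; each value encodes its own (t, row, col) position.
cube = [[[t * 100 + r * 10 + c for c in range(4)] for r in range(4)] for t in range(2)]

locations = patch_slices((2, 4, 4), patch=(2, 2, 2), stride=(2, 2, 2))
first = take(cube, locations[0])
last = take(cube, locations[-1])
```

Looping over `locations` visits every patch exactly once, which is all a NonGeo-style dataset needs for `__len__` and `__getitem__`.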

However, ideally everything would be a GeoDataset so I do like your approach for that. For me it helped a lot to try and find multiple common datasources and then test the dataset. The other PR also has some dummy data if you would wanna start with that. Might be easier to discuss if you open a PR for your approach.

adamjstewart commented 7 months ago

the distinction between NonGeo and Geo is mainly how does one draw samples from a dataset

It's also about whether or not two datasets can be combined via intersection/union. So if you have a benchmark dataset where you don't need to combine it, NonGeo is fine. But if it's just a single raster input layer or mask, you'll need it to be GeoDataset so you can combine with other datasets (either another xarray dataset or any other dataset as well).
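A small sketch of what combining by intersection or union means at the level of bounding boxes, in plain Python. The function names and the `(minx, miny, maxx, maxy)` tuples are assumptions for illustration, not TorchGeo's dataset classes; time is left out to keep the example short.

```python
def intersect(a, b):
    """Overlap of two (minx, miny, maxx, maxy) boxes, or None if they are disjoint."""
    minx, miny = max(a[0], b[0]), max(a[1], b[1])
    maxx, maxy = min(a[2], b[2]), min(a[3], b[3])
    if minx >= maxx or miny >= maxy:
        return None
    return (minx, miny, maxx, maxy)


def intersection_extents(ds1, ds2):
    """Regions covered by BOTH datasets (e.g. an image layer and a mask layer)."""
    regions = []
    for a in ds1:
        for b in ds2:
            overlap = intersect(a, b)
            if overlap is not None:
                regions.append(overlap)
    return regions


def union_extents(ds1, ds2):
    """Regions covered by EITHER dataset (e.g. two tiles of the same product)."""
    return list(ds1) + list(ds2)


image_tiles = [(0, 0, 10, 10), (10, 0, 20, 10)]  # e.g. two raster tiles
climate_grid = [(5, 5, 15, 15)]                  # e.g. one xarray cube

both = intersection_extents(image_tiles, climate_grid)
either = union_extents(image_tiles, climate_grid)
```

Samples can only be drawn jointly from the regions in `both`, which is why an xarray-backed layer needs geospatial bounds if it is to be paired with other datasets.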