GeoDataset is not well-suited to datasets with multiple CRSes

RitwikGupta commented 2 years ago

A common workflow to get geospatial datasets ML-ready is to reproject them to the correct UTM zone they originate in with fixed xres and yres in order to work with them directly in pixel space. Thus, datasets comprised of imagery from a wide area will have multiple CRSes.

RasterDataset currently relies on all the data being in the same CRS. If it is not, RasterDataset will automatically reproject the data to a fixed CRS (user-specified or the one found in the first raster).

One proposed solution is to use one RasterDataset per CRS and then combine them using torch.utils.data.ConcatDataset. This could work but feels inelegant.

A flag which lets me choose whether to reproject or not could work here.

adamjstewart commented 2 years ago

One proposed solution is to use one RasterDataset per CRS and then combine them using torch.utils.data.ConcatDataset. This could work but feels inelegant.

I don't believe this will work. ConcatDataset requires that both datasets have a well-defined length and use integer indexing.

I definitely agree that this is something we need to support. The biggest issue with reprojection is that it significantly slows down sampling rates (see section 4.2 of our paper). @calebrob6 implemented something like what you're looking for for the Chesapeake CVPR dataset, but it isn't compatible with any other GeoDataset.

I think this will likely be a major change we'll make for the 0.3.0 release (0.2.0 is coming soon). Reprojection is necessary for two purposes:

Combining two different GeoDatasets into an IntersectionDataset or UnionDataset to sample from both
Sampling multiple overlapping scenes from a single GeoDataset

We could potentially provide a flag that says "don't reproject, I know this prevents me from doing 1 and 2". I'm trying to think of a more elegant solution that only reprojects when necessary. Will update this issue thread as I come up with new ideas. Feel free to add to this thread if you (or anyone else) has ideas!

RitwikGupta commented 2 years ago

For 1, the whole idea of reprojecting to UTM with fixed pixel sizes is that the CRS no longer matters. For all purposes, I have turned the rasters into their locally low-distortion representations with square pixels. Why should I not be able to sample from two different datasets at that point? Is ConcatDataset not then equivalent to UnionDataset?

For 2, if all the scenes in the GeoDataset are already in the same CRS, then why does there need to be any reprojection step involved?

It could also be that I am being too clever for my own good w.r.t. this library. I could just leave my rasters in ESPG:4326, but then I incur large distortions at polar regions and oceans (up to 40km!). That's why we reproject to "local" UTMs in the first place.

adamjstewart commented 2 years ago

Why should I not be able to sample from two different datasets at that point?

That assumes that both datasets are in UTM. If they aren't, reprojection is again necessary.

Is ConcatDataset not then equivalent to UnionDataset?

PyTorch's ConcatDataset is similar in theory to UnionDataset (combining two datasets to get a bigger dataset) but is completely different under the hood.

For 2, if all the scenes in the GeoDataset are already in the same CRS, then why does there need to be any reprojection step involved?

All scenes in the GeoDataset are not in the same CRS, each scene is in its local UTM zone, and images on the boundaries of UTM zones that overlap could be in different UTM zones.

calebrob6 commented 2 years ago

Going to take a stab at this

For 1, the whole idea of reprojecting to UTM with fixed pixel sizes is that the CRS no longer matters. For all purposes, I have turned the rasters into their locally low-distortion representations with square pixels. Why should I not be able to sample from two different datasets at that point? Is ConcatDataset not then equivalent to UnionDataset?

The thing that doesn't work here is the way we actually index into GeoDatasets. If you have a GeoDataset, ds, then you do not index into it with an integer, e.g. ds[2], but instead use ds[bb] where bb is a BoundingBox that contains geographic coordinates (not pixel coordinates). Explaining why we do it this way requires a full rewind of most of our thinking on this to date:

Consider the general problem of indexing into a dataset (just getting some crop out of the dataset) made up of two rasters, A and B, that are projected into different UTM zones:

You can do this using geographic coordinates, i.e. a BoundingBox instance with values in some CRS (the CRS of A, CRS of B, or something else).
You can do this using pixel coordinates, i.e. a tuple (i, x, y, width, height) where i chooses A or B

Now, further, consider the problem where you want to be able to do spatial joins with this dataset. Assume you have another dataset with a raster C in some geographic coordinate system that covers both A and B (maybe this is a global land cover map or something). You'd like to sample patches from both dataset and have everything line up. (You aren't asking about this use case, but it helps explain why we've done what we have.)

For the first indexing approach: If you create a BoundingBox using the CRS of A, then you will need to reproject this bounding box to the CRS of B when you want to index into B. If you do this, then your BoundingBox will get rotated and everything will be awful (this is what happens in the ChesapeakeCVPR dataset and it is a pain -- let me know if you want me to explain further here). If you instead assume that B will be reprojected on-the-fly to the CRS of A, then indexing becomes easy.

For the second indexing approach: This is fine, but it is very hard to see how you can then do spatial joins. I previously created a PyTorch Dataset that you may find useful here (https://github.com/microsoft/ai4eutils/blob/master/geospatial/data/StreamingDatasets.py). You pass it a list of filenames to tiffs (and optionally masks) then it returns random patches of fixed size.

~We might want to have code that sits in between what we currently call "Geospatial datasets" and "Non-geospatial (i.e. already chipped up) datasets" to handle this use case, I'm not sure.~

@adamjstewart @isaaccorley, what if we allowed RasterDatasets to be indexed both ways described above then adjust the Samplers to behave in "mode 1" (geographic space) or "mode 2" (pixel-space)? Note: it is easy to convert (i, x, y, width, height) into a BoundingBox. Does this solve everything?

adamjstewart commented 2 years ago

I would have to think more about this. I wonder if it's possible to not reproject at all by default and only reproject when sampling from two datasets in a different CRS.

adamjstewart commented 2 years ago

Note: it is easy to convert (i, x, y, width, height) into a BoundingBox.

I'm assuming i here is the object id (file number) in the rtree? It's possible to convert (i, x, y, width, height) into a BoundingBox, but the reverse isn't possible since a bounding box may correspond to multiple overlapping files, each with a different origin (both in pixel-space and geographic space).

what if we allowed RasterDatasets to be indexed both ways described above then adjust the Samplers to behave in "mode 1" (geographic space) or "mode 2" (pixel-space)? ... Does this solve everything?

From what I can tell, "mode 2" would not allow intersections between datasets or overlapping files in a dataset, is this correct? This is pretty fundamental to GeoDataset, if we removed this we might as well just use a VisionDataset.

Ideally, what I would like to do is avoid reprojection when we don't need it and only reproject when sampling from multiple files in different CRSs. What if we:

Keep the index rtree in a single warped CRS like we currently do
Add the native CRS of the file to the rtree in addition to the filename
Change __getitem__ to accept a tuple (bbox, reproject=True/False) and only reproject when True
Teach the sampler to return (bbox, True) if a bbox would involve multiple CRSs, else (bbox, False)

I think we might run into issues with varying image size if we don't reproject though...

calebrob6 commented 2 years ago

the reverse isn't possible

Correct, but I don't think the reverse is needed.

From what I can tell, "mode 2" would not allow intersections between datasets or overlapping files in a dataset, is this correct? This is pretty fundamental to GeoDataset, if we removed this we might as well just use a VisionDataset.

You are correct that "mode 2" doesn't allow intersections, however if that is necessary, then the (i, x, y, width, height) can be converted.

Ideally, what I would like to do is avoid reprojection when we don't need it and only reproject when sampling from multiple files in different CRSs

I agree. Perhaps it'd be useful to write out expected behavior in different scenarios, e.g.:

If I use a sampler with a RasterDataset made up of tiles from different CRSs and don't use it as part of an IntersectionDataset, then I do not expect any reprojections to happen.

adriantre commented 1 year ago

What would it take to use multiple datasets? And the GeoSampler could create one grid for each dataset, or maybe multiple samplers are supported? Then the batch_indices along with dataload_idx lets us reference back to the source raster/file. https://lightning.ai/docs/pytorch/stable/data/iterables.html#using-lightningdatamodule

adamjstewart commented 1 year ago

What advantage would that have? You would still need to reproject and merge all indices inside the GeoSampler, then reproject the bbox back to the original CRS before sampling. I just don't want to make people create a separate dataset for every UTM Zone for a global dataset.

microsoft / torchgeo

GeoDataset is not well-suited to datasets with multiple CRSes #278