microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

Sentinel2 Dataset behavior #1758

Open calebrob6 opened 9 months ago

calebrob6 commented 9 months ago

Description

I have a Sentinel 2 scene with the following files (e.g. in ./test_scene/):

T36KVU_20210513T073609_B01_60m.tif
T36KVU_20210513T073609_B02_10m.tif
T36KVU_20210513T073609_B03_10m.tif
T36KVU_20210513T073609_B04_10m.tif
T36KVU_20210513T073609_B05_20m.tif
T36KVU_20210513T073609_B06_20m.tif
T36KVU_20210513T073609_B07_20m.tif
T36KVU_20210513T073609_B08_10m.tif
T36KVU_20210513T073609_B09_60m.tif
T36KVU_20210513T073609_B11_20m.tif
T36KVU_20210513T073609_B12_20m.tif
T36KVU_20210513T073609_B8A_20m.tif
T36KVU_20210513T073609_TCI_10m.tif

I would expect any of the following to work:

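from torchgeo.datasets import Sentinel2

# A single band whose file is stored at 60m
ds = Sentinel2(
    "test_scene/",
    bands=["B01"],
)
# Two bands stored at different resolutions (60m and 10m)
ds = Sentinel2(
    "test_scene/",
    bands=["B01", "B02"],
)
# Two bands with an arbitrary target resolution
ds = Sentinel2(
    "test_scene/",
    bands=["B01", "B02"],
    res=37,
)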

However, the filename_glob and filename_regex are set up in such a way that none of the above is recognized as a valid Sentinel 2 scene.
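A minimal sketch of how this kind of mismatch can arise, assuming the requested resolution is baked into the filename pattern as a fixed suffix. The pattern and the helper names below (build_pattern, matching_bands) are hypothetical and only illustrate the idea; they are not copied from torchgeo.

```python
import re


def build_pattern(res: int) -> re.Pattern:
    """Hypothetical pattern that only accepts files with a `_<res>m` suffix."""
    return re.compile(
        r"^T(?P<tile>\d{2}[A-Z]{3})"
        r"_(?P<date>\d{8}T\d{6})"
        r"_(?P<band>B[018][\dA])"
        rf"_{res}m\.tif$"
    )


# Two of the files from the scene listed above.
files = [
    "T36KVU_20210513T073609_B01_60m.tif",
    "T36KVU_20210513T073609_B02_10m.tif",
]


def matching_bands(res: int) -> list[str]:
    """Return the bands whose file name matches the pattern built for `res`."""
    pattern = build_pattern(res)
    return [m.group("band") for f in files if (m := pattern.match(f))]


print(matching_bands(10))  # only the 10m file matches
print(matching_bands(60))  # only the 60m file matches
print(matching_bands(37))  # no file carries a 37m suffix
```

Under this assumption, no single resolution value matches both B01 and B02, because the two files carry different suffixes.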

Steps to reproduce

see above

Version

0.6.0.dev0

calebrob6 commented 9 months ago

Further:

ds = Sentinel2(
    "test_scene/",
    bands=["B01", "B02"],
    res=60,
)

will not throw an error at construction, but indexing it with ds[ds.bounds] will.

@estherrolf for visibility

adamjstewart commented 9 months ago

This was specifically broken by https://github.com/microsoft/torchgeo/pull/754/files#diff-79277b084e67f13f6469cba19e6eadb93ce6c6479cef26161a0c847b75705a81

Basically, depending on where you download your data from, you either get:

  1. All bands in 10m resolution (resampled, no suffix)
  2. All bands in native resolution (10m, 20m, or 60m)
  3. All bands in all resolutions (10m, 20m, and 60m)

1, 2, and 3 are all somewhat contradictory. We could easily support each of these on its own, but supporting all 3 in combination is hard:

  A. Remove resolution from the regex (only supports 1)
  B. Replace resolution with a wildcard (only supports 1 and 2)
  C. Include 10m in the regex (only supports 3)
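The trade-off between options A, B, and C can be checked with a small sketch. Everything here is an assumption made for illustration: the three regexes are one possible reading of the options, the file names are made up to mimic the three layouts, and a layout counts as "supported" only when every requested band matches exactly one file.

```python
import re
from collections import Counter

# Shared prefix: tile, acquisition date, band (hypothetical, not torchgeo's regex).
PREFIX = r"^T\d{2}[A-Z]{3}_\d{8}T\d{6}_(?P<band>B[018][\dA])"

options = {
    "A": re.compile(PREFIX + r"\.tif$"),            # no resolution in the regex
    "B": re.compile(PREFIX + r"(?:_\d+m)?\.tif$"),  # any resolution, or none
    "C": re.compile(PREFIX + r"_10m\.tif$"),        # 10m hard-coded
}

stem = "T36KVU_20210513T073609_"
layouts = {
    1: [stem + "B02.tif", stem + "B05.tif"],          # resampled, no suffix
    2: [stem + "B02_10m.tif", stem + "B05_20m.tif"],  # native resolution
    3: [stem + f"{b}_{r}m.tif" for b in ("B02", "B05") for r in (10, 20, 60)],
}


def supported(pattern, names, bands=("B02", "B05")):
    """True if every band matches exactly one file name."""
    counts = Counter(m.group("band") for n in names if (m := pattern.match(n)))
    return all(counts[b] == 1 for b in bands)


support = {
    key: [i for i, names in layouts.items() if supported(pattern, names)]
    for key, pattern in options.items()
}
print(support)  # {'A': [1], 'B': [1, 2], 'C': [3]}
```

With the wildcard (B), layout 3 fails because each band matches several files, not because nothing matches.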

In order to prioritize the highest resolution, maybe we could sort the glob results lexicographically and choose the first one only? But that feels really sloppy and could probably break for more complicated hypothetical datasets.
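As a rough sketch of the "choose the highest resolution" idea, the resolution could be parsed as a number and compared numerically, which avoids relying on string ordering (lexicographically, a hypothetical "100m" would sort before "20m"). The regex, the function name finest_per_band, and the choice to treat a missing suffix as 10m are all assumptions for illustration, not torchgeo behavior.

```python
import re

# Hypothetical pattern: the resolution suffix is optional and captured as digits.
NAME = re.compile(
    r"^T(?P<tile>\d{2}[A-Z]{3})_(?P<date>\d{8}T\d{6})"
    r"_(?P<band>B[018][\dA])(?:_(?P<res>\d+)m)?\.tif$"
)


def finest_per_band(names):
    """Map (tile, date, band) to the file name with the smallest pixel size."""
    best = {}
    for name in names:
        m = NAME.match(name)
        if m is None:
            continue  # not a band file (e.g. TCI) or an unknown naming scheme
        # Assumption: a file without a suffix has been resampled to 10m.
        res = int(m.group("res") or 10)
        key = (m.group("tile"), m.group("date"), m.group("band"))
        if key not in best or res < best[key][0]:
            best[key] = (res, name)
    return {key: name for key, (res, name) in best.items()}


names = [
    "T36KVU_20210513T073609_B01_60m.tif",
    "T36KVU_20210513T073609_B01_20m.tif",
    "T36KVU_20210513T073609_B02_20m.tif",
    "T36KVU_20210513T073609_B02_10m.tif",
    "T36KVU_20210513T073609_TCI_10m.tif",
]
chosen = finest_per_band(names)
for key, name in sorted(chosen.items()):
    print(key[2], "->", name)
```

Grouping by tile and date as well as band keeps scenes from different acquisitions apart, which a plain "first glob result" rule would not.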

calebrob6 commented 9 months ago

I think this is strange behavior as one of the points of RasterDataset is that it can resample/align different layers to the same resolution.