Support FLAIR Datamodule for semantic segmentation

rbavery commented 2 months ago

Summary

I just stumbled across FLAIR, a high-resolution (0.2 m) semantic segmentation dataset with 19 land cover categories. The full description is at https://ignf.github.io/FLAIR/#FLAIR1 and that page also references a U-Net baseline model trained on FLAIR.

It is described in an OGC Report on ML Engineering as "a comprehensive and high-quality collection of labeled satellite imagery aimed at advancing land cover classification and geospatial analysis tasks". It's maintained by the French National Institute of Geographic and Forest Information (IGN).

Rationale

I'm interested in composing a list of high quality, challenging datasets for benchmarking semantic segmentation and object detection models and at a glance FLAIR seems like one of them. It seems like the rigor of maintenance and description of the FLAIR dataset is high compared to other datasets and so I would like to see this available in torchgeo.

Also, I think I recall seeing that torchgeo would like to offer models that are inference-ready, not just pretrained with self-supervision but fine-tuned to address popular tasks in remote sensing. This U-Net model seems like a good start, but I can raise a separate issue for adding the model.

Implementation

I haven't contributed to torchgeo before but I would check out other PRs that added datamodules and pretrained models and follow that example.

Alternatives

No response

Additional information

This comment is on my mind: https://github.com/Clay-foundation/model/discussions/269#discussioncomment-9704291

I'd love for more challenging datasets to be front and center when benchmarking models rather than EuroSAT. I'd also like us to use datasets whose geographic and class distributions we understand well, and I like that FLAIR lays this out on their site.

I might be a bit slow to implement this but would like to submit a PR when I have time if this sounds like a good idea.

adamjstewart commented 2 months ago

Yes, we would love to have FLAIR in TorchGeo!

If you would like to take a stab at this, see https://torchgeo.readthedocs.io/en/stable/user/contributing.html#datasets for a list of files you'll need to modify. We have dozens of other semantic segmentation datasets you can base your code on. Let us know if you have any trouble with the testing and we would be happy to help!

MathiasBaumgartinger commented 1 month ago

Interestingly, I have just been working with FLAIR, and they happily gave permission to use their dataset in torchgeo. I am currently working on releasing it as a datamodule; however, there are some flaws in the current state of the FLAIR dataset which make it somewhat hard to integrate (see https://github.com/microsoft/torchgeo/discussions/2292).

adamjstewart commented 1 month ago

I thought we solved all the problems in #2292?

MathiasBaumgartinger commented 1 month ago

Absolutely my bad! I linked the wrong issue: https://github.com/rasterio/rasterio/discussions/3178 and https://github.com/OSGeo/gdal/issues/10820.

As discussed there, the FLAIR dataset currently provides a geographically underspecified mask dataset. While there is a workaround for this (for some reason I found that both a gdal_edit.py and a subsequent gdalwarp with a correct CRS are necessary), I spoke to one of the maintainers of the dataset about a possible fix on their side.
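For readers who want to try the two-step workaround, here is a minimal sketch that only builds the two command lines. The exact flags used are not stated in this thread, so `-a_srs` for gdal_edit.py, `-s_srs`/`-t_srs` for gdalwarp, the choice of EPSG:2154, and the mask file name are all assumptions for illustration.

```python
from pathlib import Path

# Assumption: Lambert-93, the CRS the FLAIR data paper names for the aerial imagery.
TARGET_CRS = "EPSG:2154"


def crs_fix_commands(mask_path: str, out_dir: str, crs: str = TARGET_CRS) -> list[list[str]]:
    """Build (but do not run) the two GDAL commands for one mask file."""
    src = Path(mask_path)
    dst = Path(out_dir) / src.name
    return [
        # Step 1: assign the CRS in place, without touching pixel values.
        ["gdal_edit.py", "-a_srs", crs, str(src)],
        # Step 2: write a new file through gdalwarp with the CRS stated explicitly.
        ["gdalwarp", "-s_srs", crs, "-t_srs", crs, str(src), str(dst)],
    ]


# Hypothetical file name, used only to show the resulting commands.
commands = crs_fix_commands("masks/MSK_000001.tif", "masks_fixed")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once GDAL is installed; keeping command construction separate makes it easy to inspect the commands before touching any files.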

My initial plan was to wait for the maintainers to fix the problem and release the module afterwards. However, if you think it would be appropriate, I could integrate the GDAL operations into the DataModule and create a PRQ right away.

adamjstewart commented 1 month ago

The images are already pre-chipped. Any reason you don't want to make a NonGeoDataset?

MathiasBaumgartinger commented 1 month ago

Not necessarily. I just thought that if there is geo-information, I might as well make it accessible. Do you prefer it as a NonGeoDataset?

adamjstewart commented 1 month ago

I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.

MathiasBaumgartinger commented 1 month ago

Alright! Another thing that comes to mind is that the most recent release (FLAIR#2) jointly includes preprocessed SENTINEL-2 data alongside aerial and mask images. A short summary of the changes mentioned in the datapaper:

[There is a] strong difference in spatial resolution [...]. Therefore, in order to also provide a minimum of context from the satellite data, a buffer was applied to create super-areas.

- Use of super areas only: in order to limit the size of the data and due to the wide extent of the dataset, only the super-areas were downloaded
- Resampling: the 20 m spatial resolution bands are first resampled during data retrieval to 10 m by the nearest interpolation method. Same approach is adopted for the cloud and snow masks
- Removal of nodata pixels: nodata pixels (reflectances at 0) [...] were removed
- Reprojection: subsequently reprojected into the Lambert-93 projection (EPSG:2154) which is the one of the aerial imagery.

Additional information

| Data Type | Naming | Shape |
| --- | --- | --- |
| ground truth | `SEN2_xxxx_data.npy` | $T \times C \times H \times W$ |
| snow/cloud masks | `SEN2_xxxx_masks.npy` | $T \times C \times H \times W$ |
| time series products | `SEN2_xxxx_products.txt` | - |
| json mapping | `flair-2_centroids_sp_to_patch.json` | - |

Sentinel-2 super-areas (SEN2) data is composed of several elements - data, masks, products and a JSON file to match aerial and satellite imagery. [The JSON file] uses the aerial patch name (e.g., IMG 077413) as the key and provides a list of two indexes (e.g., [13,25]) that represent the data-coordinates of the aerial patch centroids
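To make the centroid mapping concrete, here is a small sketch of turning one JSON entry into a crop window inside a super-area. The key spelling, the (row, col) index order, the window size, and the super-area dimensions are assumptions for illustration; the data paper excerpt above does not specify them.

```python
import json

# Hypothetical excerpt in the shape the data paper describes:
# aerial patch name -> two indexes locating the patch centroid in the super-area.
mapping = json.loads('{"IMG_077413": [13, 25]}')


def crop_window(centroid, size, height, width):
    """Return (row_slice, col_slice) for a size x size window around centroid.

    Assumes centroid is (row, col); swap the two values if the file stores (x, y).
    The window is shifted inward where it would cross the array border.
    """
    half = size // 2
    bounds = []
    for center, limit in zip(centroid, (height, width)):
        start = min(max(center - half, 0), max(limit - size, 0))
        bounds.append(slice(start, min(start + size, limit)))
    return tuple(bounds)


rows, cols = crop_window(mapping["IMG_077413"], size=10, height=40, width=40)
print(rows, cols)
```

With the `SEN2_xxxx_data.npy` array loaded through NumPy, `data[:, :, rows, cols]` would then select the window for every date and band. Checking the index order against a known patch is worthwhile before relying on it.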

With the considerations above, I suppose it would be better to include the data provided by the maintainers of FLAIR as opposed to the original SENTINEL-2 dataset of torchgeo using an intersection dataset. Any thoughts on that?

adamjstewart commented 1 month ago

Since the filenames and file formats are completely different from raw Sentinel-2 data, we would either have to create a new class for FLAIRSentinel2(Sentinel2) and use an intersection, or just use a NonGeoDataset to avoid all of that complexity.

agarioud commented 1 month ago

Hello,

As a maintainer of the FLAIR dataset at IGN, I greatly appreciate your effort in integrating our dataset.

Seeing this issue, I would like to give you some information about the release of a new version of FLAIR in the coming weeks. Among other things, it will spatially align (patch-wise, so no more super-areas) multiple modalities, including Sentinel-2, with common file formats. This new release will also be larger (about 3 times the current size).

You might consider waiting until this new version is released before putting in the effort of integrating the FLAIR#2 Sentinel-2 imagery. Aerial imagery will stay in the same format.

We are happy to provide any information or support that can help in this effort.

MathiasBaumgartinger commented 1 month ago

@agarioud, nice to hear directly from you! I've already finished a reasonably working version, so the effort to finish everything (mainly documentation/cleanup) would be very small. Can you give me a more specific time frame in which you plan to release?

Also, are you willing to share the pre-processing steps applied to the Sentinel-2 data? In a released version I would like to be able to perform those processing steps on other Sentinel-2 data as well, to achieve maximum performance during prediction in unseen areas.

agarioud commented 1 month ago

Unfortunately, I cannot give you an exact time frame. We are currently working on preparing the data and, as for the previous releases, we would like to add some documentation to it. I'll notify you as soon as we have more visibility.

Regarding pre-processing of Sentinel-2, we follow these steps: we use BOA L2A data cropped to the aerial patch extent (512 px at 0.2 m, so 102.4 x 102.4 m) and resample the Sentinel-2 data to 10.24 m to obtain 10x10 pixel patches. For each patch, we stack the 10 spectral bands of every date together (i.e., with 38 acquisitions, the patch has 380 channels) to reduce the inode footprint of the dataset. If you need more precise information, you can contact me: flair@ign.fr
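The numbers in this description can be checked with a few lines of arithmetic. The sketch below also shows one way to map a stacked channel index back to a date and band; the date-major stacking order is an assumption, since the comment does not say how the 380 channels are ordered.

```python
AERIAL_PATCH_PX = 512   # aerial patch width/height in pixels
AERIAL_RES_M = 0.2      # aerial ground sampling distance in metres
S2_RES_M = 10.24        # resampled Sentinel-2 pixel size in metres
BANDS_PER_DATE = 10     # spectral bands kept per acquisition

extent_m = AERIAL_PATCH_PX * AERIAL_RES_M   # patch side length in metres
s2_patch_px = round(extent_m / S2_RES_M)    # Sentinel-2 pixels per patch side


def split_channel(channel: int, bands_per_date: int = BANDS_PER_DATE) -> tuple[int, int]:
    """Map a stacked channel index to (date_index, band_index).

    Assumes date-major stacking: all bands of date 0, then all bands of date 1, ...
    """
    return divmod(channel, bands_per_date)


n_dates = 38
n_channels = n_dates * BANDS_PER_DATE
print(extent_m, s2_patch_px, n_channels, split_channel(n_channels - 1))
```

Under the same ordering assumption, a NumPy array of shape (380, 10, 10) could be viewed as $T \times C \times H \times W$ with `stack.reshape(n_dates, BANDS_PER_DATE, 10, 10)`.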

Also, we will release a new batch of pre-trained models on the new dataset on our HuggingFace IGNF page.

agarioud commented 1 month ago

I forgot to say that we store snow and cloud masks as separate files. Also, the acquisition dates are stored in a JSON file for each area.

adamjstewart commented 1 month ago

Do you think it would be useful to have a single dataset with a version parameter that allows users to choose which version of the dataset they want? I'm guessing this would primarily be useful for historical reasons (to compare against papers that used v1). Could also have a base class with subclasses for each version, but it sounds like the name is the same. I guess it depends on how similar the file structure is and if the only difference is simply the total number of images. Either way, from the TorchGeo side, I'm happy with multiple versions of FLAIR if it isn't too much work to support.

agarioud commented 1 month ago

The new release will include all previous areas and data but extend to other areas and other modalities. As such, I think versioning is not necessary; an 'area/patch' selection corresponding to the FLAIR#1 and #2 versions might be enough?

That being said, if one would like to include the super-areas of FLAIR#2, this would need some specific dataloading.
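As a rough illustration of the 'area/patch' selection idea, the sketch below filters patches by the areas belonging to a named release instead of switching code paths on a version number. The area identifiers, the `Patch` record, and the `subset` parameter are hypothetical; the real per-release area lists are not given in this thread.

```python
from dataclasses import dataclass

# Hypothetical area identifiers standing in for the real per-release lists.
RELEASE_AREAS = {
    "flair1": {"area_a", "area_b"},
    "flair2": {"area_a", "area_b", "area_c"},
}


@dataclass(frozen=True)
class Patch:
    area: str
    name: str


def select_patches(patches, subset=None):
    """Keep patches whose area belongs to the named release subset (None keeps all)."""
    if subset is None:
        return list(patches)
    if subset not in RELEASE_AREAS:
        raise ValueError(f"unknown subset {subset!r}; choose from {sorted(RELEASE_AREAS)}")
    allowed = RELEASE_AREAS[subset]
    return [p for p in patches if p.area in allowed]


patches = [Patch("area_a", "p1"), Patch("area_c", "p2"), Patch("area_d", "p3")]
print([p.name for p in select_patches(patches, "flair1")])
```

A dataset class could accept such a `subset` argument in its constructor and apply the filter while building its file list, so that comparisons against earlier papers use the same patches.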

rbavery commented 1 month ago

> I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.

The Clay model has a location encoder and could utilize the geographic information. I think a GeoDataset would be more valuable in the long run for models that can accept inputs beyond images. It also provides useful context for sampling and evaluation.

MathiasBaumgartinger commented 1 month ago

> > I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.

> The Clay model has a location encoder and could utilize the geographic information. I think a GeoDataset would be more valuable in the long run for models that can accept inputs beyond images. It also provides useful context for sampling and evaluation.

That was pretty much my initial thought. Lots of research is trying to utilize information beyond just color channels.

So, to conclude: I will create a first pull request using a NonGeoDataset/NonGeoDataModule for the current version 2 of FLAIR. I will let the maintainers decide whether to merge it or wait for the newly released version. If left unmerged for now, people may still cherry-pick it.

In any case I will try my best to update the module ASAP once @agarioud and his team release the new FLAIR dataset with properly specified CRS on the masks.

adamjstewart commented 1 month ago

Note that it is possible to return lat/long coords from a NonGeoDataset. The difference (in my mind) between that and a GeoDataset is storing all bounding boxes in a spatiotemporal R-tree. This can be slower, but makes it easier to sample small patches from large tiles or to combine the dataset with other GeoDatasets.
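A minimal sketch of that first option: a plain map-style dataset that returns a coordinate alongside each sample without building any spatial index. The record layout and the `"centroid"` key are hypothetical, image and mask loading is left out, and the centroid stays in projected coordinates, since converting to lat/long would need a reprojection library.

```python
class PatchDataset:
    """Map-style dataset over pre-chipped patches; no spatial index is built."""

    def __init__(self, records, patch_px=512, res_m=0.2):
        # records: list of (name, x_min, y_max) giving each patch's top-left corner
        # in projected coordinates (hypothetical layout for this sketch).
        self.records = records
        self.half_extent = patch_px * res_m / 2

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        name, x_min, y_max = self.records[index]
        # y decreases downwards in north-up rasters, so subtract for the centre.
        centroid = (x_min + self.half_extent, y_max - self.half_extent)
        # A real implementation would also load "image" and "mask" tensors here.
        return {"name": name, "centroid": centroid}


ds = PatchDataset([("patch_0", 800000.0, 6500000.0)])
sample = ds[0]
print(len(ds), sample["centroid"])
```

Because each sample is looked up by integer index, there is no R-tree to build or query; the trade-off, as noted above, is that sampling sub-windows from large tiles or intersecting with other GeoDatasets is not available.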

adamjstewart commented 3 weeks ago

I think @nilsleh needs a FLAIR data loader for some of his work. From our side, we would love to see a version 1 data loader in the near future that can later be converted to a version 2 once the new dataset is released.

MathiasBaumgartinger commented 3 weeks ago

Hi! I had a packed schedule the last few weeks. I think I can work on refining my first draft and create a first PRQ for review tomorrow.

MathiasBaumgartinger commented 3 weeks ago

FYI: I worked on the FLAIR dataset yesterday and today. However, integrating the Sentinel data (which I have not used before) is sadly turning out to be far more complicated than I had hoped.

You can see my progress at: https://github.com/MathiasBaumgartinger/torchgeo

rbavery commented 3 weeks ago

What were the challenges?

MathiasBaumgartinger commented 3 weeks ago

Well, what took me the most time was a classic y, x instead of x, y ordering mistake 😅 . EDIT: other challenges described in the PRQ.

Happy to share a first draft: https://github.com/microsoft/torchgeo/pull/2394 📨
