microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

VectorDataset has always the same time range from 0 to sys.maxsize and how to deal with multiple image and mask pairs for training. #2165

Open tgoelles opened 1 month ago

tgoelles commented 1 month ago

Description

Dates in the filename_regex of a VectorDataset have no effect: the mint bound is always 0 and the maxt bound is always sys.maxsize. I expected the same behavior as with RasterDataset.

I thought that I could give multiple GeoTIFF files as "image" input and multiple shapefiles for the masks. Torchgeo would then match the timestamps and use the matched pairs to train the model.

As it seems now, only RasterDataset can handle different timestamps. Does torchgeo even support multiple pairs of mask and image? It seems it might work if both are RasterDataset, but not with RasterDataset + VectorDataset.

Steps to reproduce

  1. Unzip the archive (Archive.zip) and put it in the same directory as the code.

# %%
from torchgeo.datasets import RasterDataset, VectorDataset
from torchgeo.datasets.utils import BoundingBox
from typing import Any
import os
import geopandas as gpd

# %%
import torchgeo

torchgeo.__version__

# %%
class MasterDataset(RasterDataset):
    filename_glob = "sar_model_input*.tif"
    filename_regex = r"sar_model_input_(?P<date>\d{8}T\d{6})_.*\.tif"
    date_format = "%Y%m%dT%H%M%S"
    all_bands = ["vh", "vv", "vvvh"]
    separate_files = False
    is_image = True

    def __init__(self, root: str, **kwargs: Any):
        super().__init__(root, **kwargs)

    def __getitem__(self, query: BoundingBox) -> dict[str, Any]:
        sample = super().__getitem__(query)
        return sample

# %%
class AvalancheMasksDataset(VectorDataset):
    filename_glob = "avalanche_outlines*.shp"
    filename_regex = r"avalanche_outlines_(?P<date>\d{8}T\d{6})_.*\.shp"
    date_format = "%Y%m%dT%H%M%S"
    is_image = False

    def __init__(self, root: str, **kwargs: Any):
        super().__init__(root, **kwargs)

    def __getitem__(self, query: BoundingBox) -> dict[str, Any]:
        sample = super().__getitem__(query)
        return sample

# %%
current_dir = os.getcwd()

# %%
master_dataset = MasterDataset(current_dir)
avalanche_masks_dataset = AvalancheMasksDataset(current_dir)

# %%
master_dataset.bounds

# %%
avalanche_masks_dataset.bounds

# %%
avalanche_masks_dataset.files

# %%
dataset = master_dataset & avalanche_masks_dataset

# %%
dataset.bounds

# %%
dataset.index

Version

0.5.2

adamjstewart commented 1 month ago

Hi @tgoelles! The feature you're describing was actually recently added by @oddeirikigland in #1814 and will be included in the next 0.6.0 release. While you wait, you could also install a development version to get the latest features. I'm hoping to release 0.6.0 soon, but teaching and deadlines have kept me busy.

tgoelles commented 1 month ago

Hi @adamjstewart.

I tried the new VectorDataset and now the bounds work for the avalanche_masks_dataset.

Now to the second part of the question. I want to train on multiple mask-image pairs. Pairs that belong together have the same timestamp.

avalanche_masks_dataset has shapefiles with different dates; master_dataset has GeoTIFF files, also with different dates.

I thought that this was implemented, and that this is what the time bounds are for. It seems that IntersectionDataset and RandomBatchGeoSampler do not match date pairs? So how should I deal with this, and are there plans to include it?

adamjstewart commented 1 month ago

It seems that IntersectionDataset and RandomBatchGeoSampler do not match date pairs?

As far as I know, they should. If you can provide a few sample (can be fake) images and masks that reproduce the problem you are seeing, I can try debugging this.

tgoelles commented 1 month ago

Here is an example: https://www.transfernow.net/dl/20240718Zfb8Iwjq

Maybe it does work, but I have no idea if it works. In general it is a bit frustrating that it feels too much like a blackbox and does not fail, or at least warn, when things are not working. So something like a strict and/or verbose mode would be great, in addition to better documentation. I think I could contribute something to that in the future.

adamjstewart commented 1 month ago

it feels too much like a blackbox

Welcome to deep learning 😆

maybe it does work, but I have no idea if it works.

You can quickly check by plotting the pairs that get sampled. You could also print the filename that is being loaded in torchgeo/datasets/geo.py.
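
One way to do that check is to pull a handful of queries from the sampler and record the time range and mask content of each returned sample. This is a minimal sketch, assuming the sampler yields bounding boxes with `mint`/`maxt` attributes (as `BoundingBox` has in the 0.5 and 0.6 releases) and that the combined dataset returns a dict with a `"mask"` entry. The helper `summarize_samples` is hypothetical and not part of torchgeo.

```python
from itertools import islice


def summarize_samples(dataset, sampler, limit=5):
    """Collect the time range and mask sum for the first few sampled queries."""
    rows = []
    for query in islice(sampler, limit):
        sample = dataset[query]
        rows.append(
            {
                # Raw timestamps of the query, as stored on the bounding box.
                "mint": query.mint,
                "maxt": query.maxt,
                # A sum of zero means the patch contains no mask pixels at all.
                "mask_sum": float(sample["mask"].sum()),
            }
        )
    return rows


# Intended usage with the objects from the issue (requires torchgeo):
#
#   from torchgeo.samplers import RandomGeoSampler
#   dataset = master_dataset & avalanche_masks_dataset
#   sampler = RandomGeoSampler(dataset, size=256, length=20)
#   for row in summarize_samples(dataset, sampler):
#       print(row)
```

If every row reports a `mask_sum` of zero, either the pairs are not matching or the sampled patches simply miss the outlines, and plotting a few samples should tell the two cases apart.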

So something like a strict and/or verbose mode would be great

We need to make sure it doesn't impact speed, but we could definitely add optional verbose logging just to see what files are being loaded/written.
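
In the meantime, something similar can be approximated from user code. The sketch below is hypothetical and not a torchgeo feature: `LoggedDataset` forwards indexing to any wrapped dataset and logs each query at DEBUG level, so the overhead is small when logging is disabled. It logs the query and the returned keys, not the filenames, because it does not look inside the wrapped dataset.

```python
import logging

logger = logging.getLogger("dataset_queries")


class LoggedDataset:
    """Wrap a dataset and log every query passed to __getitem__."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, query):
        # Lazy %-formatting: the message is only built if DEBUG is enabled.
        logger.debug("query=%r", query)
        sample = self.dataset[query]
        logger.debug("returned keys=%s", sorted(sample))
        return sample

    def __getattr__(self, name):
        # Avoid infinite recursion if the wrapper is not fully initialised.
        if name == "dataset":
            raise AttributeError(name)
        # Forward everything else (bounds, index, ...) to the wrapped dataset.
        return getattr(self.dataset, name)
```

Enabling it is then a matter of calling `logging.basicConfig(level=logging.DEBUG)` and indexing the wrapper instead of the original dataset.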

oddeirikigland commented 1 month ago

I thought that this was implemented, and that this is what the time bounds are for. It seems that IntersectionDataset and RandomBatchGeoSampler do not match date pairs? So how should I deal with this, and are there plans to include it?

The images and masks should match in pairs based on the date. But keep in mind that the random sampler draws a random area within the intersection of the image and mask files. Since you have a full TIFF file, I assume the avalanche mask covers only a small portion of it, so many of the samples will be negative.
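
The date matching can be illustrated without torchgeo. The sketch below is a plain-Python approximation, not torchgeo's implementation: it parses the timestamp from each filename with the `filename_regex` and `date_format` from the issue, treats each file as covering a one-second interval (an assumption about how a second-resolution date expands into `mint`/`maxt`), and pairs an image with a mask only when their intervals overlap. The filenames and the helpers `time_range` and `match_pairs` are hypothetical.

```python
import re
from datetime import datetime, timedelta

DATE_FORMAT = "%Y%m%dT%H%M%S"
IMAGE_REGEX = re.compile(r"sar_model_input_(?P<date>\d{8}T\d{6})_.*\.tif")
MASK_REGEX = re.compile(r"avalanche_outlines_(?P<date>\d{8}T\d{6})_.*\.shp")


def time_range(filename, regex):
    """Return (mint, maxt) for a filename, or None if the regex does not match."""
    match = regex.match(filename)
    if match is None:
        return None
    start = datetime.strptime(match.group("date"), DATE_FORMAT)
    # Assumption: a timestamp with second resolution covers that one second.
    end = start + timedelta(seconds=1) - timedelta(microseconds=1)
    return start, end


def match_pairs(image_files, mask_files):
    """Pair every image with every mask whose time range overlaps it."""
    pairs = []
    for image in image_files:
        image_range = time_range(image, IMAGE_REGEX)
        if image_range is None:
            continue
        for mask in mask_files:
            mask_range = time_range(mask, MASK_REGEX)
            if mask_range is None:
                continue
            # Two closed intervals overlap when each starts before the other ends.
            if image_range[0] <= mask_range[1] and mask_range[0] <= image_range[1]:
                pairs.append((image, mask))
    return pairs


# Hypothetical filenames following the patterns from the issue.
images = [
    "sar_model_input_20240101T060000_a.tif",
    "sar_model_input_20240113T060000_a.tif",
]
masks = [
    "avalanche_outlines_20240101T060000_a.shp",
    "avalanche_outlines_20240125T060000_a.shp",
]

for image, mask in match_pairs(images, masks):
    print(image, "<->", mask)
```

Only the files that share a timestamp are paired here. In the real dataset the spatial extents must overlap as well, which this sketch ignores.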

adamjstewart commented 3 weeks ago

@tgoelles are you still experiencing any issues, or can we close this?