Support reading from GDAL virtual file systems (e.g. cloud storage)

microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

https://www.osgeo.org/projects/torchgeo/

MIT License


Support reading from GDAL virtual file systems (e.g. cloud storage) #1398

Open adriantre opened 1 year ago

adriantre commented 1 year ago

https://github.com/microsoft/torchgeo/blob/9e57f278188ca36348ce8d5c30d5ae2acb19107c/torchgeo/datasets/geo.py#L363-L367

GDAL virtual file systems, such as the one for reading directly from Google Cloud Storage buckets (/vsigs/), are natively supported by rasterio (through GDAL).

import rasterio

with rasterio.open("/vsigs/my_bucket/.../my_image.tif") as src:
    src.read()  # etc.

The glob matching (source code linked above) is currently the only thing preventing this in TorchGeo.

What do you think is the best way to do this? My initial guess is that supporting glob matching for all the different file systems would take some effort.

The quickest fix (for me at least) would be to add an optional parameter filenames: List[str] that is iterated over; the (already existing) try/except would handle the case where a filename is wrong.
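To illustrate why the glob step is the blocker: Python's glob module only searches the local file system, so a pattern rooted at a GDAL-style /vsigs/ prefix matches nothing, even if the objects exist in the bucket. A minimal sketch, assuming a placeholder bucket name my_bucket and a machine with no local directory called /vsigs:

```python
import glob
import os

# Hypothetical VSI root; "my_bucket" is a placeholder, not a real bucket.
root = "/vsigs/my_bucket"
pathname = os.path.join(root, "**", "*.tif")

# glob walks the local file system only, so it never queries the bucket.
matches = list(glob.iglob(pathname, recursive=True))
print(matches)
```

Under those assumptions the list comes back empty, which means the dataset index would never be populated.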

adriantre commented 1 year ago

Edit: Better proposal below.

Proposed changes:

from typing import List, Optional

class RasterDataset(GeoDataset):
    def __init__(
        self,
        ...,  # existing params
        filenames: Optional[List[str]] = None,
    ) -> None:

        ...

        # Populate the dataset index
        i = 0
        if not filenames:
            # Default behavior: recursive glob on the local file system
            pathname = os.path.join(root, "**", self.filename_glob)
            filepaths = list(glob.iglob(pathname, recursive=True))
        else:
            # Explicit list: also works for paths glob cannot see (e.g. /vsigs/)
            filepaths = [os.path.join(root, filename) for filename in filenames]
        for filepath in filepaths:
            # continue on line 366 in the original code
            ...

Here, each entry in filenames should include any subdirectories, relative to root.
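One detail that may matter for the explicit-list branch: os.path.join uses backslashes on Windows, which may not be valid inside a /vsi.../ path. A minimal sketch of an alternative that always joins with forward slashes. The helper name resolve_filepaths and the bucket and object names are hypothetical, not part of TorchGeo:

```python
import posixpath
from typing import List


def resolve_filepaths(root: str, filenames: List[str]) -> List[str]:
    """Join each filename (which may include subdirectories) onto root.

    Hypothetical helper, not part of TorchGeo. posixpath.join always uses
    forward slashes, regardless of the operating system.
    """
    return [posixpath.join(root, name) for name in filenames]


filepaths = resolve_filepaths(
    "/vsigs/my_bucket",  # placeholder bucket name
    ["scene_a/B04.tif", "scene_b/B04.tif"],  # placeholder object names
)
print(filepaths)
```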

adriantre commented 1 year ago

I just found fiona's listdir function. It does not support recursive walks, but it will list sub-blobs in virtual file systems.

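import os

import fiona
from fiona.errors import FionaValueError

def listdir_vsi_recursive(root):
    """Return all files below root, walking into sub-directories/blobs."""
    dirs = [root]
    files = []
    while dirs:
        path = dirs.pop()  # renamed from `dir` to avoid shadowing the builtin
        try:
            # fiona.listdir only succeeds on directories (or archives)
            subdirs = fiona.listdir(path)
            dirs.extend(os.path.join(path, subdir) for subdir in subdirs)
        except FionaValueError:
            # Not a directory, so treat it as a file
            files.append(path)
    return files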

class RasterDataset(GeoDataset):
    def __init__(
        self,
        ...,  # existing params
        vsi: bool = False,
    ) -> None:

        ...

        # Populate the dataset index
        i = 0
        filename_regex = re.compile(self.filename_regex, re.VERBOSE)
        if vsi:
            # Walk the virtual file system; filename_glob is not applied here
            filepaths = listdir_vsi_recursive(root)
        else:
            # Default behavior: recursive glob on the local file system
            pathname = os.path.join(root, "**", self.filename_glob)
            filepaths = list(glob.iglob(pathname, recursive=True))
        for filepath in filepaths:
            # continue on line 366 in the original code
            ...
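Note that in the vsi branch of this proposal, filename_glob is not applied, so every object under root reaches the loop (where filename_regex presumably still filters by name). If the glob filter should be kept, one option is to match base names with fnmatch during the walk. The following is a minimal sketch rather than the proposed implementation: walk_files, the injected listdir callable, and the in-memory TREE are hypothetical names chosen so the example runs without fiona or cloud access, and it assumes the listing function raises ValueError for anything that is not a directory.

```python
import fnmatch
import posixpath
from typing import Callable, Dict, List


def walk_files(
    root: str, listdir: Callable[[str], List[str]], pattern: str = "*"
) -> List[str]:
    """Iteratively walk root and return files whose base name matches pattern."""
    stack = [root]
    files = []
    while stack:
        path = stack.pop()
        try:
            children = listdir(path)
        except ValueError:
            # Assumption: listdir raises ValueError for non-directories,
            # so anything that fails to list is treated as a file.
            if fnmatch.fnmatch(posixpath.basename(path), pattern):
                files.append(path)
        else:
            stack.extend(posixpath.join(path, child) for child in children)
    return sorted(files)


# Fake bucket layout standing in for a real virtual file system.
TREE: Dict[str, List[str]] = {
    "/vsigs/my_bucket": ["a", "readme.txt"],
    "/vsigs/my_bucket/a": ["img1.tif", "img2.tif"],
}


def fake_listdir(path: str) -> List[str]:
    if path not in TREE:
        raise ValueError(f"{path} is not a directory")
    return TREE[path]


tifs = walk_files("/vsigs/my_bucket", fake_listdir, "*.tif")
print(tifs)
```

To use this with fiona, fiona.listdir could be passed as listdir, with the except clause changed to catch FionaValueError as in the function above.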

adamjstewart commented 11 months ago

Note that we technically support this in 0.5.0, although the user has to manually pass in a list of files.
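A sketch of what passing the files manually might look like. It assumes that RasterDataset in 0.5.0 takes a paths argument that accepts an explicit list of files; the helper names to_vsigs and make_dataset, the bucket name, and the object names are all hypothetical.

```python
from typing import Iterable, List


def to_vsigs(bucket: str, keys: Iterable[str]) -> List[str]:
    """Turn object names into GDAL /vsigs/ paths (hypothetical helper)."""
    return [f"/vsigs/{bucket}/{key.lstrip('/')}" for key in keys]


def make_dataset(paths: List[str]):
    # Imported lazily so the path helper above works without TorchGeo installed.
    from torchgeo.datasets import RasterDataset

    # Assumption: `paths` accepts an explicit list of files (TorchGeo >= 0.5.0).
    return RasterDataset(paths=paths)


paths = to_vsigs("my_bucket", ["scene_a/B04.tif", "scene_b/B04.tif"])
print(paths)
# dataset = make_dataset(paths)  # needs TorchGeo, GDAL credentials, and real objects
```

The list of object names still has to come from somewhere else (for example a bucket listing), since no globbing happens for explicitly passed files.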