microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

TorchData #576

Closed austinmw closed 1 year ago

austinmw commented 2 years ago

Hi, do you plan to support TorchData iterable-style and map-style datapipes in the future?

I ask because the PyTorch DataLoader V2 will eventually "only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as the shuffling and batching, will be moved out of DataLoader to DataPipe."

https://github.com/pytorch/data#frequently-asked-questions-faq
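
To illustrate what "moving shuffling and batching into the pipeline" means, here is a minimal sketch in plain Python. It does not use the TorchData API; the `shuffle` and `batch` functions are hypothetical stand-ins for pipeline stages, written as generators that wrap one another, so the loader itself would only need to iterate the final result.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle(source: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Buffered shuffle: keep up to buffer_size items and emit them in random order."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # Emit a random buffered item and put the new item in its slot.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Flush whatever is left once the source is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def batch(source: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of batch_size (the last may be shorter)."""
    current: List[T] = []
    for item in source:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


# Stages compose by wrapping: source -> shuffle -> batch.
pipeline = batch(shuffle(range(10), buffer_size=4), batch_size=3)
batches = list(pipeline)
```

Each stage only depends on the iterable it wraps, which is what makes the stages reusable across datasets.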

adamjstewart commented 2 years ago

Yes, we definitely plan to support DataPipes in the future. When I first talked to the torchvision devs, they mentioned the plan to rework their datasets to use DataPipes. At the time, the DataPipe stuff seemed too bleeding edge for us to use directly, but things are definitely more stable now. I need to take another look and see just how different things are.

austinmw commented 2 years ago

That's awesome, really glad to hear!

adamjstewart commented 2 years ago

Still need to dig deeper into how TorchData works and how torchvision is planning to migrate to TorchData, but I think this will be a good opportunity to refactor.

Right now, we have two class/subclass hierarchies:

I think it would make more sense to do something like:

If I understand correctly, this seems to be the intention of TorchData, to create pluggable pipelines for each file format to improve reuse and avoid code duplication.

adamjstewart commented 2 years ago

Another area where TorchData may help: we have a lot of datasets that can either be loaded from files on local disk, or streamed from a STAC API like on the Planetary Computer. I believe that was one of the main driving factors behind TorchData, so I'm interested to see if they've found a good way to have a single dataset that optionally loads from different sources like this.
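
As a rough picture of a single dataset with interchangeable sources, here is a plain-Python sketch. It is not something TorchData or TorchGeo provides: `from_local`, `from_remote`, `to_samples`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever client would actually download an asset (for example from a STAC API).

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator


def from_local(root: Path, pattern: str = "*.tif") -> Iterator[bytes]:
    """Hypothetical source: raw bytes of matching files on local disk."""
    for path in sorted(root.glob(pattern)):
        yield path.read_bytes()


def from_remote(urls: Iterable[str], fetch: Callable[[str], bytes]) -> Iterator[bytes]:
    """Hypothetical source: raw bytes obtained by calling fetch(url) for each URL."""
    for url in urls:
        yield fetch(url)


def to_samples(raw_items: Iterable[bytes]) -> Iterator[Dict[str, int]]:
    """Shared downstream stage; a real dataset would decode the raster here."""
    for raw in raw_items:
        yield {"num_bytes": len(raw)}


# Local source: two small placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.tif").write_bytes(b"abc")
    (root / "b.tif").write_bytes(b"defgh")
    local_samples = list(to_samples(from_local(root)))

# Remote source: an in-memory dict plays the role of the remote store.
fake_store = {
    "https://example.com/a.tif": b"abc",
    "https://example.com/b.tif": b"defgh",
}
remote_samples = list(to_samples(from_remote(fake_store, fake_store.__getitem__)))
```

Only the first stage differs between the two cases; everything after it is shared, which is the kind of reuse being asked about.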

adamjstewart commented 2 years ago

Looked through the documentation a bit. From what I can tell, the refactoring I described in my first comment is definitely supported by TorchData. I opened an issue to ask whether the multi-source loading from my second comment is or could be supported as well: https://github.com/pytorch/data/issues/672

austinmw commented 2 years ago

@adamjstewart One more format that might be good to support down the line is a simple tar-based iterable format (like webdataset, but using only torchdata). For your second comment, I wonder if you're looking for something like AIStore with the torchdata loaders?

https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader
https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#torchdata.datapipes.iter.AISFileLister

adamjstewart commented 1 year ago

Seems like TorchData is dead: https://github.com/pytorch/data/issues/1196