pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

'download=True' condition for more than 1 SBD dataset stops code cause of shutil #3316

Open Jasonlee1995 opened 3 years ago

Jasonlee1995 commented 3 years ago

Using download=True for more than one SBD dataset stops the code because of shutil

I always set up my datasets and dataloaders as below:

image

But this time, working with the SBD dataset, I got stuck as below:

image

I looked at the torchvision dataset source code and documentation, and I think it would be more helpful and friendlier to other users if

image

image

I know it's trivial, but I hope others like me won't suffer from this :)

cc @pmeier

pmeier commented 3 years ago

Hey @Jasonlee1995 thanks for reporting this.

While your code works for datasets.VOCSegmentation, it is inefficient. We have good integrity checks for a small number of files (say, the downloaded archives), but not for a large folder of images or annotations. Thus, we normally raise an error if we encounter an already extracted folder with download=True. For example, datasets.Places365:

https://github.com/pytorch/vision/blob/7992eb5da9c2e67469734e43f3d07e242d4f5273/torchvision/datasets/places365.py#L34-L36

https://github.com/pytorch/vision/blob/7992eb5da9c2e67469734e43f3d07e242d4f5273/torchvision/datasets/places365.py#L145-L150

Unfortunately, neither VOC* nor SBDataset does this. This means they happily re-extract the archive every time you construct them with download=True.
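As a rough illustration of the guard described above, here is a minimal standard-library sketch. It is not the torchvision implementation: `extract_archive_once` and its arguments are hypothetical names, and it only shows the idea of refusing to re-extract when the extracted folder already exists.

```python
import os
import shutil


def extract_archive_once(archive_path: str, target_dir: str) -> None:
    """Hypothetical helper: extract an archive only if target_dir is absent.

    A large extracted folder is hard to verify, so instead of silently
    re-extracting over it, we raise and let the user decide.
    """
    if os.path.exists(target_dir):
        raise RuntimeError(
            f"The directory {target_dir} already exists. "
            "Delete it if you want to re-extract the archive."
        )
    # shutil.unpack_archive infers the format from the file extension.
    shutil.unpack_archive(archive_path, target_dir)
```

With a guard like this, a second call for the same target fails with a clear message instead of repeating the extraction.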

IMO we should fix VOC* and SBDataset (better yet: any dataset that relies on folders of data) to also raise this error. @fmassa ? This will probably resolve itself after we integrate https://github.com/pytorch/pytorch/issues/49440 and we can read from archives directly.

@Jasonlee1995 In any case you should only call the dataset constructor with download=True once.
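One way to follow this advice on the user side is sketched below. `build_splits` is a hypothetical helper, not part of torchvision, and the `marker` folder name `"img"` is an assumption about what the extracted data looks like under `root`; adjust it to your layout.

```python
import os
from typing import Any, Callable, Dict, Sequence


def build_splits(
    factory: Callable[..., Any],
    root: str,
    image_sets: Sequence[str],
    marker: str = "img",  # assumed name of a folder that exists once data is extracted
) -> Dict[str, Any]:
    """Construct one dataset per split, passing download=True at most once."""
    # Only request a download if the extracted data is not already in place.
    need_download = not os.path.exists(os.path.join(root, marker))
    datasets: Dict[str, Any] = {}
    for image_set in image_sets:
        datasets[image_set] = factory(root, image_set=image_set, download=need_download)
        # Every later split reuses the files fetched by the first call.
        need_download = False
    return datasets
```

Assuming torchvision is installed, this could be used as `build_splits(torchvision.datasets.SBDataset, "./data/sbd", ["train", "val"])`, so only the first constructor call ever sees download=True.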