'download=True' condition for more than 1 SBD dataset stops code cause of shutil

Hey @Jasonlee1995 thanks for reporting this.

While your code works for datasets.VOCSegmentation, it is inefficient. We have good integrity checks for a small number of files (say, the downloaded archives), but not for a large folder of images / annotations. Thus, we normally raise an error if we encounter an already extracted folder while download=True. See, for example, datasets.Places365:

https://github.com/pytorch/vision/blob/7992eb5da9c2e67469734e43f3d07e242d4f5273/torchvision/datasets/places365.py#L34-L36

https://github.com/pytorch/vision/blob/7992eb5da9c2e67469734e43f3d07e242d4f5273/torchvision/datasets/places365.py#L145-L150

Unfortunately, neither VOC* nor SBDataset does this. This means they happily re-extract the archive every time you construct them with download=True.
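To make the idea concrete, here is a minimal sketch of such a guard. It is not torchvision's actual code: `extract_once` is a hypothetical helper, and the error message is only modeled on the behavior described above.

```python
import os
import shutil


def extract_once(archive_path, target_dir):
    """Hypothetical helper: extract an archive, but refuse to overwrite.

    We cannot cheaply verify the integrity of a large extracted folder,
    so an existing target directory is treated as an error instead of
    being silently re-extracted.
    """
    if os.path.exists(target_dir):
        raise RuntimeError(
            f"The directory {target_dir} already exists. "
            "Delete it if you want to re-extract the archive."
        )
    # shutil.unpack_archive picks the format from the file extension.
    shutil.unpack_archive(archive_path, target_dir)
```

With a guard like this, a second construction with download=True would fail early with a clear message rather than somewhere inside shutil.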

IMO we should fix VOC* and SBDataset (better yet: any dataset that relies on folders of data) to also raise this error. @fmassa? This will probably resolve itself once we integrate https://github.com/pytorch/pytorch/issues/49440 and can read from archives directly.

@Jasonlee1995 In any case, you should only call the dataset constructor with download=True once.
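If you need several splits, one possible way to follow that advice is sketched below. `build_splits` and its `make_dataset` callback are hypothetical names introduced only for illustration; they are not part of torchvision.

```python
def build_splits(make_dataset, splits, first_run):
    """Hypothetical helper: build one dataset per split.

    download=True is passed at most once (for the first split, and only
    on the first run), so the archive is never extracted a second time.
    """
    datasets = {}
    for index, split in enumerate(splits):
        download = first_run and index == 0
        datasets[split] = make_dataset(split, download=download)
    return datasets
```

Assuming torchvision is installed, the callback could be something like `lambda split, download: SBDataset("data", image_set=split, download=download)`, called with `splits=["train", "val"]` and `first_run=True` only the first time the script runs.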

pytorch/vision issue #3316