pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Questions about prototype builtin datasets using `torchdata` #7609

Closed ain-soph closed 1 year ago

ain-soph commented 1 year ago

Hi all, I'm currently exploring builtin datasets with new standards:
https://github.com/pytorch/vision/blob/main/torchvision/prototype/datasets

Let's take Cifar10 as an example. I have several questions:

  1. Why are all datasets constructed as iter-style rather than map-style? When I have an index (e.g., 2331), I can no longer use dataset[2331] like with the old CIFAR10.
    In that case, how do I get an item by index from the new-format dataset? Do I have to use IterToMapConverter? That would be quite strange, because the raw data format is map-style: I would turn it into iter-style and then traverse it just to convert it back to map-style.
  2. What does hint_shuffling do?
    def hint_shuffling(datapipe: IterDataPipe[D]) -> Shuffler[D]:
        return Shuffler(datapipe, buffer_size=INFINITE_BUFFER_SIZE).set_shuffle(False)

    It's used in all prototype datasets. It seems to wrap the datapipe in a Shuffler but then calls set_shuffle(False). Doesn't that do nothing?

  3. When should I use Decompressor, and when should I set resource.preprocess='decompress' or 'extract'?
    What's the difference between Decompressor, resource.preprocess='decompress', resource.preprocess='extract', and using nothing?
    • The Cifar10 resource is a cifar-10-python.tar.gz and sets nothing. By default it will call _guess_archive_loader in OnlineResource.load to generate a TarArchiveLoader
    • The MNIST resource is a train-images-idx3-ubyte.gz and uses a Decompressor
    • The cub200 resource is a CUB_200_2011.tgz and uses decompress=True
  4. How do I use a Transform, such as AutoAugment or RandomCrop, with the new dataset API? I'm especially wondering about ToTensor, or transforms.PILToTensor() plus transforms.ConvertImageDtype(torch.float), since the prototype datasets return uint8 tensors. From the Transforms V2 tutorial page, I gather that the transform is no longer embedded in the Dataset, since it doesn't accept transform or target_transform args. Then how can I fetch augmented data from the DataLoader?
  5. For datasets where each image is stored in an encoded image format (the old ImageFolder style, e.g., ImageNet, GTSRB), the output image type is EncodedImage -> EncodedData -> Datapoint. For datasets stored in binary (e.g., MNIST and CIFAR), the output image type is Image -> Datapoint. Why are they different? I see that most Transforms V2 APIs operate on Image. Why is EncodedImage used here?
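For context on point 4: since the prototype datasets hand back an IterDataPipe rather than accepting transform=/target_transform= args, one workaround is to attach the transform with .map() on the pipe itself. This is a minimal, hypothetical sketch using only core PyTorch datapipes; IterableWrapper stands in for a real prototype dataset, and to_float stands in for something like transforms.ConvertImageDtype.

```python
# Hedged sketch: apply a transform by mapping over the datapipe, since the
# prototype datasets take no transform= argument. Uses only core PyTorch
# datapipes; IterableWrapper is a stand-in for a real prototype dataset.
from torch.utils.data.datapipes.iter import IterableWrapper

def to_float(sample):
    # stand-in for transforms.ConvertImageDtype(torch.float) on a uint8 image
    return {"image": sample["image"] / 255.0, "label": sample["label"]}

dp = IterableWrapper([{"image": 255, "label": 0}, {"image": 0, "label": 1}])
dp = dp.map(to_float)  # Mapper datapipe; a DataLoader would iterate the result

print(list(dp))  # → [{'image': 1.0, 'label': 0}, {'image': 0.0, 'label': 1}]
```

The same pattern should work for any callable, including Transforms V2 composites, as long as the callable matches the sample structure the dataset yields.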
NicolasHug commented 1 year ago

Hi @ain-soph and sorry for the silence... I was kind of waiting for this to be finally announced officially: https://github.com/pytorch/data/#torchdata-see-note-below-on-current-status

I'll try to provide very brief answers to your questions below

  1. It's never been clear to me why map-style datapipes even exist
  2. the shuffling hint makes sure shuffling happens where it needs to happen if users set shuffle=True in the dataloader. It's pretty awful, but shuffling absolutely needs to happen before sharding (https://github.com/pytorch/data/issues/302) and this is the only way we found to prevent users from shooting themselves in the foot.
  3. Honestly, IDK. I don't recommend relying on this
  4. How to use Transform in the new dataset API? Don't - we're not gonna release those datasets anytime soon
  5. Why is EncodedImage used here? Not sure honestly, probably relics of past designs that we haven't updated
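The shuffle-before-shard ordering from answer 2 can be illustrated with a small, hypothetical sketch on core PyTorch datapipes: hint_shuffling inserts the Shuffler at the right spot in the pipeline but leaves it disabled (set_shuffle(False)) until the DataLoader switches it on via shuffle=True.

```python
# Hedged sketch of the shuffle-before-shard ordering described above, using
# only core PyTorch datapipes. IterableWrapper stands in for a dataset pipe.
from torch.utils.data.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))
# The Shuffler must come before the ShardingFilter, so each worker draws its
# shard from an already-shuffled stream; this is what hint_shuffling pins down.
dp = dp.shuffle().sharding_filter()

# With no sharding configured, all elements pass through (in random order).
print(sorted(dp))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If the order were reversed (shard first, then shuffle), each worker would only ever shuffle within its own fixed shard, which is the foot-gun the hint prevents.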