Standardization of the datasets

pmeier commented 5 years ago

This is a discussion issue which was kicked of by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the achieved consensus during the discussion.

Disclaimer: I have never worked with segmentation or detection datasets. If I make same wrong assumption regarding them, feel free to correct me. Furthermore, please help me to fill in the gaps.

Proposed Structure

This issues presents the idea to standardize the torchvision.datasets. This could be done by adding parameters to the VisionDataset (split) or by subclassing it and add task specific parameters (classes or class_to_idx) to the new classes. I imagine it something like this:

import torch.utils.data as data

class VisionDataset(data.Dataset):
    pass

class ClassificationDataset(VisionDataset):
    pass

class SegmentationDataset(VisionDataset):
    pass

class DetectionDataset(VisionDataset):
    pass

For our tests we could then have a generic_*_dataset_test as is already implement for ClassificationDatasets.

`VisionDataset`

As discussed in #1067 we could unify the argument that selects different parts of the dataset. IMO split as a str is the most general, but still clear term for this. I would implement this as positional argument within the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are the Fakedata and ImageFolder datasets, which will be discussed separately.
IMO every dataset should have a _download method in order to be useful for every user of this package. We could have the constructor have download=True as keyword argument and call the download method within it. As above, the Fakedata and ImageFolder datasets will be discussed below.

`Fakedata` and `ImageFolder`

What makes these two datasets special, is that there is nothing to download and they are not splitted in any way. IMO they are not special enough to not generalise the VisionDataset as stated above. I propose that we simply remove the split and download argument from their constructor and raise an exception if someone calls the download method.

Furthermore the Fakedata dataset is currently a ClassificationDataset. We should also create a FakeSegmentationData and a FakeDetectionData dataset.

`ClassificationDataset`

The following datasets belong to this category: CIFAR*, ImageNet, *MNIST, SVHN, LSUN, SEMEION, STL10, USPS, Caltech*

Each dataset should return PIL.Image, int if indexed
Each dataset should have a classes parameter, which is a tuple with all available classes in human-readable form
Currently, some datasets have a class_to_idx parameter, which is dictionary that maps the human-readable class to its index used as target. I propose to change the direction, i.e. create a idx_to_class parameter, since IMO this is the far more common transformation.

`SegmentationDataset`

The following datasets belong to this category: VOCSegmentation

`DetectionDataset`

The following datasets belong to this category: CocoDetection, VOCDetection

ToDo

The following datasets need sorting into the three categories: ~Caltech101~, ~Caltech256~, CelebA, CityScapes, Cococaptions, Flickr8k, Flickr30k, ~LSUN~, Omniglot, PhotoTour, SBDataset (shouldn't this be just called SBD?), SBU, ~SEMEION~, ~STL10~, and ~USPS~
~Add some common arguments / parameters for the SegmentationDataset and DetectionDataset~

Thoughts and suggestions?

fmassa commented 5 years ago

Thanks a lot @pmeier for starting this discussion!

Here are my general thoughts:

if we do not yet have enough datasets that fit in Segmentation / Detection dataset, then it might make sense to postpone implementing such abstractions, as they will lack flexibility as they will be tailored to 1 / 2 datasets.
split as positional argument: this would be a breaking change for CocoDetection, and potentially others. Also, what if a dataset doesn't have a canonical train / test split?
download for every dataset: agree, even though this might not be possible for some datasets. But we should strive to have as much coverage as possible

Some more specific thoughts:

`ClassificationDataset`

The bulk of the datasets currently available are in fact classification datasets. For example, LSUN SEMEION, STL10, USPS are all classification datasets. Caltech101 and Caltech256 are mostly used for classification as well, even though they have more information available.

About each dataset having a classes parameter, I wonder if this shouldn't instead be a method that returns a dict with a number of fields (with standardized names as much as possible). For example

def get_metadata(self):
    return dict(classes=[0, 1, 2], boxes=[], ...)

Not sure if this is the right design though, but I was thinking about what other fields we might want to return. For example, the TF datasets have an _info method that returns a number of useful information about the dataset, and we could get some inspiration from it. Their need for this is different than ours though, as they need the features attribute to know how to decode their TFRecords structure.

Also worth raising the visibility of class_to_idx or alike, because it has been the source of a few user issues, like https://github.com/pytorch/vision/issues/714.

Thoughts?

Also, cc @zhangguanheng66 and @cpuhrsch as we might want to have some similarities on the dataset APIs between text and vision

pmeier commented 5 years ago

if we do not yet have enough datasets that fit in Segmentation / Detection dataset, then it might make sense to postpone implementing such abstractions, as they will lack flexibility as they will be tailored to 1 / 2 datasets.

Agreed.

split as positional argument: this would be a breaking change for CocoDetection, and potentially others.

Question(s):

What is the standard way of working with COCODetection? Does everyone simply define their own splits? How can you compare models between users?

Or is this still in challenge mode? In that case we need to decide if we want to have datasets with a single split or potentially a split without targets (see #900).

Also, what if a dataset doesn't have a canonical train / test split?

What do you mean by that? I would define a split as subset of the whole dataset. IMO each split should have the same structure.

download for every dataset: agree, even though this might not be possible for some datasets. But we should strive to have as much coverage as possible

Can you think of a dataset besides ImageFolder and FakeData that can't / shouldn't have this method? If someone has a local dataset that he doesn't want to publish, I think its reasonable to expect him to not set download=True.

About each dataset having a classes parameter, I wonder if this shouldn't instead be a method that returns a dict with a number of fields (with standardized names as much as possible). For example
def get_metadata(self):
    return dict(classes=[0, 1, 2], boxes=[], ...)
[...] Also worth raising the visibility of class_to_idx or alike, because it has been the source of a few user issues, like #714.

I like the idea of classes and idx_to_class as parameter better, since every ClassificationDataset should have these. We could do get_additional_metadata() for everything else (e.g. boxes for Caltech).

As mentioned in #1086 should we protect all internal parameters (e.g. data for CIFAR or SVHN) to alert users to not rely on them since they might change?

zhangguanheng66 commented 5 years ago

I would like to be very careful about adding some functions in VisionDataset, though TorchText has split/download/splits functions in the abstract dataset (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L16). IMO, a simple abstract dataset will give you more flexibility.

We are now testing a new pattern for TorchText. Instead of using abstract dataset (mentioned above), I move some functions to the utils and keep them orthogonal. For example, TorchText.utils.download function. We expect some feedbacks to the new pattern later before we make some significant changes to the abstract dataset.

fmassa commented 5 years ago

@pmeier

What is the standard way of working with COCODetection? Does everyone simply define their own splits? How can you compare models between users?

Each COCO has a different annotation file for train / val, and that is the second argument that we pass, and not split. So the ann_file plays the role of split

What do you mean by that? I would define a split as subset of the whole dataset. IMO each split should have the same structure.

Some datasets do not define a train / test split. So it's up to the user to decide how to split it, for example using torch.utils.data.random_split. Thus, having a split argument to the dataset doesn't apply there.

Can you think of a dataset besides ImageFolder and FakeData that can't / shouldn't have this method? If someone has a local dataset that he doesn't want to publish, I think its reasonable to expect him to not set download=True.

I mean, sometimes the user need to login / authenticate to download certain datasets (see https://github.com/pytorch/vision/issues/1042#issuecomment-508502833 for an example), so we wouldn't be able to provide a download functionality for those.

I like the idea of classes and idx_to_class as parameter better, since every ClassificationDataset should have these. We could do get_additional_metadata() for everything else (e.g. boxes for Caltech).

Maybe, but I still want to double-check a bit before.

cc @aadcock what are your thoughts on this in general?

pmeier commented 5 years ago

IMO, in the end it comes down to two philosophies since there are so many different datasets that we cannot foresee every possible option:

We have a rudimentary abstract dataset that allows a high degree of flexibility. Downside of that flexibility is that contributions need to be manually maintained to keep some standardisation.
We have a more complex abstract dataset that bundles functionality that most datasets possess. Downside of this is that we might need some creative solutions for the edge cases.

I would argue that datasets without split or datasets that cannot be downloaded are edge cases like ImageFolder or FakeData that need special treatment. IMO they are not important enough that the normal datasets miss out on this functionality. I think its not a problem to define something like this in those cases:

def _download():
    raise RuntimeError("Dataset cannot be downloaded")

To cite the Python Zen:

Special cases aren't special enough to break the rules. Although practicality beats purity.

Right now we roll with option 1. and as a result of the heterogeneity we started this discussion about standardisation.

fmassa commented 5 years ago

@pmeier one core principle of torchvision (and PyTorch as well) is that most things should be easily understandable. So if a user wants to create / modify a new dataset, it is very easy for them to do so by reading the source code, without having to jump over a lot of abstractions.

I do think that having the right level of abstractions is a desirable thing, but we should carefully design it so that it is not a blocker for us in the future.

And it's easier to add something than to remove it, so before making all datasets have a split / download argument to it, I'd rather wait a bit more and see what the others have to say.

One could say that ideally, we would want to have a common API like

d1 = torchvision.datasets.get('coco', 'train2017', transform, download=True)
d2 = torchvision.datasets.get('voc-detection', 'train', transform)
d3 = torchvision.datasets.get('imagenet', 'train2012', transform)
d4 = torchvision.datasets.get('mnist', 'train', transform)
...

all with the same high-level API. But this makes some assumptions, and maybe we might want to enforce those assumptions/limitations at a higher level than the Dataset API.

Sure, this doesn't address what attributes each dataset should have, and that's one of the points that I want to discuss in this issue.

zhangguanheng66 commented 5 years ago

@fmassa IMO, the abstract dataset could include at least data and transforms (field for torchtext) and full implementation of getitem, len. Those components are highly repetitive across most dataset (if my understanding is correct). something like this:

`

class TextDataset(torch.utils.data.Dataset):

    def __init__(data, fields):
        self.data = data
        self.fields = fields

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

`

We could carefully add more stuffs later on as the abstract class evolves.

pmeier commented 5 years ago

@zhangguanheng66 transforms as well as the root of the dataset are already part of the abstract dataset class called VisionDataset. I agree on the other part. We could do something like this:

class VisionDataset(...):
    ...

    def __getitem__(self, index):
        image, target = self._images[index], self.targets[index]
        if self.transform is not None:
            image, target = self.transform(image, target)
        return image, target

    def __len__(self):
        return len(self._images)

    @property
    def _images(self):
        raise NotImplementedError

    @property
    def _targets(self):
        raise NotImplementedError

zhangguanheng66 commented 5 years ago

@pmeier Sure, I agree. I'm just wondering the pros and cons to include data (e.g. images, targets) in the abstract dataset class. I guess it won't hurt us, right?

aadcock commented 5 years ago

I'm late to the party here and have comments on a few different pieces of this thread. Here are a few thoughts I've collected about things that I see commonly requested:

1) I also agree that @zhangguanheng66's suggestion is a good place to start since almost every dataset has these options. 2) Already mentioned, a get_metadata call of some sort on the dataset is one of the most requested items I get from the scientists I work with so that they can write code that reacts to the data that a dataset returns without having to retrieve samples or build / download the dataset. Standardizing what this returns is a bit tricky, but the call is pretty straightforward. 3) A lot of users ask about creating a canonical way to attach a sample id to each sample to use for debugging or logging information / features for a sample. This also feels like a pretty general use-case. It's not hard to create your own, but if there was a standard way of doing this, it would make it easier to write reusable code...I've solved this in the past with returning samples as dicts with fields like, "input" , "target", "metadata", but there are downstream problems with some of the composable dataset APIs. 4) @pmeier, thanks for the initial thoughts! I think they are really useful. I do think that ImageFolder is special for more than just the lack of a download. It provides an API to a datastore which could be different for each instance while ClassificationDataset et al are more like interfaces for the underlying datasets to implement. I agree they are both datasets / can be viewed in a similar framework, but my point here is that when someone is using an ImageFolder they are frequently doing something a bit different than when a user would be building a ClassificationDataset (and I think it's actually pretty common if you have any sort of proprietary data). You could even imagine using an ImageFolder to build a classification dataset, for example. 5) A minor point about the ClassificationDataset / other abstractions. A common use case that is not mentioned above with the get_metadata function / Classification datasets is that we might want to allow multi-class datasets (so you would return multiple ints). Admittedly, this quickly gets into attempting to cover every type of dataset, but this is something I work with a lot.

pmeier commented 5 years ago

Since there has been no more input for some days, can I regard the proposal of @zhangguanheng66 as consensus and thus add it to the top post?

fmassa commented 5 years ago

@pmeier we are still discussing about this internally, and we still seem to have no consensus on how to structure it yet.

koenvandesande commented 5 years ago

How does this approach scale to different tasks/datasets with multiple tasks? On Coco, you can do detection, human keypoint detection and segmentation (of either objects, stuff with CocoStuff or everything toghether with panoptic segmentation).

Another part of "the Python way" is duck typing: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck". No need to have a common superclass to implement an interface, as long as the required parts are there.

zhiqwang commented 5 years ago

For multiple tasks, may we choose the implementation of fairseq, where have a Standardization of the tasks ?

ahmed-shariff commented 5 years ago

Adding to the conversation, I also have been trying to come up with a solution such that I can use a similar interface to load different types of data. The structure that I have for now is https://gist.github.com/ahmed-shariff/dbf3bee6fa2bf4924f95ac2b65620386, which I am trying to include in my pipeline. The idea is I am trying to go a step higher. Based on the dataset and how it's used, I'd implement the train_data_load_function and Dataset classes, which will be used to load and provide the data in a uniform way. As @aadcock points out, How the data is to be returned is entirely dependent on how it'll be used.

kirk86 commented 5 years ago

This standardization will make writing code so much cleaner, now it's like every constructor of every dataset has its own params and makes things so unflexible during development. For instance cifar10 has the train param indicating which subset is used while svhn has split. My two cents are to draw inspiration from the flexibility and modularity of the linux system. For instance if for some datasets there's no download possible due to login or other things then allow the user to make their own download method and hook that to the DataLoader. But, the naming convention should be the same across all of them, either that be it loader.dataset.data and loader.dataset.targets or something else. The most sensible convention for constructor arguments in order to split the dataset would have been train=True/False if a dataset doesn't have split then whenever train=False is invoked raise a warning to the user to let them know that there's not split and they'll have to do their own split.

wolterlw commented 4 years ago

Any progress on this topic? I wanted to add wrappers for several pose estimation datasets (one of the popular tasks not mentioned here) because so far every research team builds them differently and it's quite a nightmare out there.
I support @ahmed-shariff's argument that we need the ability to specify data format for a specific task. This is especially relevant when working with 3d perception of videos, where each sample is not {image, class_id} but a batch with camera parameters that are batch-specific and frame-specific keypoint locations.
I'm not sure if that's something to be solved using a special Sampler perhaps, so far I build my own Datasets to only handle indexing and data loading in done with transforms, which is quite versatile.

fmassa commented 4 years ago

@wolterlw there has been no further progress on this topic for now.

As you point out, each task has its own particularities and there is no single abstraction (more specific than just Dataset) that can cover them all.

My current thinking is that there should be not much structure that is imposed by default -- the more structure we add, the less flexible we will be. Worse still, trying to add elaborated abstractions to cover many cases will just make the entry-cost to the library higher.

One option that I considered at some point was just to let the datasets from torchvision return the raw data as they are stored originally. This gives full flexibility on how the user treats it. Then, each different pipeline / application can have a specific transform / TransformDataset which converts the data into the desired format.

Something in the lines of

class MyDataset:
    def __getitem__(self, idx):
        ...
        return dict(image=image, image_id=image_id, params=params)

class MyDatasetFormatter:
    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, idx):
        data = self.dataset[idx]
        return data['image'], data['params']

One reason to do so is because the output format of the dataset is tightly coupled with the training loop that you have. And because torchvision doesn't provide training loops in the library itself (but only reference training scripts), I haven't moved forward with this DatasetTransform approach.

Let me know what you think

ahmed-shariff commented 4 years ago

@fmassa I agree with that. The code I had shared before actually was built around that idea (in retrospect, that code is a little wonky because I am trying to do alot of things at the same time). How the data is loaded needs to be decoupled from how the data is used by a pipeline.

ahmed-shariff commented 4 years ago

Does something along these lines make sense:

dataset_foo = DatasetFoo()

foobar_pipeline_train_dataset = BarPipelineDataFormatter(dataset_foo.train_data(), mode=TRAIN)
foobar_pipeline_test_dataset = BarPipelineDataFormatter(dataset_foo.test_data(), mode=TEST)

wolterlw commented 4 years ago

@fmassa I've mentioned in #963 that using a Dataset subclass to simply index all the files and offload the rest of the work to Transforms seems to be reasonably flexible.
Also I want to stress that the problem right now is that most research projects that utilize public datasets write their own wrappers (which vary a lot in readability and speed of execution) as there are no general guidelines on how to build those and there's no single place to look if somebody else has done the work earlier.

It's unreasonable to spend this much time rewriting pretty much the same functionality. I've started a discussion on PyTorch forum, but it didn't really lead anywhere yet. I propose to create an addition to Hub or even a simple github repo that would house wrappers for research datasets

.
├── dataset1
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── dataset2
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── dataset3
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── CONTRIBUTING
└── README

Each dataset's README would tell you:

how to download the dataset
what is the expected directory structure
how annotations are structured and encoded
provide visualizations of available annotations

transforms.py should provide dataset-specific transforms and viz_tools.py - functions to visualize the annotation.

Point being there should be a single place where people can look for dataset wrappers that are ready to be used, because in my practice it takes a lot of time to build those for yourself.

Let me know what you think

tyleryzhu commented 3 years ago

I also recently came across an inconsistency between datasets which is slightly annoying, and figured I'd mention it here instead of in a new issue (but happy to do so if it's easier). Currently, the targets of CIFAR-10 and CIFAR-100 are of type list, while the targets of MNIST are of type torch.Tensor (compare the CIFAR code with the MNIST code). It would be good to standardize the types all as torch.Tensor.

pmeier commented 3 years ago

@tyleryzhu You are right that this is inconsistent, but we never aimed for consistency at the attribute level. The target attribute is not documented anywhere so we simply failed to make this private. The return type of __getitem__ should be int in both cases, i.e. isinstance(dataset[idx][1], int) should hold for both datasets.

pytorch / vision