Open pmeier opened 5 years ago
Thanks a lot @pmeier for starting this discussion!
Here are my general thoughts:
Segmentation
/ Detection
dataset, then it might make sense to postpone implementing such abstractions, as they will lack flexibility as they will be tailored to 1 / 2 datasets.CocoDetection
, and potentially others. Also, what if a dataset doesn't have a canonical train / test split?download
for every dataset: agree, even though this might not be possible for some datasets. But we should strive to have as much coverage as possibleSome more specific thoughts:
ClassificationDataset
The bulk of the datasets currently available are in fact classification datasets.
For example, LSUN
SEMEION
, STL10
, USPS
are all classification datasets.
Caltech101
and Caltech256
are mostly used for classification as well, even though they have more information available.
About each dataset having a classes
parameter, I wonder if this shouldn't instead be a method that returns a dict with a number of fields (with standardized names as much as possible).
For example
def get_metadata(self):
return dict(classes=[0, 1, 2], boxes=[], ...)
Not sure if this is the right design though, but I was thinking about what other fields we might want to return. For example, the TF datasets have an _info
method that returns a number of useful information about the dataset, and we could get some inspiration from it. Their need for this is different than ours though, as they need the features
attribute to know how to decode their TFRecords structure.
Also worth raising the visibility of class_to_idx
or alike, because it has been the source of a few user issues, like https://github.com/pytorch/vision/issues/714.
Thoughts?
Also, cc @zhangguanheng66 and @cpuhrsch as we might want to have some similarities on the dataset APIs between text and vision
- if we do not yet have enough datasets that fit in
Segmentation
/Detection
dataset, then it might make sense to postpone implementing such abstractions, as they will lack flexibility as they will be tailored to 1 / 2 datasets.
Agreed.
- split as positional argument: this would be a breaking change for
CocoDetection
, and potentially others.
Question(s):
What is the standard way of working with COCODetection
? Does everyone simply define their own split
s? How can you compare models between users?
Or is this still in challenge mode? In that case we need to decide if we want to have datasets with a single split
or potentially a split without target
s (see #900).
Also, what if a dataset doesn't have a canonical train / test split?
What do you mean by that? I would define a split
as subset of the whole dataset. IMO each split
should have the same structure.
download
for every dataset: agree, even though this might not be possible for some datasets. But we should strive to have as much coverage as possible
Can you think of a dataset besides ImageFolder
and FakeData
that can't / shouldn't have this method? If someone has a local dataset that he doesn't want to publish, I think its reasonable to expect him to not set download=True
.
About each dataset having a
classes
parameter, I wonder if this shouldn't instead be a method that returns a dict with a number of fields (with standardized names as much as possible). For exampledef get_metadata(self): return dict(classes=[0, 1, 2], boxes=[], ...)
[...] Also worth raising the visibility of
class_to_idx
or alike, because it has been the source of a few user issues, like #714.
I like the idea of classes
and idx_to_class
as parameter better, since every ClassificationDataset
should have these. We could do get_additional_metadata()
for everything else (e.g. boxes for Caltech
).
As mentioned in #1086 should we protect all internal parameters (e.g. data
for CIFAR
or SVHN
) to alert users to not rely on them since they might change?
I would like to be very careful about adding some functions in VisionDataset, though TorchText has split/download/splits functions in the abstract dataset (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L16). IMO, a simple abstract dataset will give you more flexibility.
We are now testing a new pattern for TorchText. Instead of using abstract dataset (mentioned above), I move some functions to the utils and keep them orthogonal. For example, TorchText.utils.download function. We expect some feedbacks to the new pattern later before we make some significant changes to the abstract dataset.
@pmeier
What is the standard way of working with COCODetection? Does everyone simply define their own splits? How can you compare models between users?
Each COCO has a different annotation file for train
/ val
, and that is the second argument that we pass, and not split
. So the ann_file
plays the role of split
What do you mean by that? I would define a split as subset of the whole dataset. IMO each split should have the same structure.
Some datasets do not define a train / test split. So it's up to the user to decide how to split it, for example using torch.utils.data.random_split
. Thus, having a split
argument to the dataset doesn't apply there.
Can you think of a dataset besides ImageFolder and FakeData that can't / shouldn't have this method? If someone has a local dataset that he doesn't want to publish, I think its reasonable to expect him to not set download=True.
I mean, sometimes the user need to login / authenticate to download certain datasets (see https://github.com/pytorch/vision/issues/1042#issuecomment-508502833 for an example), so we wouldn't be able to provide a download
functionality for those.
I like the idea of classes and idx_to_class as parameter better, since every ClassificationDataset should have these. We could do get_additional_metadata() for everything else (e.g. boxes for Caltech).
Maybe, but I still want to double-check a bit before.
cc @aadcock what are your thoughts on this in general?
IMO, in the end it comes down to two philosophies since there are so many different datasets that we cannot foresee every possible option:
I would argue that datasets without split
or datasets that cannot be download
ed are edge cases like ImageFolder
or FakeData
that need special treatment. IMO they are not important enough that the normal datasets miss out on this functionality. I think its not a problem to define something like this in those cases:
def _download():
raise RuntimeError("Dataset cannot be downloaded")
To cite the Python Zen:
Special cases aren't special enough to break the rules. Although practicality beats purity.
Right now we roll with option 1. and as a result of the heterogeneity we started this discussion about standardisation.
@pmeier one core principle of torchvision (and PyTorch as well) is that most things should be easily understandable. So if a user wants to create / modify a new dataset, it is very easy for them to do so by reading the source code, without having to jump over a lot of abstractions.
I do think that having the right level of abstractions is a desirable thing, but we should carefully design it so that it is not a blocker for us in the future.
And it's easier to add something than to remove it, so before making all datasets have a split
/ download
argument to it, I'd rather wait a bit more and see what the others have to say.
One could say that ideally, we would want to have a common API like
d1 = torchvision.datasets.get('coco', 'train2017', transform, download=True)
d2 = torchvision.datasets.get('voc-detection', 'train', transform)
d3 = torchvision.datasets.get('imagenet', 'train2012', transform)
d4 = torchvision.datasets.get('mnist', 'train', transform)
...
all with the same high-level API. But this makes some assumptions, and maybe we might want to enforce those assumptions/limitations at a higher level than the Dataset
API.
Sure, this doesn't address what attributes each dataset should have, and that's one of the points that I want to discuss in this issue.
@fmassa IMO, the abstract dataset could include at least data and transforms (field for torchtext) and full implementation of getitem, len. Those components are highly repetitive across most dataset (if my understanding is correct). something like this:
`
class TextDataset(torch.utils.data.Dataset):
def __init__(data, fields):
self.data = data
self.fields = fields
def __getitem__(self, index):
return self.data[index]
def __len__(self):
return len(self.data)
`
We could carefully add more stuffs later on as the abstract class evolves.
@zhangguanheng66 transform
s as well as the root
of the dataset are already part of the abstract dataset class called VisionDataset
. I agree on the other part. We could do something like this:
class VisionDataset(...):
...
def __getitem__(self, index):
image, target = self._images[index], self.targets[index]
if self.transform is not None:
image, target = self.transform(image, target)
return image, target
def __len__(self):
return len(self._images)
@property
def _images(self):
raise NotImplementedError
@property
def _targets(self):
raise NotImplementedError
@pmeier Sure, I agree. I'm just wondering the pros and cons to include data (e.g. images, targets) in the abstract dataset class. I guess it won't hurt us, right?
I'm late to the party here and have comments on a few different pieces of this thread. Here are a few thoughts I've collected about things that I see commonly requested:
1) I also agree that @zhangguanheng66's suggestion is a good place to start since almost every dataset has these options. 2) Already mentioned, a get_metadata call of some sort on the dataset is one of the most requested items I get from the scientists I work with so that they can write code that reacts to the data that a dataset returns without having to retrieve samples or build / download the dataset. Standardizing what this returns is a bit tricky, but the call is pretty straightforward. 3) A lot of users ask about creating a canonical way to attach a sample id to each sample to use for debugging or logging information / features for a sample. This also feels like a pretty general use-case. It's not hard to create your own, but if there was a standard way of doing this, it would make it easier to write reusable code...I've solved this in the past with returning samples as dicts with fields like, "input" , "target", "metadata", but there are downstream problems with some of the composable dataset APIs. 4) @pmeier, thanks for the initial thoughts! I think they are really useful. I do think that ImageFolder is special for more than just the lack of a download. It provides an API to a datastore which could be different for each instance while ClassificationDataset et al are more like interfaces for the underlying datasets to implement. I agree they are both datasets / can be viewed in a similar framework, but my point here is that when someone is using an ImageFolder they are frequently doing something a bit different than when a user would be building a ClassificationDataset (and I think it's actually pretty common if you have any sort of proprietary data). You could even imagine using an ImageFolder to build a classification dataset, for example. 5) A minor point about the ClassificationDataset / other abstractions. A common use case that is not mentioned above with the get_metadata function / Classification datasets is that we might want to allow multi-class datasets (so you would return multiple ints). Admittedly, this quickly gets into attempting to cover every type of dataset, but this is something I work with a lot.
Since there has been no more input for some days, can I regard the proposal of @zhangguanheng66 as consensus and thus add it to the top post?
@pmeier we are still discussing about this internally, and we still seem to have no consensus on how to structure it yet.
How does this approach scale to different tasks/datasets with multiple tasks? On Coco, you can do detection, human keypoint detection and segmentation (of either objects, stuff with CocoStuff or everything toghether with panoptic segmentation).
Another part of "the Python way" is duck typing: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck". No need to have a common superclass to implement an interface, as long as the required parts are there.
Adding to the conversation, I also have been trying to come up with a solution such that I can use a similar interface to load different types of data. The structure that I have for now is https://gist.github.com/ahmed-shariff/dbf3bee6fa2bf4924f95ac2b65620386, which I am trying to include in my pipeline. The idea is I am trying to go a step higher. Based on the dataset and how it's used, I'd implement the train_data_load_function
and Dataset
classes, which will be used to load and provide the data in a uniform way. As @aadcock points out, How the data is to be returned is entirely dependent on how it'll be used.
This standardization will make writing code so much cleaner, now it's like every constructor of every dataset has its own params and makes things so unflexible during development. For instance cifar10 has the train param indicating which subset is used while svhn has split. My two cents are to draw inspiration from the flexibility and modularity of the linux system. For instance if for some datasets there's no download possible due to login or other things then allow the user to make their own download method and hook that to the DataLoader. But, the naming convention should be the same across all of them, either that be it loader.dataset.data
and loader.dataset.targets
or something else. The most sensible convention for constructor arguments in order to split the dataset would have been train=True/False
if a dataset doesn't have split then whenever train=False
is invoked raise a warning to the user to let them know that there's not split and they'll have to do their own split.
Any progress on this topic?
I wanted to add wrappers for several pose estimation datasets (one of the popular tasks not mentioned here) because so far every research team builds them differently and it's quite a nightmare out there.
I support @ahmed-shariff's argument that we need the ability to specify data format for a specific task. This is especially relevant when working with 3d perception of videos, where each sample is not {image, class_id}
but a batch with camera parameters that are batch-specific and frame-specific keypoint locations.
I'm not sure if that's something to be solved using a special Sampler perhaps, so far I build my own Datasets to only handle indexing and data loading in done with transforms, which is quite versatile.
@wolterlw there has been no further progress on this topic for now.
As you point out, each task has its own particularities and there is no single abstraction (more specific than just Dataset
) that can cover them all.
My current thinking is that there should be not much structure that is imposed by default -- the more structure we add, the less flexible we will be. Worse still, trying to add elaborated abstractions to cover many cases will just make the entry-cost to the library higher.
One option that I considered at some point was just to let the datasets from torchvision return the raw data as they are stored originally. This gives full flexibility on how the user treats it. Then, each different pipeline / application can have a specific transform / TransformDataset
which converts the data into the desired format.
Something in the lines of
class MyDataset:
def __getitem__(self, idx):
...
return dict(image=image, image_id=image_id, params=params)
class MyDatasetFormatter:
def __init__(self, dataset):
self.dataset = dataset
def __getitem__(self, idx):
data = self.dataset[idx]
return data['image'], data['params']
One reason to do so is because the output format of the dataset is tightly coupled with the training loop that you have. And because torchvision doesn't provide training loops in the library itself (but only reference training scripts), I haven't moved forward with this DatasetTransform
approach.
Let me know what you think
@fmassa I agree with that. The code I had shared before actually was built around that idea (in retrospect, that code is a little wonky because I am trying to do alot of things at the same time). How the data is loaded needs to be decoupled from how the data is used by a pipeline.
Does something along these lines make sense:
dataset_foo = DatasetFoo()
foobar_pipeline_train_dataset = BarPipelineDataFormatter(dataset_foo.train_data(), mode=TRAIN)
foobar_pipeline_test_dataset = BarPipelineDataFormatter(dataset_foo.test_data(), mode=TEST)
@fmassa I've mentioned in #963 that using a Dataset subclass to simply index all the files and offload the rest of the work to Transforms seems to be reasonably flexible.
Also I want to stress that the problem right now is that most research projects that utilize public datasets write their own wrappers (which vary a lot in readability and speed of execution) as there are no general guidelines on how to build those and there's no single place to look if somebody else has done the work earlier.
It's unreasonable to spend this much time rewriting pretty much the same functionality. I've started a discussion on PyTorch forum, but it didn't really lead anywhere yet. I propose to create an addition to Hub or even a simple github repo that would house wrappers for research datasets
.
├── dataset1
│ ├── dataset.py
│ ├── transforms.py
│ ├── viz_tools.py
│ └── README
├── dataset2
│ ├── dataset.py
│ ├── transforms.py
│ ├── viz_tools.py
│ └── README
├── dataset3
│ ├── dataset.py
│ ├── transforms.py
│ ├── viz_tools.py
│ └── README
├── CONTRIBUTING
└── README
Each dataset's README would tell you:
transforms.py
should provide dataset-specific transforms and viz_tools.py
- functions to visualize the annotation.
Point being there should be a single place where people can look for dataset wrappers that are ready to be used, because in my practice it takes a lot of time to build those for yourself.
Let me know what you think
I also recently came across an inconsistency between datasets which is slightly annoying, and figured I'd mention it here instead of in a new issue (but happy to do so if it's easier). Currently, the targets of CIFAR-10 and CIFAR-100 are of type list, while the targets of MNIST are of type torch.Tensor (compare the CIFAR code with the MNIST code). It would be good to standardize the types all as torch.Tensor.
@tyleryzhu You are right that this is inconsistent, but we never aimed for consistency at the attribute level. The target
attribute is not documented anywhere so we simply failed to make this private. The return type of __getitem__
should be int
in both cases, i.e. isinstance(dataset[idx][1], int)
should hold for both datasets.
This is a discussion issue which was kicked of by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the achieved consensus during the discussion.
Disclaimer: I have never worked with segmentation or detection datasets. If I make same wrong assumption regarding them, feel free to correct me. Furthermore, please help me to fill in the gaps.
Proposed Structure
This issues presents the idea to standardize the
torchvision.datasets
. This could be done by adding parameters to theVisionDataset
(split
) or by subclassing it and add task specific parameters (classes
orclass_to_idx
) to the newclass
es. I imagine it something like this:For our tests we could then have a
generic_*_dataset_test
as is already implement forClassificationDataset
s.VisionDataset
As discussed in #1067 we could unify the argument that selects different parts of the dataset. IMO
split
as astr
is the most general, but still clear term for this. I would implement this as positional argument within the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are theFakedata
andImageFolder
datasets, which will be discussed separately.IMO every dataset should have a
_download
method in order to be useful for every user of this package. We could have the constructor havedownload=True
as keyword argument and call thedownload
method within it. As above, theFakedata
andImageFolder
datasets will be discussed below.Fakedata
andImageFolder
What makes these two datasets special, is that there is nothing to download and they are not splitted in any way. IMO they are not special enough to not generalise the
VisionDataset
as stated above. I propose that we simply remove thesplit
anddownload
argument from their constructor and raise an exception if someone calls thedownload
method.Furthermore the
Fakedata
dataset is currently aClassificationDataset
. We should also create aFakeSegmentationData
and aFakeDetectionData
dataset.ClassificationDataset
The following datasets belong to this category:
CIFAR*
,ImageNet
,*MNIST
,SVHN
,LSUN
,SEMEION
,STL10
,USPS
,Caltech*
PIL.Image, int
if indexedclasses
parameter, which is atuple
with all available classes in human-readable formclass_to_idx
parameter, which is dictionary that maps the human-readable class to its index used as target. I propose to change the direction, i.e. create aidx_to_class
parameter, since IMO this is the far more common transformation.SegmentationDataset
The following datasets belong to this category:
VOCSegmentation
DetectionDataset
The following datasets belong to this category:
CocoDetection
,VOCDetection
ToDo
Caltech101
~, ~Caltech256
~,CelebA
,CityScapes
,Cococaptions
,Flickr8k
,Flickr30k
, ~LSUN
~,Omniglot
,PhotoTour
,SBDataset
(shouldn't this be just calledSBD
?),SBU
, ~SEMEION
~, ~STL10
~, and ~USPS
~SegmentationDataset
andDetectionDataset
~Thoughts and suggestions?