minhtriet opened this issue 3 years ago
Most datasets currently available in Torchmeta rely on a hierarchy of three objects:

- `Dataset`, which is simply a PyTorch dataset, responsible for getting the individual examples for a given label. For example, it can be a dataset containing all the (20) examples of the letter A in Omniglot.
- `ClassDataset`, which produces the datasets for the different classes. Each index of this class corresponds to a single label. For example, in Omniglot this contains 1028 elements, and `class_dataset[0]` returns an instance of `Dataset` (above) containing all the examples of `images_background/Alphabet_of_the_Magi/character01`.
- `CombinationMetaDataset`, which combines multiple indices (for example `(0, 1, 2, 3, 4)`) to create a task over the corresponding labels, the individual indices corresponding to the ones in `ClassDataset` above.
Something you could do in your case is to tokenize all the elements of `Dataset` at once, because this is essentially a batch of data (from which the sampler is going to sample to create the actual datasets for the task).
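The suggestion above could be sketched as follows. This is a plain-Python illustration, not Torchmeta's API: `TokenizedSentenceDataset` and `batch_tokenize` are hypothetical names, and in practice the class would subclass `torch.utils.data.Dataset` and use a real tokenizer.

```python
class TokenizedSentenceDataset:
    """Hypothetical per-label dataset that tokenizes its whole batch of
    sentences once, at construction time, instead of per __getitem__ call."""

    def __init__(self, sentences, tokenizer):
        # One batched tokenizer call for every sentence of this label.
        self.encodings = tokenizer(sentences)

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, index):
        # The sampler only pays the cost of an indexed lookup here.
        return self.encodings[index]


def batch_tokenize(sentences):
    # Toy whitespace tokenizer, standing in for a real batched tokenizer.
    return [s.lower().split() for s in sentences]


dataset = TokenizedSentenceDataset(["The letter A", "Another example"],
                                   batch_tokenize)
```

The tokenization cost is then paid once per label rather than once per sampled example.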
Another option could be to look into allowing `__getitem__(index)` to take a batch (list) of indices for `index`. This is already possible in standard PyTorch datasets, and since Torchmeta datasets are essentially instances of PyTorch datasets, this could work here too. I tried to include this in Torchmeta at some point to improve sampling, but there was no particular improvement for image datasets, mainly because Torchvision transforms (e.g. image loading, `Resize`, etc.) only accept single images, so I ended up not pursuing it further.
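The batched-`__getitem__` idea could look like the following minimal sketch. This is a hypothetical pattern, not something Torchmeta currently does:

```python
class BatchIndexableDataset:
    """Sketch of a dataset whose __getitem__ accepts either a single index
    or a list/tuple of indices (hypothetical; not Torchmeta's behaviour)."""

    def __init__(self, examples):
        self.examples = examples

    def __getitem__(self, index):
        if isinstance(index, (list, tuple)):
            # Batched access: fetch (and potentially process, e.g.
            # tokenize) all requested examples in a single call.
            return [self.examples[i] for i in index]
        return self.examples[index]

    def __len__(self):
        return len(self.examples)


ds = BatchIndexableDataset(["a", "b", "c", "d"])
```

A sampler aware of this could then request `ds[[0, 2]]` in one call instead of `ds[0]` and `ds[2]` separately, which is where batched tokenization would pay off for text.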
Currently I am relying on `__getitem__(self, index)` to tokenize the sentence at `index`. However, there should be a more efficient way to tokenize the whole batch, instead of one sentence at a time. Can this be done in `pytorch-meta`? I have yet to find an example of it.
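To illustrate the difference being asked about: with a tokenizer that supports batched input, tokenizing per `__getitem__` call means one tokenizer invocation per example, while tokenizing the whole label up front means a single invocation. The toy tokenizers below are stand-ins (real ones, e.g. from the `tokenizers` library, batch much more efficiently internally):

```python
calls = {"n": 0}  # count tokenizer invocations


def tokenize_one(sentence):
    # Stand-in per-sentence tokenizer: one call per example.
    calls["n"] += 1
    return sentence.split()


def tokenize_batch(sentences):
    # Stand-in batched tokenizer: one call for the whole batch.
    calls["n"] += 1
    return [s.split() for s in sentences]


sentences = ["one two", "three four", "five six"]

# Per-example tokenization, as done inside __getitem__: 3 calls.
per_item = [tokenize_one(s) for s in sentences]

# Batched tokenization up front, as suggested earlier in the thread: 1 call.
batched = tokenize_batch(sentences)
```

Both produce identical tokens; the batched path just amortizes the per-call overhead across the whole label.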