tristandeleu / pytorch-meta

A collection of extensions and data-loaders for few-shot learning & meta-learning in PyTorch
https://tristandeleu.github.io/pytorch-meta/
MIT License

How to process the whole batch #102

Open minhtriet opened 3 years ago

minhtriet commented 3 years ago

Currently I am relying on `__getitem__(self, index)` to tokenize sentence `index`. However, there may be a way to tokenize the whole batch more efficiently, instead of one sentence at a time. Can this be done in pytorch-meta? I have yet to find an example of it.
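For context, here is a minimal plain-Python sketch of the per-sentence approach being described (the tokenizer and dataset names are hypothetical stand-ins; a real Torchmeta dataset would subclass the appropriate Dataset class):

```python
def toy_tokenize(sentence):
    # Hypothetical stand-in for a real tokenizer: splits on whitespace.
    return sentence.split()

class SentenceTaskDataset:
    """Illustrative only: in Torchmeta this would subclass the task-level Dataset."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        # Tokenization happens per item, on every access -- the
        # inefficiency the question is about.
        return toy_tokenize(self.sentences[index])

ds = SentenceTaskDataset(["hello world", "few shot learning"])
print(ds[1])  # ['few', 'shot', 'learning']
```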

tristandeleu commented 3 years ago

Most datasets currently available in Torchmeta rely on a hierarchy of three objects.

Something you could do in your case is to tokenize all the elements of the Dataset at once, since this is essentially a batch of data (from which the sampler samples to create the actual datasets for the task).
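A sketch of that idea, with a hypothetical batched tokenizer: tokenize every sentence once, up front, and have `__getitem__` only index into the cached result (plain Python for brevity; a real version would subclass the Torchmeta Dataset):

```python
def toy_batch_tokenize(sentences):
    # Hypothetical stand-in for a batched tokenizer call
    # (e.g. a tokenizer applied to a whole list of sentences at once).
    return [s.split() for s in sentences]

class PreTokenizedTaskDataset:
    """Illustrative only: tokenizes the whole task dataset once, in the constructor."""

    def __init__(self, sentences):
        # One batched tokenization for the entire task dataset;
        # individual accesses below are then just list indexing.
        self.tokens = toy_batch_tokenize(sentences)

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        return self.tokens[index]

ds = PreTokenizedTaskDataset(["hello world", "few shot learning"])
print(ds[0])  # ['hello', 'world']
```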

Another option could be to allow `__getitem__(index)` to accept a batch (a list) of indices as `index`. This is already possible in standard PyTorch datasets, and since Torchmeta datasets are essentially instances of PyTorch datasets, it could be possible here too. I tried to include that in Torchmeta at some point to improve sampling, but there was no particular improvement for image datasets, mostly because Torchvision transforms (e.g. image loading, Resize, etc.) only accept single images, so I ended up not pursuing it further.
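A sketch of what accepting a batch of indices could look like (hypothetical, plain Python; not Torchmeta's actual API):

```python
class BatchIndexableDataset:
    """Illustrative only: a __getitem__ that also accepts a list of indices."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        if isinstance(index, (list, tuple)):
            # Batched access: all requested items are fetched in one call,
            # which is where a batched tokenizer or transform could be applied.
            return [self.data[i] for i in index]
        # Single-index access, as in a standard PyTorch dataset.
        return self.data[index]

ds = BatchIndexableDataset(["a", "b", "c"])
print(ds[[0, 2]])  # ['a', 'c']
```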