pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

How to use datasets for distributed training? #669

Open styxjedi opened 4 years ago

styxjedi commented 4 years ago

❓ Questions and Help

Description

I built a dataset from my corpus, using each line as an Example. It worked fine until I tried to use it for distributed training.

It seems that torch.nn.parallel.DistributedDataParallel has to be used with torch.utils.data.distributed.DistributedSampler, which is not compatible with torchtext datasets.

Is there any way to use torchtext datasets for distributed training? Thanks!
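
For context, this is the standard sampler pattern that DistributedDataParallel training relies on. A minimal sketch, assuming the process group has already been initialized (e.g. torch.distributed.init_process_group under a launcher); the toy dataset is a placeholder, not from this issue:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Any map-style dataset works here: it just needs len() and integer indexing.
# A plain list of tensors is used as a stand-in for a real corpus.
dataset = [torch.randn(10) for _ in range(1000)]

# DistributedSampler shards the indices so each process sees a distinct subset.
# It reads the rank and world size from the initialized process group.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for batch in loader:
        pass  # forward/backward with a DistributedDataParallel-wrapped model
```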

zhangguanheng66 commented 4 years ago

I don't think the legacy datasets are compatible with torch.nn.parallel.DistributedDataParallel. The new datasets in torchtext should be.

styxjedi commented 4 years ago

Yes, I think you're right.

But if I build a new dataset with torchtext.data, how can I make it compatible with torch.nn.parallel.DistributedDataParallel or torch.utils.data.DataLoader?

jiangxiluning commented 4 years ago

I need this too, unfortunately.

zhangguanheng66 commented 4 years ago

You need to store the dataset as a list (see the self.data attribute in the new datasets in torchtext.experimental). Then it should be compatible with torch.utils.data.DataLoader.
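
A minimal sketch of what that could look like: a map-style dataset that materializes the corpus as a list in self.data, which DataLoader and DistributedSampler can index into. The class name, file path, and vocab mapping are illustrative assumptions, not part of the torchtext API:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class TextDataset(Dataset):
    """Map-style dataset: the whole corpus lives in self.data as a list,
    mirroring the self.data pattern in torchtext.experimental datasets."""

    def __init__(self, path, vocab):
        # vocab is assumed to be a token -> id mapping (e.g. a dict).
        self.data = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # stand-in for a real tokenizer
                self.data.append(torch.tensor([vocab[t] for t in tokens]))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Because the dataset supports len() and indexing, DistributedSampler can
# shard it and DataLoader can batch it; a collate_fn pads variable lengths.
# dataset = TextDataset("corpus.txt", vocab)
# loader = DataLoader(dataset, batch_size=32,
#                     sampler=DistributedSampler(dataset),
#                     collate_fn=lambda b: torch.nn.utils.rnn.pad_sequence(b))
```

With this shape, the same dataset object plugs into both single-process training (no sampler) and DistributedDataParallel training (with DistributedSampler), without any torchtext-specific iterator.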