lanwuwei / SPM_toolkit

Neural network toolkit for sentence pair modeling.

Train SNLI model with multiple GPU #21

Closed: aayushee closed this issue 5 years ago

aayushee commented 5 years ago

Hi

I am trying to train the ESIM SNLI model on my own dataset, where some premises are very long (around 5500 tokens). I increased the maximum length, but then my GPU cannot handle the data with a batch size of 64. I have a multi-GPU environment and set the CUDA_VISIBLE_DEVICES variable to multiple GPUs, but the code still uses only one GPU. I also wrapped the model with DataParallel as shown in this tutorial (https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html), as follows:

```python
model = ESIM(dim_word, 3, n_words, dim_word, pretrained_emb)
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
```

I get this error on doing so:

```
ValueError: Expected input batch_size (256) to match target batch_size (64)
```

I am not able to understand why the target batch size doesn't change when multiple GPUs are used. Can anyone tell me how to use multiple GPUs to train the ESIM SNLI model? Or is there any other way to handle large sequence lengths in the model?
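For what it's worth, here is a minimal, self-contained sketch of one way this exact mismatch can arise (ToyModel is hypothetical and only stands in for ESIM; whether this is the actual cause in SPM_toolkit is not confirmed). If the model consumes sequence-first tensors of shape (seq_len, batch, dim), DataParallel's default dim=0 scatters along the sequence axis, so every replica sees the full batch of 64 and the gathered logits end up with batch size num_gpus * 64:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for ESIM: sequence-first input, batch-first output,
# as in many LSTM-based models.
class ToyModel(nn.Module):
    def __init__(self, dim, n_classes):
        super(ToyModel, self).__init__()
        self.rnn = nn.LSTM(dim, dim)          # expects (seq_len, batch, dim)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, x):
        _, (h, _) = self.rnn(x)               # h: (1, batch, dim)
        return self.out(h.squeeze(0))         # logits: (batch, n_classes)

model = nn.DataParallel(ToyModel(300, 3)).cuda()  # default scatter dim=0
x = torch.randn(80, 64, 300).cuda()               # (seq_len=80, batch=64)
target = torch.randint(0, 3, (64,)).cuda()

logits = model(x)   # scatter splits x along dim 0 = seq_len, so each of k
                    # replicas returns (64, n_classes); gathering along dim 0
                    # yields (k * 64, n_classes) instead of (64, n_classes)
loss = nn.CrossEntropyLoss()(logits, target)      # batch_size (k*64) vs (64)
```

If this is indeed the cause, one fix would be to make the model batch-first so that dim 0 really is the batch axis before wrapping it in DataParallel.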

lanwuwei commented 5 years ago

Hi, it would be very interesting if you could make it work in a multi-GPU environment. In my experiments I only had one GPU, so I didn't try your case, but I think DataParallel is the right choice; your current bug may be caused by some incorrect setting. Multi-GPU training can handle a huge amount of training data, but it doesn't help with very long (~5500-token) sentences. My suggestions: 1) if the percentage of long sentences is not significant, just discard them; 2) if it is significant, treat the long text as a paragraph rather than a sentence, split it into many short sentences, and run those through the LSTM, as sketched below.
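A minimal sketch of suggestion 2): split a very long premise into fixed-size chunks ("sentences"), encode each chunk with the same LSTM, and pool the chunk encodings into one vector. The HierEncoder name, the chunk_len value, and the mean pooling over chunk vectors are all illustrative choices here, not part of SPM_toolkit:

```python
import torch
import torch.nn as nn

class HierEncoder(nn.Module):
    def __init__(self, dim_word, dim_hidden, chunk_len=100):
        super(HierEncoder, self).__init__()
        self.chunk_len = chunk_len
        self.lstm = nn.LSTM(dim_word, dim_hidden, batch_first=True)

    def forward(self, emb):               # emb: (batch, seq_len, dim_word)
        b, t, d = emb.size()
        pad = (self.chunk_len - t % self.chunk_len) % self.chunk_len
        if pad:                           # zero-pad so seq_len divides evenly
            emb = torch.cat([emb, emb.new_zeros(b, pad, d)], dim=1)
        n_chunks = emb.size(1) // self.chunk_len
        chunks = emb.reshape(b * n_chunks, self.chunk_len, d)
        _, (h, _) = self.lstm(chunks)     # h: (1, b * n_chunks, dim_hidden)
        h = h.squeeze(0).reshape(b, n_chunks, -1)
        return h.mean(dim=1)              # pool chunks -> (b, dim_hidden)

enc = HierEncoder(dim_word=300, dim_hidden=300)
x = torch.randn(8, 5500, 300)             # a batch of very long premises
print(enc(x).size())                      # torch.Size([8, 300])
```

Instead of mean pooling, the chunk encodings could also be fed through a second, higher-level LSTM, which would preserve the order of the "sentences" within the paragraph.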

aayushee commented 5 years ago

Hi, thanks for your reply and suggestions. I realized multi-GPU won't handle such long sequences. I am able to discard a lot of unnecessary text, which shortens the sequences considerably, and I can drop the few data points with very long sequences as well. The model performs reasonably well with the shorter sequence lengths now. If I still need to work with longer sequences, I will follow the LSTM route.