salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

DataParallel #64

Open djstrong opened 6 years ago

djstrong commented 6 years ago

I am trying to run the model on multiple GPUs. SplitCrossEntropyLoss seems to be causing trouble. Any hints?

  File "", line 209, in train
    raw_loss = criterion(model.module.decoder.weight, model.module.decoder.bias, output, targets)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/", line 65, in parallel_apply
    raise output
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/", line 41, in _worker
    output = module(*input, **kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/scratch/people/plgkwrobel/awd-lstm-lm/", line 115, in forward
    split_targets, split_hiddens = self.split_on_targets(hiddens, targets)
  File "/net/scratch/people/plgkwrobel/awd-lstm-lm/", line 103, in split_on_targets
    split_hiddens.append(hiddens.masked_select(tmp_mask.unsqueeze(1).expand_as(hiddens)).view(-1, hiddens.size(1)))
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/", line 302, in expand_as
    return self.expand(tensor.size())
RuntimeError: The expanded size of the tensor (280) must match the existing size (550) at non-singleton dimension 0

akurniawan commented 6 years ago

I am experiencing the same problem. @djstrong, were you able to solve it?

akurniawan commented 6 years ago

Found the fix:

  1. Pass dim=1 to the nn.DataParallel constructor. The data passed through the network is already shaped [steps, batch_size, dims], so DataParallel needs to be told to split along the batch dimension (dimension 1) rather than its default, dimension 0.
  2. Move the line that flattens the output from [steps, batch, dims] into [steps * batch, dims] out of the model wrapped by nn.DataParallel, because nn.DataParallel uses the same dim to merge the per-GPU results. With dim=1, if the output is already flattened, the results are merged along the dims dimension instead of along steps * batch.
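
The shape arithmetic behind both points can be traced without a GPU. The following is only a plain-Python sketch that tracks tensor shapes as tuples. The helpers scatter, gather and flatten are hypothetical stand-ins that mimic how nn.DataParallel(model, dim=1) splits inputs and concatenates outputs along dim, and the sizes (70 steps, batch of 80, 400 dims, 2 replicas) are made up for illustration.

```python
# Shape-only sketch: no PyTorch needed, tuples stand in for tensor sizes.

def scatter(shape, dim, replicas):
    """Split `shape` into equal chunks along `dim` (assumes it divides evenly)."""
    assert shape[dim] % replicas == 0
    chunk = list(shape)
    chunk[dim] //= replicas
    return [tuple(chunk)] * replicas


def gather(shapes, dim):
    """Concatenate per-replica output shapes along `dim`."""
    merged = list(shapes[0])
    merged[dim] = sum(s[dim] for s in shapes)
    return tuple(merged)


def flatten(shape):
    """[steps, batch, dims] -> [steps * batch, dims]."""
    steps, batch, dims = shape
    return (steps * batch, dims)


STEPS, BATCH, DIMS, REPLICAS = 70, 80, 400, 2  # illustrative sizes only

# With dim=1 each replica receives a slice of the batch dimension.
per_replica = scatter((STEPS, BATCH, DIMS), dim=1, replicas=REPLICAS)

# Flatten inside the wrapped model: gathering on dim=1 concatenates along dims.
flatten_inside = gather([flatten(s) for s in per_replica], dim=1)

# Flatten after gathering: one row per (step, batch) pair, as the loss expects.
flatten_outside = flatten(gather(per_replica, dim=1))

print(flatten_inside)   # (2800, 800)
print(flatten_outside)  # (5600, 400)
```

Only the second result has steps * batch rows of width dims, which is why the flatten has to happen after the per-GPU outputs are merged.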

Let me know if it doesn't work for you!

djstrong commented 6 years ago

Thanks! I will try.