salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

DataParallel #64

Open djstrong opened 6 years ago

djstrong commented 6 years ago

I am trying to run the model on multiple GPUs. SplitCrossEntropyLoss probably causes some trouble; any hints?

File "main.py", line 209, in train
    raw_loss = criterion(model.module.decoder.weight, model.module.decoder.bias, output, targets)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 115, in forward
    split_targets, split_hiddens = self.split_on_targets(hiddens, targets)
  File "/net/scratch/people/plgkwrobel/awd-lstm-lm/splitcross.py", line 103, in split_on_targets
    split_hiddens.append(hiddens.masked_select(tmp_mask.unsqueeze(1).expand_as(hiddens)).view(-1, hiddens.size(1)))
  File "/net/people/plgkwrobel/env-pytorch/lib/python3.6/site-packages/torch/tensor.py", line 302, in expand_as
    return self.expand(tensor.size())
RuntimeError: The expanded size of the tensor (280) must match the existing size (550) at non-singleton dimension 0

akurniawan commented 6 years ago

I am experiencing the same problem. @djstrong, were you able to solve it?

akurniawan commented 6 years ago

Found the fix:

  1. You need to add dim=1 in your nn.DataParallel constructor, because the data passed through the network is already in the form [steps, batch_size, dims]; DataParallel needs to know which dimension to split on.
  2. You need to move https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/model.py#L93 out of the model's forward, since nn.DataParallel uses the same dim parameter to gather the final result. That line flattens the tensor from [steps, batch, dims] into [steps * batch, dims], so with dim=1 the results would be gathered along the dims dimension instead of along steps * batch. A rough sketch of both changes is shown below the list.
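
Here is a minimal sketch of how the two changes might look in the training loop. It assumes the model is wrapped as parallel_model while the unwrapped model keeps its decoder; the names and surrounding code are illustrative, and hidden-state handling across replicas may still need extra care:

```python
import torch.nn as nn

# 1. Split inputs along the batch dimension (dim=1), since the RNN tensors
#    are laid out as [steps, batch, dims].
parallel_model = nn.DataParallel(model, dim=1)

# ... inside train() ...
output, hidden = parallel_model(data, hidden)  # output: [steps, batch, dims]

# 2. Flatten only after DataParallel has gathered the per-GPU outputs along dim=1,
#    i.e. do here what model.py#L93 used to do inside the model's forward().
output = output.view(output.size(0) * output.size(1), output.size(2))

raw_loss = criterion(model.decoder.weight, model.decoder.bias, output, targets)
```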

Let me know if it doesn't work for you!

djstrong commented 6 years ago

Thanks! I will try.