kaituoxu / Speech-Transformer

A PyTorch implementation of Speech-Transformer, an end-to-end ASR model based on the Transformer network, for Mandarin Chinese.

Why does changing the ngpu value in run.sh have no effect for multi-GPU training? #2

Closed counter0 closed 5 years ago

kaituoxu commented 5 years ago
  1. ngpu defaults to 1; changing it to another value has no effect.
  2. Multi-GPU training is not supported yet. If you need it, you can use PyTorch's nn.DataParallel(); a minimal sketch follows below.
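For reference, wrapping a model with nn.DataParallel is essentially a one-line change. A minimal, self-contained sketch (the toy nn.Linear stands in for the real Speech-Transformer model; this is not code from this repository):

```python
import torch
import torch.nn as nn

# Toy stand-in for the Speech-Transformer model; only the wrapping matters here.
model = nn.Linear(320, 10)

if torch.cuda.device_count() > 1:
    # dim=0 (the default) means the batch dimension is split across the visible
    # GPUs and the per-GPU outputs are gathered back along that same dimension.
    model = nn.DataParallel(model, dim=0)
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(18, 320)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)    # inputs are scattered, outputs gathered on the default device
print(out.shape)  # torch.Size([18, 10])
```

The forward and backward passes stay unchanged; DataParallel handles the scatter and gather internally.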
chenpe32cp commented 5 years ago

After adding DataParallel following instructions I found online, I get the error "Gather got an input of invalid size: got [10,35,9961], but expected [10,26,9961]". I haven't been able to find the cause; do you know roughly what might lead to this?

kaituoxu commented 5 years ago

@chenpe32cp Did you set batch_first?

chenpe32cp commented 5 years ago

Where should the batch_first parameter be set?


kaituoxu commented 5 years ago

My mistake, I meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel

chenpe32cp commented 5 years ago

> My mistake, I meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel

I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when the batch is split across GPUs, each GPU ends up with a different max_len, so the final gather fails. I haven't found a solution yet, though...

chenpe32cp commented 5 years ago


When using two GPUs, I traced the bug to DataParallel splitting the batch (batch_size=18) into two halves of 9, but the two GPUs end up with different sequence lengths, 35 and 27. How should this be fixed?

```
padded_input's shape: torch.Size([18, 292, 320])
input_lengths's shape: torch.Size([18])
padded_target's shape: torch.Size([18, 34])
max_len: 35 pad: torch.Size([9, 35])
max_len: 35 pad: torch.Size([9, 35])
max_len: 27 pad: torch.Size([9, 27])
max_len: 27 pad: torch.Size([9, 27])
Traceback (most recent call last):
  File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 168, in <module>
    main(args)
  File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 162, in main
    solver.train()
  File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 82, in train
    tr_avg_loss = self._run_one_epoch(epoch)
  File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 169, in _run_one_epoch
    pred, gold = self.model(padded_input, input_lengths, padded_target)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [9, 35, 9961], but expected [9, 27, 9961] (gather at torch/csrc/cuda/comm.cpp:183)
```
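For context on the error: nn.DataParallel gathers the per-GPU outputs by concatenating them along the batch dimension, which requires every other dimension to match. A CPU-only reproduction using the shapes from the traceback above (purely illustrative, not code from the repository):

```python
import torch

# Each replica pads its decoder output to its own local max_len (35 vs 27),
# so the gather step cannot concatenate the two outputs along the batch dim.
out_gpu0 = torch.randn(9, 35, 9961)  # replica 0: padded to max_len=35
out_gpu1 = torch.randn(9, 27, 9961)  # replica 1: padded to max_len=27

try:
    torch.cat([out_gpu0, out_gpu1], dim=0)  # same shape requirement as DataParallel's gather
except RuntimeError as e:
    print(e)  # sizes must match except in the concatenation dimension
```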

counter0 commented 5 years ago

I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.
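For anyone looking into the Horovod route: a minimal sketch of how a PyTorch training script is usually adapted for it (this is not counter0's code; the model and data are toy stand-ins, and only the hvd.* calls are the Horovod-specific part):

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()  # one process per GPU, launched with horovodrun / mpirun
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = nn.Linear(320, 10).to(device)  # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers at every optimizer step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Give each worker its own shard of the data.
dataset = torch.utils.data.TensorDataset(torch.randn(64, 320), torch.randint(0, 10, (64,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```

A two-GPU run would then be launched with something like `horovodrun -np 2 python train.py`; each worker keeps its own per-GPU batch size, so the effective batch size scales with the number of workers.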

chenpe32cp commented 5 years ago

> I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.

Could you share your multi-GPU training code so I can use it as a reference? I've been debugging for a long time and still haven't gotten it to work.

kaituoxu commented 5 years ago

> I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when the batch is split across GPUs, each GPU ends up with a different max_len, so the final gather fails. I haven't found a solution yet, though...

Here is an example solution: https://github.com/kaituoxu/Listen-Attend-Spell/blob/master/src/models/encoder.py#L34-L42
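For readers who do not want to open the link: the linked encoder appears to follow the standard DataParallel recipe from the PyTorch FAQ, namely padding each replica's output back to the length of the full (pre-scatter) batch so that the gathered tensors have identical shapes. A rough sketch of that idea (illustrative only, not copied from Listen-Attend-Spell or this repository):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class Encoder(nn.Module):
    def __init__(self, input_size=320, hidden_size=256):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, padded_input, input_lengths):
        # padded_input: (batch, T, D). T is the global max length, because
        # DataParallel scatters a tensor that was padded before the split.
        total_length = padded_input.size(1)
        packed = pack_padded_sequence(padded_input, input_lengths.cpu(),
                                      batch_first=True,
                                      enforce_sorted=False)  # needs PyTorch >= 1.1
        output, hidden = self.rnn(packed)
        # Pad back to total_length, not to the replica-local max length, so
        # every GPU returns a tensor with the same time dimension.
        output, _ = pad_packed_sequence(output, batch_first=True,
                                        total_length=total_length)
        return output, hidden

enc = Encoder()
x = torch.randn(4, 50, 320)
lengths = torch.tensor([50, 42, 30, 7])
out, _ = enc(x, lengths)
print(out.shape)  # torch.Size([4, 50, 256])
```

The same principle would apply to the decoder targets here: if the padded length is derived from a quantity that is identical on every replica (for example, the size of the already-padded target tensor rather than the per-shard maximum of the individual lengths), the gathered shapes match.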

xdcesc commented 5 years ago

> I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.

Could you please share the code?