kaituoxu / Speech-Transformer

A PyTorch implementation of Speech-Transformer, an end-to-end ASR model based on the Transformer network, for Mandarin Chinese.

Why does changing the ngpu value in run.sh have no effect for multi-GPU training? #2

Closed counter0 closed 5 years ago

kaituoxu commented 5 years ago
  1. ngpu defaults to 1; changing it to another value has no effect.
  2. Multi-GPU training is not supported yet. If you need it, you can use PyTorch's nn.DataParallel(); a minimal sketch follows below.
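For reference, wrapping a model with nn.DataParallel is essentially a one-line change. A minimal, self-contained sketch (the toy nn.Linear stands in for the real Speech-Transformer model; this is not code from this repository):

```python
import torch
import torch.nn as nn

# Toy stand-in for the Speech-Transformer model; only the wrapping matters here.
model = nn.Linear(320, 10)

if torch.cuda.device_count() > 1:
    # dim=0 (the default) means the batch dimension is split across the visible
    # GPUs and the per-GPU outputs are gathered back along that same dimension.
    model = nn.DataParallel(model, dim=0)
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(18, 320)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)    # inputs are scattered, outputs gathered on the default device
print(out.shape)  # torch.Size([18, 10])
```

The forward and backward passes stay unchanged; DataParallel handles the scatter and gather internally.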
chenpe32cp commented 5 years ago

After adding DataParallel following instructions I found online, I get the error "Gather got an input of invalid size: got [10,35,9961], but expected [10,26,9961]". I haven't been able to find the cause; do you know roughly what might lead to this?

kaituoxu commented 5 years ago

@chenpe32cp Did you set batch_first?

chenpe32cp commented 5 years ago

Where should the batch_first parameter be set?


kaituoxu commented 5 years ago

My mistake, I meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel

chenpe32cp commented 5 years ago

> My mistake, I meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel

I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when the batch is split across GPUs, each GPU ends up with a different max_len, so the final gather fails. I haven't found a solution yet, though...

chenpe32cp commented 5 years ago


When using two GPUs, I traced the bug to DataParallel splitting the batch (batch_size=18) into two halves of 9, but the two GPUs end up with different sequence lengths, 35 and 27. How should this be fixed?

```
padded_input's shape: torch.Size([18, 292, 320])
input_lengths's shape: torch.Size([18])
padded_target's shape: torch.Size([18, 34])
max_len: 35 pad: torch.Size([9, 35])
max_len: 35 pad: torch.Size([9, 35])
max_len: 27 pad: torch.Size([9, 27])
max_len: 27 pad: torch.Size([9, 27])
Traceback (most recent call last):
  File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 168, in <module>
    main(args)
  File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 162, in main
    solver.train()
  File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 82, in train
    tr_avg_loss = self._run_one_epoch(epoch)
  File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 169, in _run_one_epoch
    pred, gold = self.model(padded_input, input_lengths, padded_target)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [9, 35, 9961], but expected [9, 27, 9961] (gather at torch/csrc/cuda/comm.cpp:183)
```
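For context on the error: nn.DataParallel gathers the per-GPU outputs by concatenating them along the batch dimension, which requires every other dimension to match. A CPU-only reproduction using the shapes from the traceback above (purely illustrative, not code from the repository):

```python
import torch

# Each replica pads its decoder output to its own local max_len (35 vs 27),
# so the gather step cannot concatenate the two outputs along the batch dim.
out_gpu0 = torch.randn(9, 35, 9961)  # replica 0: padded to max_len=35
out_gpu1 = torch.randn(9, 27, 9961)  # replica 1: padded to max_len=27

try:
    torch.cat([out_gpu0, out_gpu1], dim=0)  # same shape requirement as DataParallel's gather
except RuntimeError as e:
    print(e)  # sizes must match except in the concatenation dimension
```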

counter0 commented 5 years ago

I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.
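For anyone looking into the Horovod route: a minimal sketch of how a PyTorch training script is usually adapted for it (this is not counter0's code; the model and data are toy stand-ins, and only the hvd.* calls are the Horovod-specific part):

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()  # one process per GPU, launched with horovodrun / mpirun
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = nn.Linear(320, 10).to(device)  # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers at every optimizer step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Give each worker its own shard of the data.
dataset = torch.utils.data.TensorDataset(torch.randn(64, 320), torch.randint(0, 10, (64,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```

A two-GPU run would then be launched with something like `horovodrun -np 2 python train.py`; each worker keeps its own per-GPU batch size, so the effective batch size scales with the number of workers.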

chenpe32cp commented 5 years ago

> I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.

Could you share your multi-GPU training code so I can use it as a reference? I've been debugging for a long time and still haven't gotten it to work.

kaituoxu commented 5 years ago

> I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when the batch is split across GPUs, each GPU ends up with a different max_len, so the final gather fails. I haven't found a solution yet, though...

Here is an example solution: https://github.com/kaituoxu/Listen-Attend-Spell/blob/master/src/models/encoder.py#L34-L42
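For readers who do not want to open the link: the linked encoder appears to follow the standard DataParallel recipe from the PyTorch FAQ, namely padding each replica's output back to the length of the full (pre-scatter) batch so that the gathered tensors have identical shapes. A rough sketch of that idea (illustrative only, not copied from Listen-Attend-Spell or this repository):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class Encoder(nn.Module):
    def __init__(self, input_size=320, hidden_size=256):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, padded_input, input_lengths):
        # padded_input: (batch, T, D). T is the global max length, because
        # DataParallel scatters a tensor that was padded before the split.
        total_length = padded_input.size(1)
        packed = pack_padded_sequence(padded_input, input_lengths.cpu(),
                                      batch_first=True,
                                      enforce_sorted=False)  # needs PyTorch >= 1.1
        output, hidden = self.rnn(packed)
        # Pad back to total_length, not to the replica-local max length, so
        # every GPU returns a tensor with the same time dimension.
        output, _ = pad_packed_sequence(output, batch_first=True,
                                        total_length=total_length)
        return output, hidden

enc = Encoder()
x = torch.randn(4, 50, 320)
lengths = torch.tensor([50, 42, 30, 7])
out, _ = enc(x, lengths)
print(out.shape)  # torch.Size([4, 50, 256])
```

The same principle would apply to the decoder targets here: if the padded length is derived from a quantity that is identical on every replica (for example, the size of the already-padded target tensor rather than the per-shard maximum of the individual lengths), the gathered shapes match.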

xdcesc commented 5 years ago

> I used a different approach for multi-GPU training: Horovod. You could look into it; the code changes required are small.

Could you please share the code?