j-river / svtr-pytorch

pytorch version of svtr model

Error when training the model with 2 GPUs #2

Open apple2333cream opened 1 year ago

apple2333cream commented 1 year ago

  File "/home/wzp/project/ppocr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wzp/project/torch/CRNN_svtr/lib/models/crnn_svrt.py", line 205, in forward
    attn += self.mask
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

j-river commented 1 year ago

Multi-GPU training works fine as well. Check whether there is a problem in the way your multi-GPU training code is written.

apple2333cream commented 1 year ago

OK, thanks a lot for the answer! The problem is solved.

apple2333cream commented 1 year ago

One more question: do you use both CTC loss and SAR loss, as PaddleOCR does? How is the SAR loss part implemented?

j-river commented 1 year ago

I only used CTC loss during training, and the results are quite good. What form does SAR loss take? I haven't used it yet.

apple2333cream commented 1 year ago

PaddleOCR's SVTR uses SAR loss + CTC loss. SAR loss is really just CE, but I haven't figured out how it aligns the prediction with the label before computing CE...

j-river commented 1 year ago

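As for the alignment, a simple approach might be to compute the CE loss only when predict and label have the same length, and to set the loss to 0 when they differ. I haven't tested how well that works, though. Alternatively, you could open an issue on PaddleOCR to find out exactly how they do it.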
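
The idea in this comment could be sketched roughly as follows. This is only one possible reading of the suggestion, not PaddleOCR's SAR loss and not code from this repository: the function name `length_matched_ce`, the greedy CTC collapse used to obtain the predicted sequence, and the choice of `blank=0` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def length_matched_ce(logits, labels, blank=0):
    """Hypothetical helper: CE loss for one sample, or 0 if lengths differ.

    logits: (T, C) per-frame class scores for a single sample.
    labels: (L,) long tensor of target class indices (no blanks).
    """
    # Greedy CTC path: most likely class at every frame.
    path = logits.argmax(dim=1)
    # Keep a frame if it is non-blank and differs from the previous frame.
    prev = torch.cat([path.new_full((1,), -1), path[:-1]])
    keep = (path != blank) & (path != prev)

    # Lengths disagree (or nothing to compare): contribute a zero loss
    # that is still connected to the graph.
    if labels.numel() == 0 or int(keep.sum()) != labels.numel():
        return logits.sum() * 0.0

    # Lengths agree: compare the kept frames with the labels one by one.
    return F.cross_entropy(logits[keep], labels)


# Usage on a toy sample whose greedy path is [1, 1, blank, 2, 2].
toy_path = torch.tensor([1, 1, 0, 2, 2])
toy_logits = torch.full((5, 4), -5.0)
toy_logits[torch.arange(5), toy_path] = 5.0
loss_same_len = length_matched_ce(toy_logits, torch.tensor([1, 2]))
loss_diff_len = length_matched_ce(toy_logits, torch.tensor([1, 2, 3]))
```

One caveat with this reading: samples whose predicted length is wrong contribute no gradient at all, so early in training most of the batch may be ignored.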

wzl639 commented 1 year ago

  File "/home/wzp/project/ppocr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wzp/project/torch/CRNN_svtr/lib/models/crnn_svrt.py", line 205, in forward
    attn += self.mask
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I ran into the same problem. What causes it, and how can it be fixed? Any pointers would be appreciated. Thanks.

fourierer commented 9 months ago

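Hello, I'm getting the same error. My multi-GPU training code is written as shown in the image. I found that with x = torch.rand([4, 3, 32, 640]).to(to_use_device), where the first (batch) dimension is 4, the error is raised, but if I change the batch size to 1 there is no error. Do you know what is going on? [image]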
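
A possible explanation for the batch-size dependence, sketched under the assumption that `nn.DataParallel` divides the input along dimension 0 in the same way `torch.chunk` does: a batch of 4 is split across both GPUs, while a batch of 1 yields a single piece that stays on the first GPU, so nothing ever runs on cuda:1. The helper `split_sizes` is hypothetical and runs on CPU; it only mimics the split and does not call `nn.DataParallel`.

```python
import torch


def split_sizes(batch, num_devices):
    """Hypothetical helper: per-device batch sizes for a chunk-style split."""
    x = torch.rand([batch, 3, 32, 640])
    # torch.chunk may return fewer pieces than requested for small batches.
    pieces = torch.chunk(x, num_devices, dim=0)
    return [p.shape[0] for p in pieces]


print(split_sizes(4, 2))  # [2, 2]: both devices receive data
print(split_sizes(1, 2))  # [1]: only the first device receives data
```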

fourierer commented 9 months ago

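Here is how I solved it. After wrapping the model in nn.DataParallel, the model parameters all end up on cuda:0. But because the data is distributed across the GPUs for parallel execution, when the data on cuda:1 passes through the model, some of the model's parameters are still on cuda:0, which triggers the error. You can get the index of the GPU that the data is on and then move the model parameters to that GPU on the fly, so that the parameters and the data are computed on the same GPU. The code is as follows (corrections are welcome): [image]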
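
Since the code in the image is not available here, the following is a minimal sketch of the same idea rather than the commenter's actual code. It assumes that `self.mask` in the traceback is a plain tensor attribute, which is not confirmed by the thread; the class `MaskedAttention` and its shapes are hypothetical. Two options are shown together: registering the mask as a buffer, which `nn.DataParallel` copies to each replica along with the parameters, and moving the mask to the device of the incoming data inside `forward`.

```python
import torch
import torch.nn as nn


class MaskedAttention(nn.Module):
    """Hypothetical stand-in for the module that owns `self.mask`."""

    def __init__(self, seq_len):
        super().__init__()
        mask = torch.zeros(1, 1, seq_len, seq_len)
        # Option A: a buffer follows the module in .to() / .cuda() and is
        # copied to every replica, unlike a plain tensor attribute.
        self.register_buffer("mask", mask)

    def forward(self, attn):
        # Option B: move the mask to whatever device the data is on.
        # With Option A in place this is a no-op, but it is harmless.
        return attn + self.mask.to(attn.device)


# Usage (runs on CPU; on a multi-GPU machine the module could be wrapped
# in nn.DataParallel after being moved to the first device).
layer = MaskedAttention(seq_len=4)
out = layer(torch.ones(2, 3, 4, 4))
```

Either option alone should be enough in this sketch; the buffer approach avoids a device transfer on every forward pass.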