DeepFM multi-GPU error - Githubissues

GodeWithWind commented 1 year ago

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! Here is the code:

model = DeepFM(linear_feature_columns=linear_feature_columns, dnn_feature_columns=dnn_feature_columns, task='binary', device=device, gpus=[0, 1], dnn_dropout=0.5)

zanshuxun commented 1 year ago

Could you share the full stack trace of the error?

zanshuxun commented 1 year ago

Error stack trace


Root cause analysis

This bug was introduced by the commit https://github.com/shenweichen/DeepCTR-Torch/commit/2d56269259197ad67360a21fce0477c57b454818 , which we released to fix https://github.com/shenweichen/DeepCTR-Torch/issues/192 . At the time we considered that sparse_embedding_list might be empty, so we changed sparse_embedding_list[0].device to self.device. In multi-GPU training, however, self.device refers only to gpus[0], so linear_logit and sparse_feat_logit end up on different GPUs, which triggers the error above.
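The failure mode described above can be illustrated with a minimal sketch. It is not the DeepCTR-Torch source: the class `LinearPart` and its two forward methods are hypothetical names, and the only assumption is the general `nn.DataParallel` behaviour that parameters and inputs are placed on each replica's GPU while a plain attribute such as `self.device` keeps the value it was given at construction.

```python
import torch
import torch.nn as nn


class LinearPart(nn.Module):
    """Hypothetical stand-in for the linear part of a CTR model."""

    def __init__(self, dense_dim, device="cpu"):
        super().__init__()
        # Plain attribute fixed at construction time. In a multi-GPU run
        # this would be gpus[0] for every replica.
        self.device = device
        self.weight = nn.Parameter(torch.randn(dense_dim, 1))

    def forward_buggy(self, X):
        # The buffer is always allocated on the fixed device.
        linear_logit = torch.zeros([X.shape[0], 1], device=self.device)
        # On a replica running on cuda:1, X and self.weight live on cuda:1
        # while linear_logit is still on cuda:0, so this addition raises
        # "Expected all tensors to be on the same device".
        return linear_logit + X.matmul(self.weight)

    def forward_fixed(self, X):
        # The buffer follows the input, so it is on the replica's own GPU.
        linear_logit = torch.zeros([X.shape[0], 1], device=X.device)
        return linear_logit + X.matmul(self.weight)
```

On a single device both methods behave identically, which is why the problem only shows up once `gpus` contains more than one entry.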

Solution

In multi-GPU training, there are two ways to find out which GPU the current replica is running on:

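1. Get it from the input data, after first checking that the input features are not empty (see https://github.com/shenweichen/DeepCTR-Torch/commit/80fb34430036bfa81174ae461b36434cab11bea1 for the code).
2. Get it from a registered model parameter, for example self.weight.device.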

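However, when the input dense_feature_columns is empty (for example in the AFM model, or whenever the user does not use dense features), the self.weight parameter is not created, so its existence has to be checked before it is used.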
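The two options, together with the emptiness checks they require, could be sketched roughly as follows. The helper names `device_from_inputs` and `device_from_module` and the `fallback` argument are hypothetical and are not taken from the linked commit; what to do when nothing is available to inspect is an assumption made here for illustration.

```python
import torch
import torch.nn as nn


def device_from_inputs(tensors, fallback):
    """Option 1: use the device of the first input tensor, if there is one."""
    # The list may be empty, so it must be checked before indexing.
    if len(tensors) > 0:
        return tensors[0].device
    return torch.device(fallback)


def device_from_module(module, fallback):
    """Option 2: use the device of a registered parameter, if there is one."""
    # A module without parameters (e.g. no dense features, so no weight)
    # yields nothing here, so a default is needed.
    first_param = next(module.parameters(), None)
    if first_param is not None:
        return first_param.device
    return torch.device(fallback)
```

For example, `device_from_inputs(sparse_embedding_list, "cpu")` returns the device of the first embedding tensor, or the fallback when the list is empty.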

We plan to adopt option 1 and will merge this change in a subsequent release. Thank you for raising the issue.

iamsile commented 10 months ago

@zanshuxun Hello, when will this change be released?

shenweichen / DeepCTR-Torch

DeepFM multi-GPU error #261
