motefly / DeepGBM

SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks

Online training and prediction for binary classification #23

Closed: sm807983636 closed this issue 4 years ago

sm807983636 commented 4 years ago

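When I run the online training and prediction task for binary classification, I get the following error: RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered. From what I found, this means the tensor passed to BCELoss has values outside the range [0, 1]. I tried several fixes I found online, and I modified both `outputs` in the TrainWithLog function in https://github.com/motefly/DeepGBM/blob/master/experiments/helper.py and `out` in the true_loss function in https://github.com/motefly/DeepGBM/blob/master/experiments/models/components.py, but the run still fails with the same error. Could you tell me what is actually going wrong?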

sm807983636 commented 4 years ago

The full error output is as follows:

C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [215,0,0] Assertion `input >= 0. && input <= 1.` failed.
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [216,0,0] Assertion `input >= 0. && input <= 1.` failed.
...
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [190,0,0] Assertion `input >= 0. && input <= 1.` failed.
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [191,0,0] Assertion `input >= 0. && input <= 1.` failed.
Traceback (most recent call last):
  File "F:/Data-group/experiments/online_main.py", line 493, in <module>
    deepgbm_online()
  File "F:/Data-group/experiments/online_main.py", line 287, in deepgbm_online
    fitted_model, opt, metric = train_DEEPGBM(args, num_data, cate_data, plot_title, key="")
  File "F:\Data-group\experiments\train_models.py", line 171, in train_DEEPGBM
    args.emb_epoch, args.batch_size, n_output, key+"emb-")
  File "F:\Data-group\experiments\helper.py", line 139, in TrainWithLog
    loss_val = model.true_loss(outputs[0], targets)
  File "F:\Data-group\experiments\models\components.py", line 272, in true_loss
    return self.criterion(out, target)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\loss.py", line 498, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2065, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered
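For readers unfamiliar with the assertion in the log above: binary cross-entropy takes the logarithm of both p and 1 - p, so it is only defined for inputs in [0, 1]. The following is a minimal pure-Python sketch of that constraint, not DeepGBM or PyTorch code; the names `sigmoid` and `bce` are illustrative. Whether the DeepGBM output in this issue was a raw score or a probability is not established in the thread, so the sigmoid step is shown only as the usual way to map a raw score into (0, 1).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; maps any real score into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p and one label y (0 or 1).

    Rejects p outside [0, 1] (NaN is rejected too, because every
    comparison with NaN is false), mirroring the range check in the log.
    """
    if not (0.0 <= p <= 1.0):
        raise ValueError(f"BCE input must lie in [0, 1], got {p!r}")
    # Clamp away from exactly 0 and 1 so that log() stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


if __name__ == "__main__":
    raw_score = 1.7  # hypothetical raw model output, not taken from the issue
    try:
        bce(raw_score, 1.0)
    except ValueError as err:
        print("rejected:", err)
    print("after sigmoid:", bce(sigmoid(raw_score), 1.0))
```

In PyTorch itself, `torch.nn.BCEWithLogitsLoss` combines the sigmoid and the loss in one step, which is a common alternative when the model emits raw scores.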

sm807983636 commented 4 years ago

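I would also like to ask: where in the code is the statement that prints the vector on the line just below "Model Interpreting..." in the output? When the run succeeds, every element of that vector is greater than 0, but when the run fails, the vector contains 0 elements.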

motefly commented 4 years ago

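You can try stepping through with pdb, or print the tensors that go into the BCELoss computation, to check whether they really fall outside the interval as the error reports. As for the print statement, you can locate it by searching the code; it is probably at https://github.com/motefly/DeepGBM/blob/master/tree_model_interpreter.py#L110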
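The check suggested above could be sketched roughly as follows. This is a hypothetical helper, not part of DeepGBM: `describe_range` is an illustrative name, and it works on a plain list of floats so that it can be run without a GPU. It also counts NaN values, since NaN fails the `input >= 0. && input <= 1.` assertion just as an out-of-range value does.

```python
import math


def describe_range(values):
    """Summarise whether all values are valid BCELoss inputs, i.e. in [0, 1]."""
    values = list(values)
    nan_count = sum(1 for v in values if math.isnan(v))
    finite = [v for v in values if not math.isnan(v)]
    below = sum(1 for v in finite if v < 0.0)
    above = sum(1 for v in finite if v > 1.0)
    return {
        "count": len(values),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "nan": nan_count,
        "below_0": below,
        "above_1": above,
        "ok": nan_count == 0 and below == 0 and above == 0,
    }


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(describe_range([0.1, 0.9, 1.3, float("nan")]))
```

With a PyTorch tensor, the list could be obtained just before the loss call in TrainWithLog with something like `describe_range(outputs[0].detach().cpu().flatten().tolist())`. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1`, or running the same step on the CPU, may also make the failing call easier to pinpoint, because device-side asserts are otherwise reported asynchronously.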

sm807983636 commented 4 years ago

Hello, after changing nslices the code ran successfully. Is this parameter related to the size of the dataset?

motefly commented 4 years ago

This parameter is used for grouping; the number of groups should be adjusted according to the number of trees.
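To illustrate why the number of groups has to fit the number of trees, here is a minimal sketch that splits tree indices into near-equal groups. It is an assumption-laden illustration, not DeepGBM's actual grouping code: `split_into_groups` is a hypothetical function, and the real interpreter may group trees differently. The point is only that asking for more groups than there are trees leaves some groups empty.

```python
def split_into_groups(n_trees, n_slices):
    """Split tree indices 0..n_trees-1 into n_slices contiguous groups.

    Group sizes differ by at most one; if n_slices > n_trees, the
    trailing groups are empty.
    """
    if n_slices <= 0:
        raise ValueError("n_slices must be positive")
    base, extra = divmod(n_trees, n_slices)
    groups = []
    start = 0
    for i in range(n_slices):
        # The first `extra` groups take one additional tree each.
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


if __name__ == "__main__":
    print([len(g) for g in split_into_groups(10, 3)])  # group sizes: [4, 3, 3]
    print([len(g) for g in split_into_groups(2, 4)])   # group sizes: [1, 1, 0, 0]
```

Whether empty groups are what produced the 0 elements seen under "Model Interpreting..." is not confirmed in this thread; that is one plausible reading, consistent with the run succeeding once nslices was changed.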

sm807983636 commented 4 years ago

Thanks for the answer!