Online learning and prediction for binary classification

sm807983636 commented 4 years ago

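When running binary-classification online training and prediction, I got the following error: RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered. This reportedly means that a tensor passed to BCELoss fell outside the range [0, 1]. I searched online for fixes and modified both outputs in the TrainWithLog function in https://github.com/motefly/DeepGBM/blob/master/experiments/helper.py and out in the true_loss function in https://github.com/motefly/DeepGBM/blob/master/experiments/models/components.py, but the run still fails with the same error. What exactly is going wrong?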

sm807983636 commented 4 years ago

The full error output is as follows:

C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [215,0,0] Assertion input >= 0. && input <= 1. failed.
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [216,0,0] Assertion input >= 0. && input <= 1. failed.
...
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [190,0,0] Assertion input >= 0. && input <= 1. failed.
C:/w/1/s/windows/pytorch/aten/src/THCUNN/BCECriterion.cu:42: block: [0,0,0], thread: [191,0,0] Assertion input >= 0. && input <= 1. failed.
Traceback (most recent call last):
  File "F:/Data-group/experiments/online_main.py", line 493, in <module>
    deepgbm_online()
  File "F:/Data-group/experiments/online_main.py", line 287, in deepgbm_online
    fitted_model, opt, metric = train_DEEPGBM(args, num_data, cate_data, plot_title, key="")
  File "F:\Data-group\experiments\train_models.py", line 171, in train_DEEPGBM
    args.emb_epoch, args.batch_size, n_output, key+"emb-")
  File "F:\Data-group\experiments\helper.py", line 139, in TrainWithLog
    loss_val = model.true_loss(outputs[0], targets)
  File "F:\Data-group\experiments\models\components.py", line 272, in true_loss
    return self.criterion(out, target)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\loss.py", line 498, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2065, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered
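
For context, the assertion in this log mirrors the definition of binary cross-entropy, which takes the logarithm of p and of 1 - p and is therefore only defined for predictions p in [0, 1]. The following is a minimal pure-Python sketch of that formula, not the DeepGBM or PyTorch implementation; the names bce and sigmoid are illustrative only.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1), numerically stable."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce(p, y):
    """Binary cross-entropy for one prediction p and one label y in {0, 1}."""
    # This comparison is also False when p is NaN, so NaN is rejected too.
    if not (0.0 <= p <= 1.0):
        raise ValueError(f"prediction {p!r} is outside [0, 1]")
    eps = 1e-12  # clamp to avoid log(0) at the boundaries
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(bce(sigmoid(2.0), 1))  # a logit squashed by sigmoid is always valid input
```

In PyTorch, applying torch.sigmoid to raw outputs before BCELoss, or passing raw outputs to torch.nn.BCEWithLogitsLoss, keeps the loss input in range. Neither helps if the outputs are already NaN, since NaN stays NaN and fails the same range check.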

sm807983636 commented 4 years ago

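I would also like to ask where in the code the statement is that prints the vector on the line below "Model Interpreting..." in the output. When the run succeeds, every element of that vector is greater than 0; when it fails, some elements are 0.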

motefly commented 4 years ago

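You can try stepping through with pdb, or printing the tensors that go into the BCELoss computation, to check whether they really fall outside the range as the error says. The print statement can be located by searching the code; it is probably https://github.com/motefly/DeepGBM/blob/master/tree_model_interpreter.py#L110 .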
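
As a rough sketch of the check suggested above, the helper below summarizes the values about to be passed to the loss, counting NaNs separately from out-of-range values because both fail the [0, 1] assertion. The helper name summarize_loss_input is hypothetical and not part of DeepGBM. Since CUDA device-side asserts are reported asynchronously, it is usually easier to run such a check on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1 set.

```python
import math

def summarize_loss_input(values):
    """Summarize a flat list of floats that is about to be passed to a BCE loss."""
    nan_count = sum(1 for v in values if math.isnan(v))
    non_nan = [v for v in values if not math.isnan(v)]
    out_of_range = sum(1 for v in non_nan if v < 0.0 or v > 1.0)
    return {
        "count": len(values),
        "nan": nan_count,
        "out_of_range": out_of_range,
        "min": min(non_nan) if non_nan else None,
        "max": max(non_nan) if non_nan else None,
    }

# With a PyTorch tensor `out`, a flat list can be obtained via
# out.detach().cpu().flatten().tolist()
report = summarize_loss_input([0.2, 0.9, float("nan"), 1.3])
print(report)
```

A non-zero "nan" count points at the computation upstream of the loss rather than at a missing sigmoid.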

sm807983636 commented 4 years ago

Hello, after I changed nslices it ran successfully. Is this parameter related to the size of the dataset?

motefly commented 4 years ago

This parameter is used for grouping; the number of groups should be adjusted according to the number of trees.
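
To illustrate why a group count might need to track the number of trees, here is a hypothetical sketch. It does not reproduce DeepGBM's grouping code, and split_into_groups is an invented helper: it splits tree indices into nslices nearly equal contiguous groups and shows that some groups come out empty once nslices exceeds the number of trees.

```python
def split_into_groups(n_trees, nslices):
    """Split tree indices 0..n_trees-1 into nslices contiguous, nearly equal groups."""
    base, extra = divmod(n_trees, nslices)
    groups, start = [], 0
    for i in range(nslices):
        # The first `extra` groups take one additional tree each.
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

sizes_ok = [len(g) for g in split_into_groups(10, 5)]
sizes_bad = [len(g) for g in split_into_groups(3, 5)]
print(sizes_ok)   # [2, 2, 2, 2, 2]
print(sizes_bad)  # [1, 1, 1, 0, 0]
```

An empty group is one plausible way for zeros to appear in a printed per-group vector, though the thread does not confirm that this is what happened here.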

sm807983636 commented 4 years ago

Thanks for the answer!

motefly / DeepGBM

Online learning and prediction for binary classification #23