median-research-group / LibMTL

A PyTorch Library for Multi-Task Learning
MIT License

Two problems when switching the loss function #30

Closed Luobupi closed 1 year ago

Luobupi commented 1 year ago

Hello, I have two questions about switching the loss function that I'd like to ask about. Thanks!

I am currently using version 1.1.6.

  1. When switching the loss function from CELoss to KLDivLoss, a dimension mismatch error occurs:

    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 461, in train
     train_losses[tn] = self._compute_loss(
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 304, in _compute_loss
     train_losses = self.losses[task_name].update_loss(preds[task_name], gts)
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/loss/abstract_loss.py", line 59, in update_loss
     loss = self.compute_loss(pred, gt)
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/loss/KLDivLoss.py", line 19, in compute_loss
     loss = self.loss_fn(pred, gt)
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 465, in forward
     return F.kl_div(input, target, reduction=self.reduction, log_target=self.log_target)
    File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 2916, in kl_div
     reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
    RuntimeError: The size of tensor a (64) must match the size of tensor b (31) at non-singleton dimension 1
  2. When replacing the original loss function with a new one, an in-place operation error appears, even though the code does not seem to contain any in-place operations; if I keep computing cross-entropy instead, there is no problem.

Modified code

decoder_soft_loss = nn.KLDivLoss(reduction="batchmean")(
                                 nn.functional.log_softmax(unlearned_decoder / 10.0, dim=1),
                                 nn.functional.softmax(init_decoder / 10.0, dim=1))

Here unlearned_decoder is the pred output by the model, and init_decoder is the pred output by the initialized model.
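For reference, a self-contained sketch of this soft loss (with hypothetical random tensors standing in for the two decoder outputs, and shapes assumed to match Office-31: batch size 64, 31 classes). In distillation-style losses the teacher side is typically detached so gradients only flow through the student; that detail is an assumption here, not taken from the thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

temperature = 10.0
batch_size, num_classes = 64, 31  # hypothetical Office-31 shapes

# Stand-ins for the two decoder outputs from the thread (random here):
unlearned_decoder = torch.randn(batch_size, num_classes, requires_grad=True)  # model pred
init_decoder = torch.randn(batch_size, num_classes)                           # initial-model pred

# KLDivLoss expects log-probabilities as input and probabilities as target.
# The teacher distribution is detached so gradients flow only through the student.
decoder_soft_loss = nn.KLDivLoss(reduction="batchmean")(
    F.log_softmax(unlearned_decoder / temperature, dim=1),
    F.softmax(init_decoder.detach() / temperature, dim=1))

decoder_soft_loss.backward()
```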

Error message

  File "train_office.py", line 12, in <module>
    Officemodel.kd_train()
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 712, in kd_train
    w = self.model.backward(train_losses, **weighting_arg)
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/weighting/DWA.py", line 40, in backward
    loss.backward()
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 31]], which is output 0 of AsStridedBackward0, is at version 5; expected version 4 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Baijiong-Lin commented 1 year ago
  1. nn.CrossEntropyLoss() and nn.KLDivLoss() have different requirements for the shape of the target/gt. For example, if the model's prediction has shape 64x31, where 64 is the batch dimension, then nn.CrossEntropyLoss() expects a target of shape 64, while nn.KLDivLoss() expects one of shape 64x31. For details, see the explanations and examples for nn.CrossEntropyLoss() and nn.KLDivLoss() in the official PyTorch documentation.
  2. From the information you provided, I couldn't find where the bug is.
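A minimal sketch of the shape difference described above, using the same 64x31 example (random tensors stand in for real predictions and labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, num_classes = 64, 31
pred = torch.randn(batch_size, num_classes)  # model prediction: 64 x 31

# nn.CrossEntropyLoss: target is a vector of class indices, shape (64,)
ce_target = torch.randint(0, num_classes, (batch_size,))
ce_loss = nn.CrossEntropyLoss()(pred, ce_target)

# nn.KLDivLoss: target is a full distribution per sample, shape (64, 31),
# and the input must be log-probabilities
kl_target = F.softmax(torch.randn(batch_size, num_classes), dim=1)
kl_loss = nn.KLDivLoss(reduction="batchmean")(F.log_softmax(pred, dim=1), kl_target)
```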
Luobupi commented 1 year ago

I have tried several different ways of computing the loss, and they all fail at loss.backward() in DWA, saying that values in the [512, 31] tensor were modified by an in-place operation during backpropagation, but my loss function does not contain += or any similar in-place operation.
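As an aside (not from the thread), PyTorch's anomaly detection can help localize such failures by pointing at the forward-pass operation whose saved tensor was modified. A minimal sketch reproducing the same class of error:

```python
import torch

# With anomaly detection enabled, the RuntimeError also reports the
# forward-pass operation whose saved tensor was modified in place.
torch.autograd.set_detect_anomaly(True)

# Minimal reproduction: an activation is saved for backward, then
# modified in place before backward() runs.
x = torch.randn(4, 512, requires_grad=True)
w = torch.randn(512, 31)
h = x @ w
y = h * h        # multiplication saves h for its backward pass
h.add_(1.0)      # in-place change bumps h's version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print("caught:", e)  # "... modified by an inplace operation ..."
```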

Baijiong-Lin commented 1 year ago

"The [512, 31] tensor" — which tensor is that?

Luobupi commented 1 year ago

The decoder is defined as a linear layer. On the Office-31 dataset it maps a 512-dimensional vector to 31 classes, which amounts to multiplying the encoder's output by a [512, 31] weight matrix. During backpropagation, the model updates the weights of this matrix.

Baijiong-Lin commented 1 year ago

It is hard to locate the bug from this description alone; if possible, please provide code that reproduces it.

Luobupi commented 1 year ago

Thank you for your help. After further debugging, I found that the error was caused by an exception in my code when loading model parameters; it is now resolved. Sorry for taking up your time! Best wishes for your research and work!