Hi, thanks for your wonderful project. There are some questions I want to confirm before applying the MGDA weighting method; could you please answer them? Thanks!

- What is `self.rep_tasks`?
- What is `rep_grad`?

For the two questions above, my own guesses are: the first is the representation generated by the representation layer (shared parameters), and the second is a flag for whether to use the gradients of the representations.
In that case, my third question is: what is the purpose of the variable `rep_grad` in MGDA?
Is it used to implement MGDA-UB? I noticed that the gradients of `self.rep_tasks` are saved in `_compute_grad()` in `abstract_weighting.py`, which is why I made this assumption.
I'm a little confused about these technical details; I hope you can help me. Thanks again!
You're right. `rep` denotes the representation generated by `self.encoder` (shared parameters). `rep_grad` means using the gradient of the representation, instead of the gradients of the shared parameters, to represent the task gradients; this is used to implement MGDA-UB. Please refer to the Docs and Section 3.3 of the MGDA paper for more details.
There is one more thing I want to ask: in the original implementation of MGDA, `model['rep'](images, mask)` sometimes appears inside the loop and sometimes outside it. Is there any difference? How does your implementation handle this?
There are two different cases in MTL, called single-input and multi-input problems. In a multi-input problem, each task has its own input data, so the `rep` differs across tasks. In a single-input problem, all tasks share the same `rep`, so we can compute the shared `rep` outside the task loop (as in the third figure you provided). You can learn more details from the Docs (especially Figure 1).
The second figure computes the gradients of the parameters in `model['rep']` (rather than of the representations) for each task. Note that those parameters are involved in the forward-loss-backward process of every task, so we need to forward `model['rep']` every time.
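To make the "forward every time" point concrete, here is a minimal PyTorch sketch of that pattern (module names and shapes are made up for illustration, not LibMTL's actual code): the shared module is forwarded inside the task loop, so each task's backward pass reaches the shared parameters.

```python
import torch

# Hypothetical toy setup: a shared encoder and two task heads.
encoder = torch.nn.Linear(4, 8)
heads = {'t1': torch.nn.Linear(8, 1), 't2': torch.nn.Linear(8, 1)}
x = torch.rand(16, 4)
targets = {'t1': torch.rand(16, 1), 't2': torch.rand(16, 1)}

grads = {}
for task, head in heads.items():
    encoder.zero_grad()
    # The shared encoder must be forwarded inside the loop so that each
    # task's backward pass reaches the shared parameters.
    rep = encoder(x)
    loss = torch.nn.functional.mse_loss(head(rep), targets[task])
    loss.backward()
    # Collect this task's gradient of the shared parameters.
    grads[task] = torch.cat([p.grad.flatten().clone()
                             for p in encoder.parameters()])
```

Each entry of `grads` is one task's gradient vector over the shared parameters, which is what MGDA takes as input in this variant.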
The first figure computes the gradient of the shared representation for each task. Note that the computational graph is split into two parts by copying `rep` into another variable that requires grad (`rep_variable`), so `loss.backward()` does not compute the gradients of the parameters in `model['rep']`. By putting `rep_variable` into the forward process of each task, we can compute its gradient for each task (this is similar to putting `model['rep']` into the forward process every time, as in the second figure). Note that the gradient of `rep` must be set to zero before the next backward operation. This is also how `rep_grad` is implemented in our `LibMTL`. You can learn more details about this from the Docs.
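A minimal sketch of this pattern (hypothetical names and shapes, not LibMTL's actual code): the representation is detached via `.data.clone()`, each task's backward pass then only produces a gradient on the detached representation, and that gradient is zeroed before the next task.

```python
import torch

# Hypothetical toy setup: a shared encoder and two task heads.
encoder = torch.nn.Linear(4, 8)
heads = {'t1': torch.nn.Linear(8, 1), 't2': torch.nn.Linear(8, 1)}
x = torch.rand(16, 4)
targets = {'t1': torch.rand(16, 1), 't2': torch.rand(16, 1)}

rep = encoder(x)                  # forwarded once, outside the task loop
rep_variable = rep.data.clone()   # splits the computational graph
rep_variable.requires_grad = True

rep_grads = {}
for task, head in heads.items():
    loss = torch.nn.functional.mse_loss(head(rep_variable), targets[task])
    loss.backward()
    rep_grads[task] = rep_variable.grad.clone()
    # Reset the gradient before the next task's backward pass,
    # otherwise gradients accumulate across tasks.
    rep_variable.grad.zero_()

# The encoder's parameters received no gradients:
print(all(p.grad is None for p in encoder.parameters()))  # True
```

The per-task gradients of the representation in `rep_grads` are what MGDA-UB feeds into the min-norm solver instead of the full shared-parameter gradients.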
Thanks for your quick and rigorous reply! I have understood the first part, but I am still confused about the second. I can understand what you're saying about single-input and multi-input problems; maybe what you're describing is hard sharing vs. soft sharing. But I think we can't have multi-input (figures 1 and 3) and single-input (figure 2) in the same problem.
I guess it may be a bug; see this issue from the original MGDA repo: https://github.com/isl-org/MultiObjectiveOptimization/issues/12
> I can understand what you're saying about single-input and multi-input problems, maybe what you're talking about is hard sharing vs. soft sharing.
No. Single-input and multi-input problems do not depend on which MTL architecture you use. For example, given a face image as input, we may want to predict age and gender simultaneously; that is a single-input MTL problem. Training MNIST and CIFAR10 together is a multi-input MTL problem. Both hard sharing and soft sharing patterns can handle single-input and multi-input problems.
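A toy sketch of the two cases (hypothetical modules, task names, and shapes), showing where the shared representation is computed in each:

```python
import torch

encoder = torch.nn.Linear(8, 8)   # shared encoder; dimensions are made up

# Single-input: one batch (e.g. face images) feeds every task, so the
# shared representation is computed once, outside the task loop.
heads = {'age': torch.nn.Linear(8, 1), 'gender': torch.nn.Linear(8, 2)}
faces = torch.rand(4, 8)
rep = encoder(faces)                                   # computed once
single_out = {t: head(rep) for t, head in heads.items()}

# Multi-input: each task has its own batch (e.g. MNIST vs. CIFAR10), so
# each task gets its own representation, inside the task loop.
inputs = {'mnist': torch.rand(4, 8), 'cifar10': torch.rand(4, 8)}
task_heads = {'mnist': torch.nn.Linear(8, 10),
              'cifar10': torch.nn.Linear(8, 10)}
multi_out = {t: task_heads[t](encoder(x_t))            # computed per task
             for t, x_t in inputs.items()}
```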
> But I think we can't have multi-input (figures 1 and 3) and single-input (figure 2) in the same problem.
Actually, the implementation in the original MGDA repo only considers the single-input problem. The experiments in the MGDA paper are conducted on three single-input problems, i.e., MultiMNIST, CelebA, and Cityscapes.
The implementation of MGDA in `LibMTL` supports the multi-input case. You just need to set the argument `multi_input` to `True`.
> I guess it may be a bug, an issue from the original repo of MGDA: https://github.com/isl-org/MultiObjectiveOptimization/issues/12
I think this issue discusses a different problem. Specifically, the MGDA algorithm takes the task gradients as input and outputs a weight for each task, and those weights determine the update of the task-shared parameters. That issue is about whether the weights generated by MGDA should also affect the update of the task-specific parameters.
Thanks! I had just figured it out when you replied! Your thinking is quite clear. I have read some papers about MTL before, but I never saw the single-input vs. multi-input distinction (maybe I have read too little); I think it is indeed a significant one.
Sorry for my follow-up:
> Actually, the implementation of the original repo of MGDA only considers the single-input problem. The experiments in the MGDA paper are conducted on three single-input problems, i.e., MultiMNIST, CelebA, and Cityscapes.
If MGDA is single-input, I think it is equivalent to putting `model['rep'](images, mask)` inside or outside the loop in figures 1 and 3, is that right?
> Note that the computational graph is split into two parts by copying `rep` into another required-grad variable (`rep_variable`), thus `loss.backward()` does not calculate the gradients of parameters in `model['rep']`.
`loss.backward()` does not calculate the gradients of the parameters in `model['rep']`: is this because we only choose `rep_variable` to calculate the weights in the lines referenced below? Or do we block `model['rep']` in `loss.backward()`?
https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/train_multi_task.py#L136-L141
> If MGDA is single-input, I think it is equivalent to putting `model['rep'](images, mask)` inside or outside the loop in figure 1 and figure 3, is that right?
Yep. But it will incur unnecessary computational cost if you put it inside the loop.
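To illustrate the equivalence, here is a toy check (hypothetical modules and shapes, and assuming the shared module is deterministic, e.g. no dropout or batch-norm updates) that forwarding inside vs. outside the loop yields identical representation gradients:

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(4, 8)   # stands in for model['rep']
heads = {'t1': torch.nn.Linear(8, 1), 't2': torch.nn.Linear(8, 1)}
x = torch.rand(16, 4)             # the single shared input batch

def rep_grads(forward_inside_loop):
    """Gradient of the detached representation for each task."""
    grads = {}
    if not forward_inside_loop:
        shared = encoder(x).data.clone()   # forwarded once, outside the loop
    for task, head in heads.items():
        rep = encoder(x).data.clone() if forward_inside_loop else shared.clone()
        rep.requires_grad = True
        head(rep).sum().backward()
        grads[task] = rep.grad.clone()
    return grads

inside, outside = rep_grads(True), rep_grads(False)
print(all(torch.equal(inside[t], outside[t]) for t in heads))  # True
```

With the same input batch and deterministic shared weights, both placements produce the same representation and hence the same task gradients; forwarding inside the loop just repeats the same computation.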
> `loss.backward()` does not calculate the gradients of the parameters in `model['rep']`, is this because we just choose `rep_variable` to calculate the weight in the following refer lines? Or do we block `model['rep']` in `loss.backward()`? https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/train_multi_task.py#L136-L141
It is because `.data.clone()` in Line 122 and Line 125 splits the computational graph. You can try the example as follows.
```python
import torch

a = torch.rand(2, 3)
a.requires_grad = True
b = torch.sigmoid(a)
c = b.data.clone()
c.requires_grad = True
d = torch.relu(c).sum()
d.backward()
print(a.grad)
```
This example will print `None`. But if you use `c = b` instead of `c = b.data.clone(); c.requires_grad = True`, then it will output the gradient of `a`.
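For comparison, the full `c = b` variant looks like this; the graph stays connected, so `a.grad` is populated instead of `None`:

```python
import torch

a = torch.rand(2, 3)
a.requires_grad = True
b = torch.sigmoid(a)
c = b                      # no .data.clone(): the graph stays connected
d = torch.relu(c).sum()
d.backward()
print(a.grad)              # a (2, 3) tensor of gradients, not None
```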
Wonderful! This example clears up the confusion I've had these past few days. Thanks a lot!!!
Sorry, I have another question about the MGDA-UB implementation. Why do we only backward on the representation; is that OK? I wonder if it is fine without updating the gradients of the decoders.
I realized that the decoders are updated in `_compute_grad()`. Meanwhile, I checked the gradient-flow graph below, but how are the gradients of the decoders aggregated, as shown by the green lines?
I got it.