median-research-group / LibMTL

A PyTorch Library for Multi-Task Learning
MIT License

About MGDA implementation, some details I want to confirm. #22

Closed A11en0 closed 1 year ago

A11en0 commented 1 year ago

Hi, thanks for your wonderful project. There are some questions I want to confirm when applying the MGDA weighting method; could you please answer them? Thanks!

  1. What is self.rep_tasks?
  2. What is rep_grad?

For the above two questions, my guesses are: the first is the representation generated by the representation layer (the shared parameters), and the second is a flag for whether to use the gradients of the representations.

In that case, my third question is: what is the purpose of the variable rep_grad in MGDA?

Is it used to implement MGDA-UB? I noticed that the gradients of self.rep_tasks are saved in _compute_grad() in abstract_weighting.py, so I made this assumption.

I'm a little confused about these technical details; I hope you can help me. Thanks again!

A11en0 commented 1 year ago

[Three screenshots of the original MGDA implementation, referred to below as the first, second, and third figures]

There is one more thing I want to ask: in the original implementation of MGDA, model['rep'](images, mask) sometimes appears inside the task loop and sometimes outside it. Is there any difference? How does your implementation handle this?

Baijiong-Lin commented 1 year ago

Hi, thanks for your wonderful project. There are some questions I want to confirm when applying the MGDA weighting method; could you please answer them? Thanks!

  1. What is self.rep_tasks?
  2. What is rep_grad?

For the above two questions, my guesses are: the first is the representation generated by the representation layer (the shared parameters), and the second is a flag for whether to use the gradients of the representations.

In that case, my third question is: what is the purpose of the variable rep_grad in MGDA?

Is it used to implement MGDA-UB? I noticed that the gradients of self.rep_tasks are saved in _compute_grad() in abstract_weighting.py, so I made this assumption.

I'm a little confused about these technical details; I hope you can help me. Thanks again!

You're right. rep denotes the representation generated by self.encoder (the shared parameters). rep_grad means using the gradient of the representation, instead of the gradients of the shared parameters, to represent the task gradients, which is used to implement MGDA-UB. Please refer to the Docs and Section 3.3 of the MGDA paper for more details.
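To make the distinction concrete, here is a small sketch (the names are illustrative, not LibMTL's actual API) contrasting the two kinds of task gradients, assuming a shared encoder and two task heads:

```python
import torch
import torch.nn as nn

# Hypothetical two-task setup: a shared encoder and two task-specific heads.
torch.manual_seed(0)
encoder = nn.Linear(10, 16)
heads = {"t1": nn.Linear(16, 1), "t2": nn.Linear(16, 1)}
x = torch.randn(4, 10)

# rep_grad=False: each task's gradient w.r.t. the shared parameters.
param_grads = {}
for t, head in heads.items():
    loss = head(encoder(x)).mean()
    grads = torch.autograd.grad(loss, list(encoder.parameters()))
    param_grads[t] = torch.cat([g.flatten() for g in grads])

# rep_grad=True (MGDA-UB): each task's gradient w.r.t. the shared
# representation, which is much lower-dimensional for large encoders.
rep = encoder(x)
rep_grads = {}
for t, head in heads.items():
    loss = head(rep).mean()
    (g,) = torch.autograd.grad(loss, rep, retain_graph=True)
    rep_grads[t] = g.flatten()
```

Both dictionaries can then be fed to the min-norm solver; MGDA-UB just trades the parameter-space gradients for the cheaper representation-space ones.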

Baijiong-Lin commented 1 year ago

There is one more thing I want to ask: in the original implementation of MGDA, model['rep'](images, mask) sometimes appears inside the task loop and sometimes outside it. Is there any difference? How does your implementation handle this?

There are two different cases in MTL, called single-input and multi-input problems. In the multi-input case, each task has its own input data, so the rep differs across tasks. In the single-input case, all tasks share the same rep, so we can compute the shared rep outside the task loop (as in the third figure you provided). You can learn more from the Docs (especially Figure 1).
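A minimal sketch of the two loop structures (module names are made up for illustration):

```python
import torch
import torch.nn as nn

# Illustrative shared encoder and two task heads.
encoder = nn.Linear(6, 4)
heads = {"t1": nn.Linear(4, 1), "t2": nn.Linear(4, 1)}

# Single-input: all tasks see the same batch, so the shared rep is
# computed once, outside the task loop.
x = torch.randn(3, 6)
rep = encoder(x)
single_losses = {t: head(rep).mean() for t, head in heads.items()}

# Multi-input: each task has its own batch (possibly different sizes),
# so rep must be recomputed inside the task loop.
xs = {"t1": torch.randn(3, 6), "t2": torch.randn(5, 6)}
multi_losses = {t: head(encoder(xs[t])).mean() for t, head in heads.items()}
```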

The second figure computes the gradients of the parameters in model['rep'] (rather than of the representations) for each task. Since the parameters in model['rep'] are involved in the forward-loss-backward process of every task, we need to forward model['rep'] each time.

The first figure computes the gradients of the shared representation for each task. The computational graph is split into two parts by copying rep into another variable that requires grad (rep_variable), so loss.backward() does not compute the gradients of the parameters in model['rep']. By feeding rep_variable into each task's forward pass, we can compute its gradient per task (which is analogous to forwarding model['rep'] every time in the second figure). Note that the gradient of rep_variable must be zeroed before the next backward call. This is also how rep_grad is implemented in our LibMTL. You can learn more details from the Docs.
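The rep_variable trick can be sketched as follows (a minimal illustration with made-up module names, not the library's exact code):

```python
import torch
import torch.nn as nn

# encoder stands in for model['rep']; heads are the task-specific parts.
encoder = nn.Linear(8, 4)
heads = [nn.Linear(4, 1), nn.Linear(4, 1)]
x = torch.randn(2, 8)

rep = encoder(x)
rep_variable = rep.detach().clone()   # cut the graph at the representation
rep_variable.requires_grad_(True)

task_grads = []
for head in heads:
    loss = head(rep_variable).mean()
    loss.backward()                        # backprop stops at rep_variable
    task_grads.append(rep_variable.grad.clone())
    rep_variable.grad.zero_()              # reset before the next task

# The encoder's parameters never receive gradients from these backwards.
print(encoder.weight.grad)                 # prints None
```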

A11en0 commented 1 year ago

Thanks for your quick and rigorous reply! I understand the first part now, but I am still confused about the second part. I understand what you're saying about single-input and multi-input problems; maybe what you mean is hard sharing vs. soft sharing? But I don't think we can have both multi-input (figures 1 and 3) and single-input (figure 2) in the same problem.

I guess it may be a bug; there is a related issue on the original MGDA repo: https://github.com/isl-org/MultiObjectiveOptimization/issues/12

Baijiong-Lin commented 1 year ago

I understand what you're saying about single-input and multi-input problems; maybe what you mean is hard sharing vs. soft sharing?

No. Whether a problem is single-input or multi-input does not depend on which MTL architecture you use. For example, given a face image as input, predicting age and gender simultaneously is a single-input MTL problem, while training on MNIST and CIFAR-10 together is a multi-input MTL problem. Both hard-sharing and soft-sharing patterns can handle single-input and multi-input problems.

But I don't think we can have both multi-input (figures 1 and 3) and single-input (figure 2) in the same problem.

Actually, the implementation in the original MGDA repo only considers the single-input case. The experiments in the MGDA paper are conducted on three single-input problems, i.e., MultiMNIST, CelebA, and Cityscapes.

The implementation of MGDA in LibMTL supports the multi-input case; you just need to set the argument multi_input to True.

I guess it may be a bug; there is a related issue on the original MGDA repo: https://github.com/isl-org/MultiObjectiveOptimization/issues/12

I think that issue discusses a different problem. Specifically, the MGDA algorithm takes the task gradients as input and outputs a weight for each task, and those weights determine the update of the task-shared parameters. That issue is about whether the weights generated by MGDA should also affect the update of the task-specific parameters.
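For two tasks the MGDA weighting has a closed form (Algorithm 1 of the MGDA paper); a sketch with made-up flattened gradients of the shared parameters:

```python
import torch

torch.manual_seed(0)
# Hypothetical flattened task gradients of the shared parameters.
g1, g2 = torch.randn(100), torch.randn(100)

# Two-task closed form: minimize ||a*g1 + (1-a)*g2|| over a in [0, 1].
alpha = torch.clamp(torch.dot(g2 - g1, g2) / (g1 - g2).norm() ** 2, 0.0, 1.0)
shared_direction = alpha * g1 + (1 - alpha) * g2

# Only the shared parameters are updated along shared_direction; each
# task-specific head keeps its own unweighted gradient, which is exactly
# the question the linked issue debates.
```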

A11en0 commented 1 year ago

Thanks! I had just figured it out when you replied! Your explanation is quite clear. I have read some MTL papers before but never came across the single-input vs. multi-input distinction (maybe I have read too little); I think it is indeed a significant one.

A11en0 commented 1 year ago

Sorry for my follow-up:

Actually, the implementation in the original MGDA repo only considers the single-input case. The experiments in the MGDA paper are conducted on three single-input problems, i.e., MultiMNIST, CelebA, and Cityscapes.

If MGDA is single-input, I think putting model['rep'](images, mask) inside or outside the loop is equivalent in figures 1 and 3. Is that right?

The computational graph is split into two parts by copying rep into another variable that requires grad (rep_variable), so loss.backward() does not compute the gradients of the parameters in model['rep'].

loss.backward() does not calculate the gradients of the parameters in model['rep']: is this because we only use rep_variable to calculate the weights in the lines referenced below, or do we somehow block model['rep'] in loss.backward()? https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/train_multi_task.py#L136-L141

Baijiong-Lin commented 1 year ago

If MGDA is single-input, I think putting model['rep'](images, mask) inside or outside the loop is equivalent in figures 1 and 3. Is that right?

Yep. But putting it inside the loop adds unnecessary computational cost.

loss.backward() does not calculate the gradients of the parameters in model['rep']: is this because we only use rep_variable to calculate the weights in the lines referenced below, or do we somehow block model['rep'] in loss.backward()? https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/train_multi_task.py#L136-L141

It is because .data.clone() on Lines 122 and 125 splits the computational graph. You can try the following example:

import torch
a = torch.rand(2, 3)
a.requires_grad = True
b = torch.sigmoid(a)
c = b.data.clone()       # copies b's values but detaches c from the graph
c.requires_grad = True
d = torch.relu(c).sum()
d.backward()             # backpropagation stops at c and never reaches a
print(a.grad)            # prints None

This example will print None. But if you use

c = b

instead of

c = b.data.clone()
c.requires_grad = True

then it will print the gradient of a.
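As a side note, modern PyTorch discourages touching `.data` directly; `detach()` gives the same graph split more safely. A sketch of the same example:

```python
import torch

a = torch.rand(2, 3, requires_grad=True)
b = torch.sigmoid(a)

# detach().clone() copies the values and cuts the graph, like .data.clone(),
# but in-place modifications of c are properly tracked by autograd.
c = b.detach().clone().requires_grad_(True)
d = torch.relu(c).sum()
d.backward()
print(a.grad)    # prints None: the graph was split at c
```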

A11en0 commented 1 year ago

Wonderful! This example clears up the confusion I've had for days. Thanks a lot!

A11en0 commented 1 year ago

[Screenshot of the MGDA-UB implementation]

Sorry, I have another question about the MGDA-UB implementation. Why is the backward pass performed only on the representation? Is that OK? I wonder if it is fine without computing the gradients of the decoders.

A11en0 commented 1 year ago

I realize that the decoders are updated in _compute_grad(). Meanwhile, I checked the gradient flow graph below, but how are the gradients of the decoders aggregated, as shown by the green lines? [Screenshot of the gradient flow graph]

A11en0 commented 1 year ago

I got it.