kekmodel / MPL-pytorch

Unofficial PyTorch implementation of "Meta Pseudo Labels"

dot_product should be s_loss_old - s_loss_new, but is s_loss_new - s_loss_old? #6

Open zwd973 opened 3 years ago

zwd973 commented 3 years ago

Hello, thanks for your PyTorch implementation of MPL. I think the dot_product should be s_loss_old - s_loss_new, not s_loss_new - s_loss_old, for the reason shown here: [image: derivation]
just flip a coin
Am I wrong?

dagleaves commented 3 years ago

Where is that from?

According to the original code, it is s_loss_new - s_loss_old (link):

```python
dot_product = cross_entropy['s_on_l_new'] - shadow
```

where shadow is defined as (link):

```python
shadow = tf.get_variable(name='cross_entropy_old', shape=[], trainable=False, dtype=tf.float32)
shadow_update = tf.assign(shadow, cross_entropy['s_on_l_old'])
```
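For concreteness, here is a rough PyTorch analogue of that shadow-variable bookkeeping (a minimal sketch with illustrative names, not this repo's actual code):

```python
import torch

# Minimal sketch (illustrative, not this repo's code): mimic the TF `shadow`
# variable by caching the student's labeled-data loss from the previous step,
# then form the teacher's finite-difference signal as in the original code.
s_loss_old = torch.tensor(0.0)  # plays the role of `shadow` (cross_entropy_old)

def dot_product_signal(s_loss_new: torch.Tensor) -> torch.Tensor:
    """Returns new minus old, the sign used in the original TF code."""
    global s_loss_old
    dot_product = s_loss_new - s_loss_old  # the sign this issue is debating
    s_loss_old = s_loss_new.detach()       # shadow_update: cache for the next step
    return dot_product
```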
monney commented 3 years ago

@zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:

First order Taylor: f(x) = f(a) + f'(x)(x-a)

Let f(x) be the cross entropy function, where x is the new parameters. Let a be x+h, the old parameters (h is the gradient on the unlabeled data for the old parameters).

f(x) = f(x+h) + f'(x)(x-x+h)
f(x) - f(x+h) = f'(x) h

this is new cross entropy minus old.

zwd973 commented 3 years ago

> @zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:
>
> First order Taylor: f(x) = f(a) + f'(x)(x-a)
>
> Let f(x) be the cross entropy function, where x is the new parameters. Let a be x+h, the old parameters (h is the gradient on the unlabeled data for the old parameters).
>
> f(x) = f(x+h) + f'(x)(x-x+h)
> f(x) - f(x+h) = f'(x) h
>
> this is new cross entropy minus old.

Yeah, but f(x) = f(x+h) + f'(x)(x-(x+h)) = f(x+h) + f'(x)(x-x-h) = f(x+h) - f'(x) h, not f(x+h) + f'(x)(x-x+h), isn't it?

monney commented 3 years ago

@zwd973 You're right, I missed a negative. Interesting. The original author's code is wrong here, then.

zwd973 commented 3 years ago

OK, thanks.

monney commented 3 years ago

@kekmodel this might be why it got worse? Though I'm not sure how the author was able to replicate the results.

kekmodel commented 3 years ago

I agree. I thought the sign might change in the process of calculating moving_dot_product, but I confirmed that it does not. I will test again. Thanks!

zwd973 commented 3 years ago

@kekmodel Hello, how are the new results?

kekmodel commented 3 years ago

Unfortunately, all test accuracies are around 94.4. The MPL loss doesn't seem to work. I'll have to wait until the author updates the code.

monney commented 3 years ago

@kekmodel that's unfortunate to hear, but thank you for all your work thus far. The number of discrepancies in the original code makes things quite difficult.

dgedon commented 3 years ago

If I am not mistaken, the first-order Taylor expansion goes as f(x) = f(a) + f'(a)(x-a), so there should be f'(a) instead of f'(x). Then, with the same notation as above (f(x) as cross-entropy, x as the new parameters, a = x+h as the old parameters), we get

f(x) = f(x+h) + f'(x+h)(x-(x+h)) 
     = f(x+h) - f'(x+h) h
f(x+h)-f(x) = f'(x+h) h

where h is described above as the gradient. This is already a problem, since the right-hand side is not the dot product between the gradient at the new parameters and the gradient at the old parameters; it is more like the gradient at the old parameters squared, if I understand correctly.

From that perspective, the first-order Taylor approximation does not make sense. Can you confirm, or tell me if and where I am wrong?
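A quick numerical check of this point (a toy quartic loss and illustrative names, nothing MPL-specific): after a small gradient step, the finite difference matches the first-order term, and the gradients at the old and new parameters give nearly the same dot product.

```python
import torch

torch.manual_seed(0)

def f(w):
    # stand-in for the cross-entropy loss; any smooth function works
    return (w ** 4).sum()

old = torch.randn(5, requires_grad=True)
g_old = torch.autograd.grad(f(old), old)[0]   # gradient at the old parameters

lr = 1e-3
new = (old - lr * g_old).detach().requires_grad_(True)
g_new = torch.autograd.grad(f(new), new)[0]   # gradient at the new parameters

print((f(new) - f(old)).item())               # finite difference: new minus old
print((-lr * g_old @ g_old).item())           # f'(old) . (new - old): "gradient squared"
print((-lr * g_new @ g_old).item())           # f'(new) . (new - old): the mixed dot product
```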

monney commented 3 years ago

> If I am not mistaken, the first-order Taylor expansion goes as f(x) = f(a) + f'(a)(x-a), so there should be f'(a) instead of f'(x). Then, with the same notation as above (f(x) as cross-entropy, x as the new parameters, a = x+h as the old parameters), we get
>
>     f(x) = f(x+h) + f'(x+h)(x-(x+h)) 
>          = f(x+h) - f'(x+h) h
>     f(x+h)-f(x) = f'(x+h) h
>
> where h is described above as the gradient. This is already a problem, since the right-hand side is not the dot product between the gradient at the new parameters and the gradient at the old parameters; it is more like the gradient at the old parameters squared, if I understand correctly.
>
> From that perspective, the first-order Taylor approximation does not make sense. Can you confirm, or tell me if and where I am wrong?

You’re correct from what I can see. Sorry, I did the derivation quickly and haphazardly, which is why it’s wrong lol.

This quantity still has meaning, since h is the gradient produced by the loss on the unlabeled target and this is the loss on the labeled data. So we're essentially trying to get the teacher to produce the same loss as if the student were training on labeled data. But this also doesn't seem to be what was derived in the paper; there's supposed to be a time offset.

hyhieu commented 3 years ago

About your derivations. I do not see anything wrong with @dgedon's derivation.

> f(x) = f(a) + f'(a)(x-a). So there should be f'(a) instead of f'(x). Then, with the same notation as above (f(x) as cross-entropy, x as the new parameters, a = x+h as the old parameters), we get
>
>     f(x) = f(x+h) + f'(x+h)(x-(x+h)) 
>          = f(x+h) - f'(x+h) h
>     f(x+h)-f(x) = f'(x+h) h

Comparing this to my derivation below, it looks like the difference is in the very first place: you start at f(x+h) while I start at f(x). Details are in the equations, but intuitively, I think the Taylor expansion says that locally, functions behave linearly in their gradients' direction. That is why we can start at either f(x+h), which leads to your derivation, or at f(x), which leads to my derivation below.


About Taylor. My understanding is as follows. Using your notation, x is the new parameters, x+h is the old parameters, and h is the gradient computed at the old parameters (so h is used to go from x+h to x).

       f(x+h) = f(x) + f'(x) * h
f(x+h) - f(x) = f'(x) * h

[image: Equation 12 from the paper] Corresponding this to Equation 12 in the paper, which I copied above: x is \theta'_S (the red box) and h is the blue box (sorry, this is a different h from the scalar h in the screenshot).


About using soft labels. If you use soft labels, you do not even need Taylor or the log-gradient trick, because the entire process is differentiable and you can do some Hessian/Jacobian vector product tricks instead.

In my implementation for this, I created shadow variables that hold the student's parameters, then built a computational graph to compute the gradients of these shadow variables using tf.gradients. Then I manually computed the derivative with respect to the optimizers (note that everything we are discussing here is still subject to the computations inside optimizers such as RMSProp or Momentum). From these values, you can follow this guide to compute the correct, non-approximated gradient for the teacher.

For ugly reasons (exceeding graph proto size limits, if you are curious), this implementation did not run with GShard, which we used for model parallelism, so we decided to do the approximation instead.
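(For readers of this thread: a minimal PyTorch sketch of this exact, non-approximated teacher gradient, with toy linear models, soft pseudo labels, plain SGD, and illustrative names; as noted above, a real setup must also differentiate through the optimizer's internal computations.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_u, x_l = torch.randn(8, 4), torch.randn(8, 4)   # unlabeled / labeled batches
y_l = torch.randint(0, 3, (8,))

w_t = torch.randn(4, 3, requires_grad=True)        # teacher parameters
w_s = torch.randn(4, 3, requires_grad=True)        # student parameters
lr_s = 0.1

# Student step on the teacher's soft pseudo labels. create_graph=True keeps
# the student update differentiable with respect to the teacher.
soft = F.softmax(x_u @ w_t, dim=1)
s_loss_u = -(soft * F.log_softmax(x_u @ w_s, dim=1)).sum(dim=1).mean()
g_s = torch.autograd.grad(s_loss_u, w_s, create_graph=True)[0]
w_s_new = w_s - lr_s * g_s                         # updated student, still in the graph

# Teacher objective: the updated student's loss on labeled data. Backprop
# through the update gives the exact meta-gradient; no Taylor approximation.
t_loss = F.cross_entropy(x_l @ w_s_new, y_l)
g_t = torch.autograd.grad(t_loss, w_t)[0]
print(g_t.norm().item())
```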


Update code. I got some internal pushback because I was trying to update the code and release the trained checkpoints at the same time. I apologize for the delay and will try to push on this more.

dgedon commented 3 years ago

Thanks @hyhieu.

About Taylor expansion: It works out nicely when you start your way. However, I have two follow-up remarks/questions:

  1. With your derivation, I think you get student loss old - student loss new, which is different from your implementation here, where you have loss new - loss old, and it is also different from this repository: https://github.com/kekmodel/MPL-pytorch/blob/bdedb8e2d2514d1aaee455a5a3a0668b6c3ed60b/main.py#L216 This is actually the main discussion point of this issue.
  2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, this repository's code does the following. Is this a mistake, or do I misinterpret the equations? https://github.com/kekmodel/MPL-pytorch/blob/bdedb8e2d2514d1aaee455a5a3a0668b6c3ed60b/main.py#L197

About soft labels: I have to think this through a bit more. In (10) of your paper, when using soft labels you just have a 'smoothed' version instead of a one-hot encoding with hard labels for \hat{y}_u. From this point I don't understand how this changes the derivation.

zxhuang97 commented 3 years ago

> Thanks @hyhieu.
>
> About Taylor expansion: It works out nicely when you start your way. However, I have two follow-up remarks/questions:
>
> 1. With your derivation, I think you get student loss old - student loss new, which is different from your implementation here, where you have loss new - loss old, and it is also different from this repository: https://github.com/kekmodel/MPL-pytorch/blob/bdedb8e2d2514d1aaee455a5a3a0668b6c3ed60b/main.py#L216 This is actually the main discussion point of this issue.
>
> 2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, this repository's code does the following. Is this a mistake, or do I misinterpret the equations? https://github.com/kekmodel/MPL-pytorch/blob/bdedb8e2d2514d1aaee455a5a3a0668b6c3ed60b/main.py#L197
>
> About soft labels: I have to think this through a bit more. In (10) of your paper, when using soft labels you just have a 'smoothed' version instead of a one-hot encoding with hard labels for \hat{y}_u. From this point I don't understand how this changes the derivation.

@dgedon For the second question, I think we are approximating the red box (the gradient of the loss on labeled data w.r.t. the updated parameters). So when using the finite difference, we should use the same data (labeled data) with different parameters (old/new).
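To make that concrete, a sketch of the finite-difference recipe under that reading (a toy linear student and illustrative names, not the repo's exact code): the same labeled batch is evaluated before and after the student's step on the pseudo-labeled batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_l, y_l = torch.randn(8, 4), torch.randint(0, 3, (8,))        # labeled batch
x_u, y_pseudo = torch.randn(8, 4), torch.randint(0, 3, (8,))   # teacher's hard pseudo labels

w_s = torch.randn(4, 3, requires_grad=True)
opt = torch.optim.SGD([w_s], lr=0.1)

with torch.no_grad():
    s_loss_l_old = F.cross_entropy(x_l @ w_s, y_l)   # old params, labeled data

opt.zero_grad()
F.cross_entropy(x_u @ w_s, y_pseudo).backward()       # student step on pseudo labels
opt.step()

with torch.no_grad():
    s_loss_l_new = F.cross_entropy(x_l @ w_s, y_l)   # new params, SAME labeled data

dot_product = s_loss_l_old - s_loss_l_new             # old minus new, per the derivations above
print(dot_product.item())
```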

easonyang1996 commented 3 years ago

@kekmodel Hi, thanks for the implementation! So, is it clear now which one is right: loss_new - loss_old or loss_old - loss_new?

Adamdad commented 3 years ago

[image: derivation] This might be a clearer derivation.

monney commented 3 years ago

I think the correct formula is old - new, based on the several derivations that have been done here. But I don't think the MPL loss really has an effect either way, judging from the experiments here, my own experiments, and the fact that the reference code has the sign flipped but still replicates the results.

I have a custom implementation that I did at work; on our datasets, I was able to get good results with it. It beat UDA alone and the other contrastive techniques I tried. As an aside, it only worked if I used a much larger unlabeled batch size (7x multiplier). This is similar to the released code, but the paper claimed it should work 1 to 1.

I ran an extensive hyperparameter search to see if the MPL loss helps at all; it seemed to make no real difference no matter the settings (at least on the several internal problems I tried it on). They are of comparable size and difficulty to CIFAR-10; one is much larger and closer to ImageNet. I also tried several networks. The hyperparameter search did not tend toward keeping the loss or dropping it: there was no statistical difference among the temperatures of the loss, including a temperature of 1.0, which disables the loss, and taking the best settings with or without the MPL loss active makes no difference. Maybe it helps for ImageNet or CIFAR-10, but the experiments here don't support that, and it certainly does not help for the various problems I tried it on. That being said, the procedure itself works quite well, just not, I think, due to the MPL loss.

@kekmodel not sure if you have run experiments with larger unlabeled batch sizes, but it's probably worth trying: I couldn't get it to work without this, and under this setting it performs better than anything else I tried.

zxhuang97 commented 3 years ago

@monney Thank you for your valuable insights. I have some follow-up questions regarding your experiments.

  1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

  2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after a hyper-parameter search (including a larger batch size)?

Thanks : )

monney commented 3 years ago

> @monney Thank you for your valuable insights. I have some follow-up questions regarding your experiments.
>
> 1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

All my experiments for both were done with larger unlabeled batch sizes and similar training. The benefit almost certainly comes from the self-distillation procedure and the unique finetuning phase of MPL.

> 2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after a hyper-parameter search (including a larger batch size)?

It works, and works better than the other contrastive learning methods I've tried (UDA, BYOL, SimCLR, NoisyStudent). But the actual MPL loss seems to have no major effect on the results, and I think the other differences in this paper are largely responsible for the increased performance. My guess is that in the end the paper ends up being very similar to FixMatch.

cheers

zxhuang97 commented 3 years ago

@monney I see. That's a little surprising, as the MPL objective makes a lot of sense to me. Also, figure 3 in the appendix breaks down the contribution of each component, and it shows that using the MPL loss makes a huge difference.

monney commented 3 years ago

@zxhuang97 it makes a lot of sense to me as well, so it's confusing. I'll update if I find bugs or anything, but I've done a lot of testing. The breakdown in fig. 3 includes the entire MPL procedure, I'm pretty sure, so it's difficult to isolate just the loss contribution. UDA there is just the standard UDA procedure.

zxhuang97 commented 3 years ago

> @zxhuang97 it makes a lot of sense to me as well, so it's confusing. I'll update if I find bugs or anything, but I've done a lot of testing. The breakdown in fig. 3 includes the entire MPL procedure, I'm pretty sure, so it's difficult to isolate just the loss contribution. UDA there is just the standard UDA procedure.

I guess you're right. The UDA module in the official implementation doesn't include the teacher & student parts, so it's not really a fair comparison. Thank you for the information!

jacobunderlinebenseal commented 2 years ago

When training converges, theoretically, both s_loss_old - s_loss_new and s_loss_new - s_loss_old will be zero. Is this the way it should be? Has anyone tried calculating the dot product without the Taylor approximation? Does it work?

Jacfger commented 2 years ago

@jacobunderlinebenseal Wouldn't it make sense for it to be zero, though? The idea is for the teacher model to receive feedback from the performance of the student. When the student is good enough (or "converging", I suppose), shouldn't the teacher get a near-zero update?

milanlx commented 2 years ago

> [image: derivation] This might be a clearer derivation.

One question: in the paper, the product is between the supervised and unsupervised gradients, which is different from the code.

DaehanKim commented 4 months ago

I read the whole thread and did the derivation again. I believe the correct implementation is old_s_loss - new_s_loss.

The first-order Taylor expansion goes $$f(x) = f(a) + f'(a)(x-a)$$

Let $a = x+h$, where $a$ is the new parameters, and let $f(\cdot)$ be the cross-entropy loss, as above.

Then old_s_loss - new_s_loss becomes $$f(x) - f(x+h) = -f'(x+h) \cdot h$$

and by definition

$$h = -\eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s}))$$

and $f'(x+h)$ becomes

$$f'(x+h) = \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

Thus,

$$f(x)-f(x+h) = \eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s})) \cdot \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

And this quantity is what we see in the paper: [image: Equation 12 from the paper]