sairin1202 / BIC

PyTorch implementation of "Large Scale Incremental Learning"
64 stars 20 forks

The reason why the classification accuracy is different from the result of the paper #1

Open userDJX opened 5 years ago

userDJX commented 5 years ago

Thank you for your contribution. May I ask why there is such a large gap between the experimental results of this code and those reported in the paper? I have not been able to figure it out.

sairin1202 commented 5 years ago

The reason might be the distillation loss that I did not implement.

userDJX commented 5 years ago

I am new to incremental learning. Reading your code, I noticed that you tuned some parameters slightly to bring the results closer to the paper. As for the distillation loss, I think what you wrote is consistent with the paper, so I am still not sure where the gap comes from.

userDJX commented 5 years ago

Excuse me, do you know of any way to reach the same level of results as the paper? I would appreciate your help: 1109039558@qq.com. Thank you.

sairin1202 commented 5 years ago

I think the best way is to contact the author of that paper.

srvCodes commented 4 years ago

A major issue with your implementation is that the layers of the main model remain trainable while the bias correction parameters are being adjusted, when they should ideally be frozen. Conversely, the bias layer's parameters should be frozen while training the FC and convolutional layers.
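A minimal sketch of this two-stage freezing, assuming a BIC-style two-parameter bias layer (the `BiasLayer` class and the stage structure here are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: "model" for the main network, and a BIC-style
# two-parameter linear bias correction applied to the new-class logits.
class BiasLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, logits):
        return self.alpha * logits + self.beta

model = nn.Linear(8, 10)  # stands in for the FC/conv layers
bias_layer = BiasLayer()

# Stage 1: train the main network; the bias layer stays frozen.
for p in model.parameters():
    p.requires_grad = True
for p in bias_layer.parameters():
    p.requires_grad = False

# Stage 2: bias correction on the validation exemplars; freeze the
# whole main model and optimize only the two bias parameters.
for p in model.parameters():
    p.requires_grad = False
for p in bias_layer.parameters():
    p.requires_grad = True

bias_optimizer = torch.optim.SGD(bias_layer.parameters(), lr=0.01)
```

Passing only `bias_layer.parameters()` to the stage-2 optimizer, together with the `requires_grad = False` flags, ensures the main model cannot drift while the bias is corrected.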

userDJX commented 4 years ago

Hello, I think I have found a problem with this code. If the exemplar set and the bias layer are removed, the remaining part should be equivalent to LwF, but when I run that LwF variant, the result is still wrong.

I am considering two possibilities: one is that the FC layer in this code directly outputs 10 logits; the other concerns the network parameters. I believe that once the accuracy of the LwF part is improved, the accuracy of this code will improve as well, but my ability is limited. I hope you can help me.

sairin1202 commented 4 years ago

@srvCodes Thank you for pointing out my mistakes; I have changed the code. @userDJX After freezing the parameters during training, the results look better. Thank you for your responses and advice.

userDJX commented 4 years ago

If you want to improve the incremental accuracy, you can try changing the size of train_x from 9000 to 10000 in the CIFAR-100 code. Also, in the BIC algorithm, the paper only corrects the bias of the new classes; the bias of the old classes is left unchanged, which you can verify. The last point is that calls to previous_model need to be wrapped in torch.no_grad(), or preceded by self.previous_model.eval().

userDJX commented 4 years ago

I hope your implementation can reach the results of the paper as soon as possible.

userDJX commented 4 years ago

I reproduced it successfully; the results were 0.817, 0.7265, 0.6555, 0.5971, 0.5561. I went back over the experimental section of the BIC paper and found that the authors might have deliberately selected the best numbers to report. The reason is that in Figure 8 of the paper, the first 20 classes show no increment, so if the same backbone (e.g., ResNet) is used for training, the purple curve is unlikely to be 2% higher than the other curves, such as iCaRL.

sairin1202 commented 4 years ago

It's possible. Thank you for your help!

srvCodes commented 4 years ago

I have incorporated the same with a dynamic model and a couple of other details; e.g., the authors say that the bias correction should be done only after the second incremental batch has arrived. You can find the implementation at https://github.com/srvCodes/continual-learning-benchmark. @sairin1202 - thanks for making your code public; this would not have been possible without it. :+1:

EdenBelouadah commented 3 years ago

Hello. May I ask why you multiply the distillation loss by T² and not by alpha?

I would say that the original formula of distillation is:

loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target

instead of

loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target

bwolfson97 commented 3 years ago

@EdenBelouadah I think they scale the distillation loss by T² because that's what they say to do in the original knowledge distillation paper when using both soft and hard targets in the loss:

"Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters."

  • (See the last paragraph of section 2 "Distillation" here).

I don't think they do this scaling in the original Large Scale Incremental Learning paper, though. (See the calculation of the loss here.) It looks like the original implementation uses:

loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target

as described in the paper.
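A sketch of that weighting, assuming a BIC-style split where the first `n_old` logits belong to the old classes (the function name, the split, and all hyperparameter values here are illustrative, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

def bic_loss(new_logits, old_logits, labels, T=2.0, n_old=5, alpha=0.5):
    """alpha * soft + (1 - alpha) * hard, as in the original BIC implementation."""
    # Soft targets from the previous model, over the old classes only,
    # softened by temperature T.
    soft_target = F.softmax(old_logits[:, :n_old] / T, dim=1)
    log_probs = F.log_softmax(new_logits[:, :n_old] / T, dim=1)
    loss_soft_target = -(soft_target * log_probs).sum(dim=1).mean()

    # Ordinary cross-entropy on the ground-truth labels over all classes.
    loss_hard_target = F.cross_entropy(new_logits, labels)

    return alpha * loss_soft_target + (1 - alpha) * loss_hard_target

new_logits = torch.randn(4, 10)   # current model outputs (dummy data)
old_logits = torch.randn(4, 10)   # previous model outputs (dummy data)
labels = torch.randint(0, 10, (4,))
loss = bic_loss(new_logits, old_logits, labels)
```

In the BIC paper, alpha is set from the ratio of old to total classes rather than being a free hyperparameter.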

EdenBelouadah commented 3 years ago

> @EdenBelouadah I think they scale the distillation loss by T² because that's what they say to do in the original knowledge distillation paper when using both soft and hard targets in the loss:
>
> "Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters."
>
> • (See the last paragraph of section 2 "Distillation" here).
>
> I don't think they do this scaling in the original Large Scale Incremental Learning paper, though. (See the calculation of the loss here.) It looks like the original implementation uses:
>
> loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target
>
> as described in the paper.

Thank you for the answer. I understand the use of T². However, the distillation loss used here is:

loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target

I still don't understand why loss_hard_target is multiplied by (1-alpha). alpha is supposed to weight the contribution of the distillation loss versus the classification loss, isn't it? (I mean, shouldn't we also multiply loss_soft_target * T * T by alpha?) Thank you.
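For reference, the original distillation paper applies both pieces being discussed here: the soft term carries the alpha weight and the T² gradient-scaling together. A sketch of that standard formulation (function name and default values are illustrative):

```python
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation loss: the soft term is weighted by
    alpha AND scaled by T^2 so its gradient magnitude matches the hard term."""
    loss_soft_target = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    )
    loss_hard_target = F.cross_entropy(student_logits, labels)
    return alpha * (T ** 2) * loss_soft_target + (1 - alpha) * loss_hard_target

student = torch.randn(4, 10)   # dummy logits
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = hinton_kd_loss(student, teacher, labels)
```

Under this formulation, alpha trades off distillation against classification, and T² only compensates for the 1/T² shrinkage of the soft-target gradients.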