userDJX opened this issue 5 years ago
The reason might be the distillation loss, which I did not implement.
I'm new to incremental learning. When I read your code, I noticed that you adjusted the parameters a bit to bring the results closer to the paper. As for the distillation loss, I think what you wrote is consistent with the paper, so I'm not sure what else could be causing the gap.
Excuse me, do you have any way to reach the same level of results as the paper? I hope you can help me (1109039558@qq.com), thank you.
I think the best way is to contact the author of that paper.
A major issue with your implementation is that the layers of the main model remain trainable while the bias correction parameters are being tuned, when they should ideally be frozen. Likewise, the bias layer's parameters should be frozen while the FC and convolutional layers are being trained.
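Concretely, something like this two-stage schedule is what I mean (a minimal sketch with hypothetical helper names, not this repo's actual API):

```python
import torch
import torch.nn as nn

class BiasLayer(nn.Module):
    """Linear correction (alpha * logits + beta) applied to new-class
    logits, as described in the BiC paper."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, logits):
        return self.alpha * logits + self.beta

def set_stage(model, bias_layer, train_bias):
    """train_bias=False: train the conv/FC layers, freeze the bias layer.
    train_bias=True: freeze the main model, train only the bias layer."""
    for p in model.parameters():
        p.requires_grad = not train_bias
    for p in bias_layer.parameters():
        p.requires_grad = train_bias
    # eval() while tuning the bias layer also stops batch-norm statistics
    # from drifting on the small validation exemplar set
    model.train(mode=not train_bias)
```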
Hello, I seem to have found a problem with this code. If the exemplar set is removed and the bias correction is removed, the remaining part should be LwF, but when I run that LwF variant, the results are still wrong.
I'm thinking about two things: one is that the FC layer in this code outputs 10 classes directly, and the other is the handling of the network parameters. I feel that as long as the accuracy of the underlying LwF part is improved, the accuracy of this code will improve, but my ability is limited; I hope you can help me ~
@srvCodes Thank you for pointing out my mistakes. I have changed the code. @userDJX After freezing the parameters during training, the results seem better. Thank you for your response and advice.
If you want to improve the accuracy of the incremental steps, you can try modifying the size of train_x from 9000 to 10000 in the CIFAR-100 code. Also, in the BiC algorithm, the paper only corrects the bias of the new classes; the bias of the old classes is left unchanged, which you can verify. The last point is that calls to previous_model need to be wrapped in torch.no_grad(), and self.previous_model.eval() should be set.
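To make the last two points concrete, a sketch of what I mean (function names are illustrative, not the ones in this repo):

```python
import torch

def forward_with_bias(model, bias_layer, x, n_old):
    """Apply the bias correction to the new-class logits only;
    the old-class logits pass through unchanged, as in the paper."""
    logits = model(x)
    old_logits = logits[:, :n_old]
    new_logits = bias_layer(logits[:, n_old:])
    return torch.cat([old_logits, new_logits], dim=1)

def old_model_targets(previous_model, x):
    """Distillation targets from the frozen previous model:
    eval() fixes batch-norm/dropout, no_grad() blocks gradient flow."""
    previous_model.eval()
    with torch.no_grad():
        return previous_model(x)
```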
I hope your algorithm can reach the results of the paper as soon as possible.
I reproduced it successfully; the results were 0.817, 0.7265, 0.6555, 0.5971, 0.5561. I went back over the experimental section of the BiC paper and found that the authors might have deliberately picked their best numbers to report. The reason is that in Figure 8 of the paper, the first 20 categories involve no increment; if the same backbone, such as ResNet, is used for training, the purple curve is unlikely to be 2% higher than the other curves, such as iCaRL.
It's possible. Thank you for your help!
I have incorporated the same in a dynamic model along with a couple of other details, e.g., the authors say that the bias correction should be done only after the second incremental batch has arrived. You can find the implementation at https://github.com/srvCodes/continual-learning-benchmark. @sairin1202 - thanks for making your code public; this would not have been possible without it. :+1:
Hello. May I ask why you multiply the distillation loss by T² and not by alpha?
I would say that the original formula of distillation is:
loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target
instead of
loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target
@EdenBelouadah I think they scale the distillation loss by T² because that's what they say to do in the original knowledge distillation paper when using both soft and hard targets in the loss:
"Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters."
- (See the last paragraph of section 2 "Distillation" here).
I don't think they do this scaling in the original Large Scale Incremental Learning paper, though. (See calculation of loss here.) It looks like in the original implementation, they use:
loss = alpha * loss_soft_target + (1-alpha) * loss_hard_target
as described in the paper.
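For anyone following along, a minimal sketch of the T²-scaled soft-target term that quote describes (the function name is mine, not from either paper):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target KD loss; the final T*T rescaling keeps its gradient
    magnitude comparable to the hard-target loss, per Hinton et al."""
    log_p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction='batchmean') * T * T
```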
Thank you for the answer. I understand the use of T². However, the distillation used here is:
loss = loss_soft_target * T * T + (1-alpha) * loss_hard_target
I still don't understand why loss_hard_target is multiplied by (1-alpha). alpha is supposed to weight the contribution of the distillation vs. classification loss, isn't it? (I mean, shouldn't we multiply loss_soft_target * T * T by alpha?) Thank you.
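In other words, the form I would expect is something like this (a sketch, not this repo's code; if I read the BiC paper correctly, it sets alpha = n_old / (n_old + n_new)):

```python
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, labels, n_old, n_new, T=2.0):
    """alpha weights distillation vs. classification; with the T^2 scaling
    applied inside the soft-target term, both conventions coexist."""
    alpha = n_old / (n_old + n_new)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```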
Thank you for your contribution. I would like to ask why there is such a big gap between the experimental results and the paper; I can't figure it out right now.