astorfi closed this issue 3 years ago.
Hi,
Thank you for pointing it out. In this implementation, an extra log(softmax) is applied to the network outputs. The CE loss already applies log(softmax) internally, so effectively the code computes log(softmax(log(softmax(x)))).
I ran some experiments removing the extra term, and the results were essentially the same (a margin of difference of 0.2 Dice). Technically, the extra log(softmax) does not change the loss values at all: log(softmax) only subtracts a per-sample constant (the log-sum-exp) from the logits, and softmax is invariant to constant shifts, so applying it twice is equivalent to applying it once. That is why training is as stable as with the normal CE. However, I agree that it is not needed, and I will correct the code in the next commit.
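For concreteness, here is a minimal sketch verifying this equivalence (assuming PyTorch; this is not the repository's code, just an illustration):

```python
# Minimal check (illustration only): feeding log_softmax outputs into
# cross_entropy yields the same loss as feeding raw logits, because
# log_softmax only subtracts a per-sample constant and softmax is
# shift-invariant.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1])  # ground-truth class indices

ce_plain = F.cross_entropy(logits, targets)                # normal CE
ce_double = F.cross_entropy(F.log_softmax(logits, dim=1),  # extra log(softmax)
                            targets)

print(ce_plain.item(), ce_double.item())   # identical up to float precision
assert torch.allclose(ce_plain, ce_double)
```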
Hi,
I think there is an issue with the loss function implementation:
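Roughly, the pattern in question looks like the following sketch (hypothetical names, not the repository's exact code): an extra log-softmax is applied before the cross-entropy loss, which applies log-softmax again internally.

```python
# Hypothetical illustration of the suspected pattern (identifiers are
# assumptions, not the repository's): the model output is passed through
# log_softmax and then into nn.CrossEntropyLoss, which applies
# log_softmax again internally.
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

def loss_fn(outputs, targets):
    log_probs = F.log_softmax(outputs, dim=1)  # extra log(softmax)
    return criterion(log_probs, targets)       # CE applies log(softmax) again
```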
I think this loss function implementation is not correct. You may still get similar results with the extra log-softmax applied to the outputs, since the loss function can still do a decent job, but the double application should be fixed.
Please investigate.