Closed by JosephDiPalma 3 years ago
Thanks for the great reproduction! It turns out to be a documentation error. Our "inner product" loss function is half of the loss function in the BYOL paper. This only changes the optimal learning rate for SGD.
To mimic the BYOL paper's loss function more precisely, we have added a `loss_constant_factor` hyperparameter. Set it to 2 to reproduce the paper's loss. We have added this suggestion to the BYOL section of our README.
Edit for clarity: Why did we choose this loss? The inner-product loss used in our code corresponds to the standard InfoNCE loss without the softmax. Removing the factor of two introduced in the BYOL paper makes it easier to compare across implementations. The factor is most visible in their Eq. 3, which is twice the usual InfoNCE loss.
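To make the "half the paper's loss" claim concrete, here is a small numeric sketch (not the repository's actual code): BYOL's Eq. 3 loss, the squared L2 distance between L2-normalized vectors, expands to 2 - 2 * <q_hat, z_hat>, so it equals twice the negative inner product plus a constant. The variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)  # stand-in for the online network's prediction
z = rng.normal(size=8)  # stand-in for the target network's projection

# L2-normalize both vectors, as BYOL does before computing its loss.
q_hat = q / np.linalg.norm(q)
z_hat = z / np.linalg.norm(z)

# BYOL paper loss (Eq. 3): squared L2 distance of the normalized vectors.
byol_loss = np.sum((q_hat - z_hat) ** 2)

# Inner-product loss: negative cosine similarity of the same vectors.
inner_loss = -np.dot(q_hat, z_hat)

# ||q_hat - z_hat||^2 = 2 - 2 * <q_hat, z_hat> = 2 + 2 * inner_loss,
# so the two losses differ only by a factor of 2 and an additive constant,
# which leaves the gradient direction unchanged and rescales its magnitude.
assert np.isclose(byol_loss, 2 + 2 * inner_loss)
```

Since an additive constant does not affect gradients, multiplying the inner-product loss by 2 (the `loss_constant_factor`) reproduces the paper's optimization dynamics.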
The loss function for the BYOL model seems to be wrong. Below is a code snippet to demonstrate this issue.