DPOP incentivizes every token toward the preferred completion, whereas vanilla DPO does not apply this pressure to earlier tokens.
From the NLL paper: "We see that for DPO without NLL loss there is a decrease over training for the chosen sequences, whereas for DPO with NLL there is not, which may help explain the improved performance of the latter."
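For concreteness, my reading of the two objectives is roughly as follows (the notation, including the $\lambda$ and $\alpha$ weights, is mine and not taken from any existing implementation): DPOP folds a penalty on drops in the chosen log-likelihood into the DPO logits, while the NLL paper adds a weighted, length-normalized NLL term on the chosen sequence.

$$
\mathcal{L}_{\mathrm{DPOP}} = -\log \sigma\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \lambda \max\left(0,\ \log\frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)}\right)\right]\right)
$$

$$
\mathcal{L}_{\mathrm{DPO+NLL}} = \mathcal{L}_{\mathrm{DPO}} + \alpha \cdot \frac{-\log \pi_\theta(y_w \mid x)}{\lvert y_w \rvert}
$$

Setting $\lambda = 0$ or $\alpha = 0$ recovers vanilla DPO in each case.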
Several modifications to the DPO loss function have been shown to improve DPO model quality. These include adding a weighted negative log-likelihood loss (https://arxiv.org/pdf/2404.19733) and a DPO-positive loss (https://arxiv.org/pdf/2402.13228).
We could implement these as additional float parameters in the DPO loss functions to add a weighted loss term, or as separate loss functions that you would combine. We should first assess how impactful these papers have been.
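If we go the "extra float parameters" route, a minimal sketch of what the combined loss could look like is below. This is not based on any existing loss signature in the repo; `dpop_lambda`, `nll_weight`, and `chosen_lengths` are hypothetical names for illustration, and setting both weights to 0.0 recovers vanilla DPO.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens, shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    chosen_lengths: torch.Tensor,         # token counts of chosen responses, for length-normalized NLL
    beta: float = 0.1,
    dpop_lambda: float = 0.0,             # hypothetical knob for the DPO-positive penalty; 0 disables it
    nll_weight: float = 0.0,              # hypothetical knob for the weighted NLL term; 0 disables it
) -> torch.Tensor:
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO-positive: penalize any drop of the chosen log-likelihood below the
    # reference model, folded into the DPO logits (arxiv 2402.13228).
    dpop_penalty = torch.clamp(ref_chosen_logps - policy_chosen_logps, min=0.0)
    logits = chosen_logratios - rejected_logratios - dpop_lambda * dpop_penalty

    losses = -F.logsigmoid(beta * logits)

    # Weighted, length-normalized NLL on the chosen sequences (arxiv 2404.19733).
    if nll_weight > 0.0:
        losses = losses + nll_weight * (-policy_chosen_logps / chosen_lengths)

    return losses.mean()
```

The alternative from above, separate loss classes that a recipe combines, would keep the vanilla DPO loss untouched at the cost of a bit more wiring.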
cc @SalmanMohammadi