Open sokrypton opened 2 months ago
That's right - But this is not a bug. Perhaps we should rename the util function to clarify that this is the unmasked (WT) version of the pseudo-log-likelihood as described in the paper. As correctly pointed out, computing the PLL requires L inference steps, one for each position. WT-PLL on the other hand requires only one, making large scale indel effect prediction way more efficient.
We've seen that while WT-PLL significantly overestimates the exact PLL (due to conditioning on the WT amino acid), it is a very good approximation in maintaining the relative ranking across variants! Here is one example where we randomly mutate up to four residues in INS (P01308) and measure the Spearman correlation between WT-PLL and the true/exact PLL, as computed by ESM-1b.
In this example the overall Spearman correlation is ρ=0.94. It would be interesting to demonstrate that this relationship holds across many proteins and different kinds of mutations. In the paper we show that the unmasked (WT) version of the pseudo-log-likelihood suffices to accurately predict the pathogenicity of ClinVar indels.
It's a bug because you call it pseudo-likelihood but it is not computing pseudo-likelihood.
The issue is now others are using your code and calling it pseudo-likelihood. So it would be good to fix it in the code and ideally also in paper.
Maybe you can call it unmask-CCE (categorical cross entropy)? Putting pll or ll in the name will upset people 😅
What is reported in github is not pseudo-log-likelihood: https://github.com/ntranoslab/esm-variants/blob/main/esm_variants_utils.py#L90
Pseudo-likelihood requires masking each position one-by-one, and then only evaluating the cross-entropy at masked positions only.