ntranoslab / esm-variants

MIT License

Unrealistic results for frameshifts. #9

Closed esiefker closed 3 months ago

esiefker commented 9 months ago

Running 'esm_score_multi_residue_mutations.py' gives me unrealistic results for frameshift variants.

e.g. I prepared a query for NM_000382.3 c.103delC

MELEVRRVRQAFLSGRSRPLRFRLQQLEALRRMVQEREKDILTAIAADLCKSEFNVYSQEVITVLGEIDFMLENLPEWVTAKPVKKNVLTMLDEAYIQPQPLGVVLIIGAWNYPFVLTIQPLIGAIAAGNAVIIKPSELSENTAKILAKLLPQYLDQDLYIVINGGVEETTELLKQRFDHIFYTGNTAVGKIVMEAAAKHLTPVTLELGGKSPCYIDKDCDLDIVCRRITWGKYMNCGQTCIAPDYILCEASLQNQIVWKIKETVKEFYGENIKESPDYERIINLRHFKRILSLLEGQKIAFGGETDEATRYIAPTVLTDVDPKTKVMQEEIFGPILPIVPVKNVDEAINFINEREKPLALYVFSHNHKLIKRMIDETSSGGVTGNDVIMHFTLNSFPFGGVGSSGMGAYHGKHSFDTFSHQRPCLLKSLKREGANKLRYPPNSQSKVDWGKFFLLKRFNKEKLGLLLLTFLGIVAAVLVKAEYY,MELEVRRVRQAFLSGRSRPLRFRLQQLEALRRMVRSARRIS,35

I get a score of 59.96288

I'm pretty sure a variant that deletes 90% of this gene would be deleterious. Why is esm-variants telling me it's benign?

nadavbra commented 9 months ago

@esiefker PLLR was designed for scoring relatively small indels and has never been tested on frameshift variants (in particular ones that delete most of the protein). Finding the best way to deal with frameshifts would require additional research and evaluation. A simple idea I could suggest is to try weighted=True in the get_PLLR function (in the esm_variants_utils Python module). In your case, it would consider the average likelihood of residues instead of their total likelihood, which might be more suitable when the mutated sequence is much shorter than the wild-type sequence. In your specific example, the average likelihood is -0.37 for the mutated sequence and -0.15 for the WT sequence, so the length-weighted PLLR would be -0.21.
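To make the difference between total and average likelihood concrete, here is a rough arithmetic sketch in Python. It does not call esm-variants. It only uses the rounded averages quoted above, the 485 aa wild-type length stated later in the thread, and the 41 aa length of the truncated sequence in the query. The variable names and the `total_ll` helper are illustrative, and it assumes PLLR is the mutant likelihood minus the wild-type likelihood. Because the inputs are rounded, the results only approximate the reported values.

```python
# Rounded per-residue average log-likelihoods quoted in the comment above.
WT_MEAN_LL, MUT_MEAN_LL = -0.15, -0.37
# Sequence lengths: 485 aa wild type, 41 aa truncated mutant.
WT_LEN, MUT_LEN = 485, 41


def total_ll(mean_ll, length):
    """Recover an approximate total log-likelihood from a per-residue mean."""
    return mean_ll * length


# Unweighted: difference of totals. The long wild type accumulates far more
# negative log-likelihood than the short mutant, so the difference is large
# and positive even though the mutant's residues are individually less likely.
unweighted = total_ll(MUT_MEAN_LL, MUT_LEN) - total_ll(WT_MEAN_LL, WT_LEN)

# Length-weighted: difference of per-residue means.
weighted = MUT_MEAN_LL - WT_MEAN_LL

print(round(unweighted, 2))  # 57.58, close to the reported 59.96
print(round(weighted, 2))    # -0.22, close to the quoted -0.21
```

Under these assumptions, the high unweighted score comes mostly from the length mismatch rather than from the mutant sequence being judged plausible.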

esiefker commented 9 months ago

Doesn't that number still seem high? If the threshold is around -7.5 for pathogenicity, -0.21 seems high.

I'm seeing the same thing with a stop gain, which is reported to work in the paper. Figure 6 contains a stop gain at position 25, in what appears to be AIRE, which is 545aa long.

I have a stop gain at position 42 in my 485aa protein, which is getting a score of 59.587837.

If I understand correctly, the score for a stop gain is supposed to be the score of the most deleterious missense mutation past that point. But D48W scores -18.84. Shouldn't L42* be equal to or less than that score?

And, just a general question, why are frameshifts considered different from indels in the protein context? Is it just that they tend to be larger? Is there a length threshold beyond which the scoring is unreliable?

test_output.csv

nadavbra commented 9 months ago

You cannot compare the LLR scores of missense mutations (where a reasonable threshold would be -7.5) to the PLLR scores of other types of mutations, which are on a totally different scale. The length-weighted PLLR scores in particular have a very different scale.

And yes, I think the main reason vanilla PLLR wouldn't generalize well to frameshift variants is that the wild-type and mutated sequences tend to be of totally different lengths. I'm not sure what a good length threshold would be, because we've never looked into that. If you want to look into it, I'd recommend looking at the lengths of indel mutations in ClinVar, which would give you a ballpark figure of what mutation lengths were studied in our paper (as we know that PLLR generally performs well for these indels).

nadavbra commented 9 months ago

For stop gain mutations, I'd recommend taking the lowest LLR score of missense mutations following the stop-gain site, as we did in the paper, rather than using PLLR scores.
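A minimal sketch of that recommendation, assuming the missense LLR scores are already available as a dictionary keyed by (position, alternate amino acid). The function name `stop_gain_score`, the dictionary layout, and the `include_site` switch are hypothetical and not part of esm-variants. Whether the stop-gain position itself counts as "following" is also an assumption here. Apart from the D48W value quoted in the thread, the scores are made up.

```python
def stop_gain_score(llr, stop_pos, include_site=True):
    """Lowest missense LLR at or after a stop-gain site.

    llr: dict mapping (1-based position, alternate amino acid) -> LLR score.
    include_site: count the stop-gain position itself (an assumption).
    """
    first = stop_pos if include_site else stop_pos + 1
    downstream = [score for (pos, _aa), score in llr.items() if pos >= first]
    if not downstream:
        raise ValueError("no missense LLR scores at or after the stop-gain site")
    return min(downstream)


toy_llr = {
    (41, "A"): -3.1,    # made-up value, upstream of the stop gain
    (42, "P"): -9.4,    # made-up value
    (48, "W"): -18.84,  # D48W, the one score quoted in the thread
    (60, "G"): -12.0,   # made-up value
}

score = stop_gain_score(toy_llr, stop_pos=42)
print(score)  # -18.84
```

With this definition, a stop gain at position 42 can never score higher than D48W, which matches the expectation raised earlier in the thread.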

esiefker commented 9 months ago

Thanks, I wasn't aware the scale would be different. Is there a recommended threshold for the weighted PLLR?

nadavbra commented 9 months ago

We haven't studied the scale of the other scores. If you want to find out what a sensible threshold would be, you can look at our ClinVar indels benchmark (which can be downloaded as a CSV file from this GitHub repo) and compare the score distributions of pathogenic and benign mutations (that's basically what we did for missense LLR scores to figure out that -7.5 is a sensible threshold).
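One possible way to turn such a comparison into a cutoff is sketched below. It picks the score that maximizes balanced accuracy. This is an illustrative choice and not necessarily how the -7.5 LLR threshold was derived. The function name `best_threshold` is hypothetical, the scores and labels are toy data rather than values from the ClinVar benchmark, and the sketch assumes that lower scores indicate pathogenicity.

```python
def best_threshold(scores, is_pathogenic):
    """Return (cutoff, balanced_accuracy) for the rule: score <= cutoff is pathogenic."""
    n_path = sum(is_pathogenic)
    n_benign = len(is_pathogenic) - n_path
    if n_path == 0 or n_benign == 0:
        raise ValueError("need at least one pathogenic and one benign variant")

    best = None
    for cutoff in sorted(set(scores)):
        # True positives: pathogenic variants at or below the cutoff.
        tp = sum(1 for s, p in zip(scores, is_pathogenic) if p and s <= cutoff)
        # True negatives: benign variants above the cutoff.
        tn = sum(1 for s, p in zip(scores, is_pathogenic) if not p and s > cutoff)
        balanced_acc = 0.5 * (tp / n_path + tn / n_benign)
        if best is None or balanced_acc > best[1]:
            best = (cutoff, balanced_acc)
    return best


# Toy data only; real scores would come from the benchmark CSV.
toy_scores = [-3.0, -2.5, -1.8, -0.9, -0.4, -0.2, 0.1, 0.3]
toy_labels = [True, True, False, True, True, False, False, False]

cutoff, balanced_acc = best_threshold(toy_scores, toy_labels)
print(cutoff, balanced_acc)  # -0.4 0.875
```

Plotting the two score distributions side by side before settling on a cutoff is still worthwhile, since a single number hides how much the distributions overlap.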