The idea is to take a protein language model (PLM) and pre-train it on a large corpus of available protein sequences in a BERT fashion (mask random tokens and train the model to predict them). The logits predicted by the model at a given position for the wildtype (wt) and mutated (mt) tokens have been shown to be an effective predictor of pathogenicity:
Historically first: https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf
More extensive study: https://www.biorxiv.org/content/10.1101/2022.09.30.510294v3.full.pdf (new dataset: COSMIC + TCGA, new task: survival prediction)
Another extensive study: https://arxiv.org/pdf/2211.10000.pdf (new task: rescue mutations impact)
Extension of the baseline to any protein length: https://www.biorxiv.org/content/10.1101/2022.08.25.505311v1.full
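
A minimal sketch of how the wt and mt logits at one position could be turned into a single score. It assumes the score is the log-odds log p(mt) - log p(wt) at the masked position, which is one common choice and not necessarily the exact formula used in each of the studies above. The names `AMINO_ACIDS`, `log_softmax` and `mutation_score` are hypothetical, and the logits are toy values. In practice they would come from the PLM with the position of interest masked.

```python
import math

# Assumed 20-letter amino acid alphabet; a real PLM tokenizer has its own
# vocabulary and index order, so this mapping is illustrative only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mutation_score(position_logits, wt, mt, alphabet=AMINO_ACIDS):
    """Log-odds of the mutated vs. wildtype residue at one position.

    Negative values mean the model finds the mutant residue less likely
    than the wildtype one. The softmax normalizer cancels in the
    difference, so this equals the raw logit difference.
    """
    log_probs = log_softmax(position_logits)
    return log_probs[alphabet.index(mt)] - log_probs[alphabet.index(wt)]


# Toy logits for a single masked position (made-up numbers, not model output).
toy_logits = [0.0] * len(AMINO_ACIDS)
toy_logits[AMINO_ACIDS.index("L")] = 3.0   # wildtype residue is favored
toy_logits[AMINO_ACIDS.index("P")] = -1.0  # mutant residue is disfavored

score = mutation_score(toy_logits, wt="L", mt="P")
print(f"L->P score: {score:.2f}")  # about -4.0
```

A strongly negative score would be read as evidence that the substitution is damaging. Any threshold for calling a variant pathogenic has to be chosen on labeled data.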