This repository contains resources and command-line tools developed for the paper "Genome-wide prediction of disease variant effects with a deep protein language model" by Nadav Brandes, Grant Goldman, Charlotte H. Wang, Chun Jimmie Ye, and Vasilis Ntranos. A complete catalog of missense variant effect predictions is accessible here.
esm_variants_utils.ipynb
esm_variants_utils.py
esm_score_missense_mutations.py
esm_score_multi_residue_mutations.py
ClinVar_gnomAD_benchmark_with_predictions.csv
ClinVar_indel_benchmark_with_predictions.csv
ClinVar_stop_gains_benchmark_with_predictions.csv.gz
dms_assays.zip
Table_of_results.xlsx
These files contain all benchmark data, VEP predictions used for performance evaluation, and results, except HGMD variants (see below).
Most data used in this work is already within the public domain. Exceptions and other data sources are detailed in the paper and below:
The following dependencies are required:
pip3 install tqdm numpy pandas biopython torch fair-esm
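After installation, a quick importability check can catch a missing package before a long scoring run. The sketch below is illustrative only: the helper name `missing_modules` is hypothetical and not part of this repository. Note that two import names differ from their pip package names (`biopython` is imported as `Bio`, and `fair-esm` as `esm`).

```python
import importlib.util

# pip package name -> import name (biopython and fair-esm differ).
REQUIRED = {
    "tqdm": "tqdm",
    "numpy": "numpy",
    "pandas": "pandas",
    "biopython": "Bio",
    "torch": "torch",
    "fair-esm": "esm",
}

def missing_modules(required=REQUIRED):
    """Return the pip names whose import name cannot be located."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```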
Clone the repository:
git clone https://github.com/ntranoslab/esm-variants.git
cd esm-variants
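To score all possible missense mutations in the sequences of a FASTA file:
python3 esm_score_missense_mutations.py --input-fasta-file /path/to/input.fasta --output-csv-file /path/to/output.csv
To score multi-residue mutations listed in a CSV file:
python3 esm_score_multi_residue_mutations.py --input-csv-file /path/to/input.csv --output-csv-file /path/to/output.csv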
The input CSV file for multi-residue mutations should have three fields:
wt_seq: the wild-type (original) protein sequence
mut_seq: the mutated protein sequence
start_pos: the starting position (1-indexed) of the mutation relative to the wild-type sequence
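If you only have pairs of wild-type and mutated sequences, the three-column input can be assembled with a short script. The following is a minimal sketch, not part of this repository: the helper names `first_difference` and `write_input_csv` are hypothetical, and it assumes that `start_pos` can be taken as the first residue at which the two sequences differ. For changes inside repeated regions that position may not be the one you intend, so check it against your own alignment.

```python
import csv

def first_difference(wt_seq, mut_seq):
    """1-indexed position of the first residue where the sequences differ."""
    for i, (a, b) in enumerate(zip(wt_seq, mut_seq)):
        if a != b:
            return i + 1
    # One sequence is a prefix of the other: the change starts right after it.
    return min(len(wt_seq), len(mut_seq)) + 1

def write_input_csv(pairs, path):
    """Write (wt_seq, mut_seq) pairs with the wt_seq, mut_seq, start_pos columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wt_seq", "mut_seq", "start_pos"])
        for wt_seq, mut_seq in pairs:
            if wt_seq == mut_seq:
                raise ValueError("wt_seq and mut_seq are identical")
            writer.writerow([wt_seq, mut_seq, first_difference(wt_seq, mut_seq)])

if __name__ == "__main__":
    # Hypothetical output path; adjust to your own layout.
    write_input_csv(
        [("MARGTYNMGKHFDA", "MGTYNMGKHFDA")],
        "my_multi_residue_input.csv",
    )
```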
Assuming an example FASTA file named example.fasta:
>seq1
FISHWISHFQRCHIPSTHATARECRISP
>seq2
RAGEAGAINSTTHEMACHINE
You can calculate ESM scores for all possible missense mutations in these sequences:
python3 esm_score_missense_mutations.py --input-fasta-file example.fasta --output-csv-file esm_scores.csv
This will create a CSV file (esm_scores.csv) that starts like this:
seq_id,mut_name,esm_score
seq1,F1K,-3.2310808
seq1,F1R,-2.872289
seq1,F1H,-3.4361703
...
Each row represents a possible missense mutation and its ESM score.
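To make the row layout concrete, the sketch below enumerates mutation names for a sequence and splits a `mut_name` back into its parts. It rests on two assumptions drawn from the sample rows rather than from the script itself: that a name such as `F1K` means wild-type residue, 1-indexed position, new residue, and that "all possible missense mutations" means substitution to each of the other 19 standard amino acids. The function names are hypothetical, and the alphabetical ordering used here is not the ordering of the script's output.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def missense_names(seq):
    """Yield names such as 'F1K' for every single-residue substitution in seq."""
    for pos, wt in enumerate(seq, start=1):
        for alt in AMINO_ACIDS:
            if alt != wt:
                yield f"{wt}{pos}{alt}"

def parse_mut_name(mut_name):
    """Split 'F1K' into (wild-type residue, 1-indexed position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut_name)
    if match is None:
        raise ValueError(f"unrecognized mutation name: {mut_name!r}")
    wt, pos, alt = match.groups()
    return wt, int(pos), alt

seq1 = "FISHWISHFQRCHIPSTHATARECRISP"
names = list(missense_names(seq1))
print(len(names))              # 28 residues x 19 alternatives = 532
print(parse_mut_name("F1K"))   # ('F', 1, 'K')
```

Under these assumptions, seq1 would contribute 532 rows to esm_scores.csv and seq2 (21 residues) would contribute 399.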
Assuming the following example.csv:
wt_seq,mut_seq,start_pos
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14
MARGTYNMGKHFDA,MGTYNMGKHFDA,2
You can calculate ESM (PLLR) scores for the specified multi-residue mutations:
python3 esm_score_multi_residue_mutations.py --input-csv-file example.csv --output-csv-file esm_multi_residue_scores.csv
This will create a CSV file (esm_multi_residue_scores.csv) that starts like this:
wt_seq,mut_seq,start_pos,esm_score
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14,-1.0078125
MARGTYNMGKHFDA,MGTYNMGKHFDA,2,1.0056415
Each row represents a multi-residue mutation and its ESM (PLLR) score.
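When reading these results it can help to see which residues each row actually changes. The sketch below is one possible way to do that, not something the repository provides: the hypothetical helper `changed_segments` trims the prefix and suffix that the two sequences share and reports what remains. The two output rows are inlined so the example runs on its own.

```python
import csv
import io

def changed_segments(wt_seq, mut_seq):
    """Trim the shared prefix and suffix; return (start_pos, wt_part, mut_part)."""
    shortest = min(len(wt_seq), len(mut_seq))
    prefix = 0
    while prefix < shortest and wt_seq[prefix] == mut_seq[prefix]:
        prefix += 1
    suffix = 0
    # Stop before the suffix would overlap the shared prefix.
    while (suffix < shortest - prefix
           and wt_seq[len(wt_seq) - 1 - suffix] == mut_seq[len(mut_seq) - 1 - suffix]):
        suffix += 1
    return (prefix + 1,
            wt_seq[prefix:len(wt_seq) - suffix],
            mut_seq[prefix:len(mut_seq) - suffix])

# The two output rows from the example, inlined for a self-contained run.
OUTPUT = """wt_seq,mut_seq,start_pos,esm_score
FISHWISHFQRCHIPSTHATARECRISP,FISHWISHFQRCHEESETHATARECRISP,14,-1.0078125
MARGTYNMGKHFDA,MGTYNMGKHFDA,2,1.0056415
"""

rows = []
for row in csv.DictReader(io.StringIO(OUTPUT)):
    start, wt_part, mut_part = changed_segments(row["wt_seq"], row["mut_seq"])
    rows.append((start, wt_part, mut_part, float(row["esm_score"])))
    print(f"pos {start}: {wt_part or '-'} -> {mut_part or '-'}  score={row['esm_score']}")
```

For the two example rows, the derived start positions (14 and 2) agree with the `start_pos` column: the first row replaces IPS with EESE, and the second deletes AR.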