How to estimate the mutation rate in coding sequences/regions?

santiago1234 / mxb-genomes

Demographic Modeling of Admixed Latin American Populations from Whole Genomes

https://doi.org/10.1101/2023.03.06.531060

MIT License

How to estimate the mutation rate in coding sequences/regions? #9

Closed santiago1234 closed 2 years ago

santiago1234 commented 2 years ago

First, I need to obtain the coding genome (sequences) from GRCh38.

After that:

Each coding sequence is scanned to compute the mutation rate at each focal SNP.
A mutation affects exactly one codon, and its effect depends on where the focal site falls within that codon: there are three possible positions (frames).
There will be three mutation rates, one per consequence class:
- Synonymous: same amino acid
- Missense: amino acid change
- LOF (loss of function): introduces a stop codon
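A minimal sketch of this classification, assuming the standard genetic code and a codon already read on the coding strand. The function name `classify_change` is hypothetical and not taken from the repository. Under this three-class scheme, a stop codon that mutates into an amino acid codon falls into the missense class, which may or may not be the desired behavior.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon, pos, alt):
    """Classify a single-nucleotide change inside one codon.

    codon: three-letter reference codon on the coding strand (e.g. "GAA")
    pos:   0-based position of the focal site within the codon (0, 1 or 2)
    alt:   alternate base at that position
    """
    if alt == codon[pos]:
        raise ValueError("alt must differ from the reference base")
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[codon[:pos] + alt + codon[pos + 1:]]
    if alt_aa == ref_aa:
        return "synonymous"  # same amino acid (or stop -> stop)
    if alt_aa == "*":
        return "LOF"  # the change introduces a stop codon
    return "missense"  # a different amino acid


print(classify_change("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_change("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_change("GAA", 0, "T"))  # GAA (Glu) -> TAA (stop)
```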

[Screenshot attached: Screen Shot 2021-12-15 at 14 30 59]

For each focal SNP, we have three contributions to the mutation rates. For example, if all the possible mutations at that site are synonymous, the contribution of that focal SNP to the missense rate will be zero.
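The per-site bookkeeping can be sketched as follows. This is only an illustration: it assumes the standard genetic code and a uniform, hypothetical `rate_per_change` for every possible substitution, whereas the real analysis may use a different (for example context-dependent) mutation model. The function names `site_contributions` and `cds_contributions` are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CLASSES = ("synonymous", "missense", "LOF")


def site_contributions(codon, pos, rate_per_change=1.0):
    """Split the three possible changes at one focal site among the classes."""
    totals = {cls: 0.0 for cls in CLASSES}
    ref_aa = CODON_TABLE[codon]
    for alt in "ACGT":
        if alt == codon[pos]:
            continue  # not a mutation
        alt_aa = CODON_TABLE[codon[:pos] + alt + codon[pos + 1:]]
        if alt_aa == ref_aa:
            cls = "synonymous"
        elif alt_aa == "*":
            cls = "LOF"
        else:
            cls = "missense"
        # Assumption: every change has the same rate; swap in a real model here.
        totals[cls] += rate_per_change
    return totals


def cds_contributions(cds, rate_per_change=1.0):
    """Sum the per-site contributions over every complete codon of a CDS."""
    totals = {cls: 0.0 for cls in CLASSES}
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        for pos in range(3):
            site = site_contributions(codon, pos, rate_per_change)
            for cls in CLASSES:
                totals[cls] += site[cls]
    return totals


# Third position of CTT (Leu): CTA, CTC and CTG all encode Leu,
# so this site contributes nothing to the missense or LOF rates.
print(site_contributions("CTT", 2))
```

With a real mutation model, the only line that needs to change is the one that adds `rate_per_change`.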

santiago1234 commented 2 years ago

I talked with Aaron last Friday, and the logic for this is correct.

Aaron just told me to make sure I look at the correct DNA strand.
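A small sketch of what the strand check amounts to, assuming sequences and alleles are reported on the forward (reference) strand and each gene carries a "+" or "-" strand label. The helper names `coding_orientation` and `coding_allele` are hypothetical. For a gene on the minus strand, the sequence has to be reverse-complemented before codons are read, and the alternate allele has to be complemented.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def coding_orientation(forward_seq, strand):
    """Return the sequence as read 5'->3' on the gene's coding strand."""
    if strand == "+":
        return forward_seq
    if strand == "-":
        return reverse_complement(forward_seq)
    raise ValueError(f"unknown strand: {strand!r}")


def coding_allele(forward_allele, strand):
    """Express a forward-strand allele on the gene's coding strand."""
    if strand == "+":
        return forward_allele
    if strand == "-":
        return forward_allele.translate(_COMPLEMENT)
    raise ValueError(f"unknown strand: {strand!r}")


# A minus-strand gene whose forward-strand sequence is TTACAT
# reads as ATG TAA on its own coding strand.
print(coding_orientation("TTACAT", "-"))
```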

santiago1234 commented 2 years ago

Here is a hack: I can use VEP to annotate all coding SNPs and let it do the classification for me.

# Annotate the exonic SNPs with VEP, keeping only the fields needed to classify them
~/ensembl-vep/vep -i exon-snps.txt.gz --cache \
    --assembly GRCh38 --tab --output_file variants.txt.gz \
    --compress_output gzip --fields \
    "Uploaded_variation,Location,Allele,Gene,Feature_type,Consequence,Codons"

santiago1234 commented 2 years ago

NOTE: VEP repeats lines, so in the pipeline I should keep only the unique lines. For example:

grep 'missense' variant_effect_output.txt | sort -u
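The same deduplication could be done inside the pipeline in Python. The sketch below assumes tab-delimited VEP output with the columns in the order passed to `--fields` above, so that `Consequence` is column index 5. The function name `count_unique_consequences` is hypothetical. The repeated lines probably arise because VEP writes one line per variant and overlapping feature, and the selected fields do not include the feature ID, but that is an assumption worth checking against the actual output.

```python
from collections import Counter


def count_unique_consequences(lines, consequence_column=5):
    """Drop duplicated VEP rows, then count consequence terms.

    lines: iterable of text lines from a tab-delimited VEP output
    consequence_column: 0-based index of the Consequence field
                        (5 is an assumption based on the --fields order)
    """
    unique_rows = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, metadata and the header
        unique_rows.add(line)

    counts = Counter()
    for row in unique_rows:
        fields = row.split("\t")
        # One row can carry several comma-separated consequence terms.
        for term in fields[consequence_column].split(","):
            counts[term] += 1
    return counts


# Possible usage on the compressed output (file name taken from the command above):
# import gzip
# with gzip.open("variants.txt.gz", "rt") as handle:
#     print(count_unique_consequences(handle))
```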

santiago1234 commented 2 years ago

I will put this in a pipeline to see the results.