santiago1234 / mxb-genomes

Demographic Modeling of Admixed Latin American Populations from Whole Genomes
https://doi.org/10.1101/2023.03.06.531060
MIT License

How to estimate the mutation rate in coding sequences/regions? #9

Closed santiago1234 closed 2 years ago

santiago1234 commented 2 years ago

First, I need to obtain the coding genome (sequences) from GRCh38.

After that:

[Image: Screen Shot 2021-12-15 at 14.30.59]

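For each focal SNP, there are three possible mutations (one to each of the other three nucleotides), and each one contributes to the mutation rate of its own consequence class. For example, if all three mutations are synonymous, the contribution of that focal SNP to the missense rate is zero.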
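A minimal sketch of this bookkeeping, assuming the focal site is given as a codon (already in coding orientation) plus a position within it. The function name `site_contributions` and the `mut_rate` callable are hypothetical; the real analysis may use context-dependent rates rather than a simple per-substitution rate.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def site_contributions(codon, pos, mut_rate):
    """Split the mutation rate of one coding site by consequence class.

    codon    -- reference codon in coding orientation, e.g. "CTG"
    pos      -- 0-based position of the focal site within the codon
    mut_rate -- hypothetical callable (ref_base, alt_base) -> rate
    """
    rates = {"synonymous": 0.0, "missense": 0.0, "nonsense": 0.0}
    ref_base = codon[pos]
    ref_aa = CODON_TABLE[codon]
    for alt_base in "ACGT":
        if alt_base == ref_base:
            continue  # only the three real substitutions count
        mutant = codon[:pos] + alt_base + codon[pos + 1:]
        alt_aa = CODON_TABLE[mutant]
        if alt_aa == ref_aa:
            kind = "synonymous"
        elif alt_aa == "*":
            kind = "nonsense"
        else:
            # Simplification: a lost stop codon is also counted here.
            kind = "missense"
        rates[kind] += mut_rate(ref_base, alt_base)
    return rates


# With a rate of 1 per substitution the result is just a count.
# Third position of CTG (Leu): CTA, CTC and CTT all still encode Leu.
print(site_contributions("CTG", 2, lambda ref, alt: 1.0))
# {'synonymous': 3.0, 'missense': 0.0, 'nonsense': 0.0}
```

Summing these dictionaries over all coding sites would give the total rate per class under these assumptions.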

santiago1234 commented 2 years ago

I talked with Aaron last Friday and the logic for this is correct.

Aaron just told me to make sure I look at the correct DNA strand.
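One way to handle the strand issue, sketched under the assumption that sequences and alleles are read from the forward strand of the reference and that each gene carries a "+" or "-" strand label. The helper names are hypothetical.

```python
# Translation table for complementing DNA, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def coding_orientation(seq, strand):
    """Return seq as read on the coding strand of the gene.

    seq    -- bases as they appear on the reference forward strand
    strand -- "+" or "-" (gene orientation, assumed to be known)
    """
    if strand == "+":
        return seq
    if strand == "-":
        return reverse_complement(seq)
    raise ValueError(f"unexpected strand: {strand!r}")


# A gene on the minus strand: forward-strand "CAT" is the codon "ATG".
print(coding_orientation("CAT", "-"))  # ATG
```

The same function applies to single alleles: a C>T change on the forward strand is a G>A change in a minus-strand gene, so both the codon and the alleles need to be flipped before classifying the mutation.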

santiago1234 commented 2 years ago

Here is a hack: I can use VEP to annotate all coding SNPs and let it do the classification for me.

~/ensembl-vep/vep -i exon-snps.txt.gz --cache  \
    --assembly GRCh38 --tab --output_file variants.txt.gz \
    --compress_output gzip  --fields \
    "Uploaded_variation,Location,Allele,Gene,Feature_type,Consequence,Codons"

santiago1234 commented 2 years ago

NOTE: VEP repeats lines. In the pipeline, I should keep only the unique lines, for example:

grep 'missense' variant_effect_output.txt | sort | uniq
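Inside a Python pipeline step, the same filtering could look roughly like the sketch below. It assumes VEP tab output, where header lines start with "#", and it keeps the first occurrence of each matching line instead of sorting. The function name is hypothetical.

```python
def unique_consequence_lines(lines, consequence="missense"):
    """Yield each distinct line mentioning `consequence`, in input order.

    lines -- any iterable of text lines, e.g. an open file handle
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip VEP header lines and lines without the consequence term.
        if line.startswith("#") or consequence not in line:
            continue
        if line not in seen:
            seen.add(line)
            yield line


# Small in-memory example with made-up rows (not real VEP output).
example = [
    "#Uploaded_variation\tConsequence\n",
    "var1\tmissense_variant\n",
    "var1\tmissense_variant\n",
    "var2\tsynonymous_variant\n",
]
print(list(unique_consequence_lines(example)))
# ['var1\tmissense_variant']
```

For a gzip-compressed file, the iterable can be a handle from `gzip.open(path, "rt")`, which avoids decompressing the file to disk first.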

santiago1234 commented 2 years ago

I will put this in a pipeline to see the results.