hemstrow / snpR

GNU General Public License v3.0
21 stars 7 forks source link

Request for function to calculate dN/dS or Ka/Ks ratio from VCF file #64

Open Rohit-Satyam opened 2 months ago

Rohit-Satyam commented 2 months ago

Dear @hemstrow

I was wondering if it would be easy to write a function to calculate dN/dS ration using snpEff annotated VCF file. Currently, there is no functional tool or package available to do that and it has turned out to be a nightmare for me. I raised a biostar query here where I use a python script but I am not sure if it is doing the right think since it doesn't account for synonymous/non-synonymous sites.

hemstrow commented 2 months ago

That's probably doable, I think. I'd have to think about how that'd work with the existing facet system.

hemstrow commented 2 months ago

Hey @Rohit-Satyam ,

I'm fairly busy at the moment and haven't gotten to work on this, but I should have a bit of time in a few weeks. If you are interested in having things go faster, I'd be happy to integrate any code you might have which calculates dN/dS or Ka/Ks ratios given any specific data. Otherwise, I can let you know when something is available.

Rohit-Satyam commented 2 weeks ago

Hi @hemstrow

First of all, congratulations on publishing your work in Nature Reviews Genetics. I am so happy you did such an impactful work and I am sharing it actively with my peers. For dN/dS I am currently grepping the number of synonymous and non-synonymous word from Pf7K VCF files because my organism is haploid. I am very new to the population genomics field and since I was unsure if I was doing it correctly, I reached out for help.

This is how I am currently doing it:

## This script is used for retriving synonymous and non-synonymous variants and decompose multiallelic sites
mkdir atomize_vcf
ls -1 gene_vcfs/*.vcf | while read p
do
n=$(echo $p | xargs -n 1 basename | cut -f 1 -d'.')
bcftools norm --multiallelics -any --check-ref e --fasta-ref /data/pf7/Pfalciparum.genome.fasta --old-rec-tag OLD_CLUMPED --atomize $p | bcftools norm --rm-dup exact --output-type v -o atomize_vcf/${n}.vcf -
grep 'ANN=.*synonymous_variant' atomize_vcf/${n}.vcf | awk '{print $1,$2,$3,$4,$5,$8}' > atomize_vcf/${n}.syn.txt
grep -E 'missense_variant|nonsense_variant|stop_gained|stop_lost' atomize_vcf/${n}.vcf | awk '{print $1,$2,$3,$4,$5,$8}' > atomize_vcf/${n}.nonsyn.txt
vcftools --vcf $p --freq --out temp
awk '{print $1,$2,$3}' temp.frq > results/${n}.freq.txt
done