harvardinformatics / degenotate

MIT License
41 stars 4 forks source link

MK test question: what VCF file settings to use? (from 1000 genomes project variants) #53

Open PVlasov93 opened 1 month ago

PVlasov93 commented 1 month ago

I've been trying to use this program to run an MK test using VCF files from the 1000 genomes project. However, I'm getting some really strange results with this method, and I'm worried I may be doing something incorrectly. For example, I've been using a set of chimpanzee genome VCFs (from alignments to the human genome) as an outgroup, but the synonymous and nonsynonymous mutation rates have not been consistent with what I saw when running PAML on aligned FASTAs.

Could you clarify, what settings should a VCF file have in order to work with degenotate? Currently, I've been using the following commands with bcftools view: -m2 -M2 -i "AF>=0.05" ; is there anything else I need to change about the VCF files to make them compatible? All the variants that show up in my file are SNPs, but does this program work with allele pairs in the 1000 genomes VCF format? Or do I need to do any other changes for that?

tsackton commented 1 month ago

Your bcftools filter looks okay to me, and degenotate should support any properly formatted VCF. If you have a small example (perhaps a single gene) that you could share the inputs for that is giving you strange results with degenotate that would help a lot with debugging what might be going on.

PVlasov93 commented 3 weeks ago

Sorry for not getting back sooner. I've been working on another project and the shared computer where I was running this is unavailable right now. Could you clarify a few things for me about the type of vcf files that the degenotate MK test function can use?

tsackton commented 3 weeks ago

(1) For polymorphisms, by default we don't do any filtering on allele frequency, we just count the number of sites in the genome with any segregating variation. To calculate fixed differences, we do require that the site be fixed between the ingroup and the outgroup, so a polymorphic site (that is, segegrating in the outgroup species) will not be treated as a fixed difference.

(2) We don't have any kind of fasta output implemented in degenotate, unfortunately.

PVlasov93 commented 3 weeks ago

Could you clarify, what does it mean that something is a fixed site? While I can't send the files right now, I know that the most common issue I got was 0's in the pS and/or dS column, even though that makes no sense with the PAML analysis I've run before. (sorry if this is a basic question, I have basically no background in population genetics, a thesis committee member just asked me to investigate the sites I'm working on by MK test).

tsackton commented 3 weeks ago

The MK test uses a contrast between segregating sites (varying within a population) and fixed sites (consistently different between species).

So in your human/chimp data, a segregating site is a position where multiple alleles are present within humans, while a fixed site is a position where all humans share one allele that is different from all chimpanzees (e.g., humans are all A/A, and chimps are all T/T).

These are then separately tabulated for amino-acid-changing mutations (dN, pN), and non-amino-acid-changing positions (dS, pS).

PVlasov93 commented 3 weeks ago

Thank you for explaining that. Since you mentioned A/A vs T/T as an example, how would your program treat a site with A/T vs T/T instead? Would it help to limit my VCF file to only homozygous sites?

tsackton commented 3 weeks ago

A site that is segregating in humans (A/T) and present only T/T in chimps is, by definition, not a fixed difference. This would be considered a segregating site in humans.

To put it another way, a fixed difference is a site that is always going to be different between any human, and any chimp. So if you pick two random individuals it will be always be A/A vs T/T.

Filtering to keep only homozygous sites would likely make your input data meaningless, as you'd throw out a lot of your signal.

PVlasov93 commented 3 weeks ago

Thanks for clarifying that, I probably misunderstood the issue from earlier. As far as the outgroup is concerned (chimp genomes in this case), is it better to have only one outgroup or a set? The chimp VCF I'm using has around 20 individuals across different populations, while the version I've used in other tests came from one of the major published chimp genomes (the individual is included in the VCF set, if my colleagues are correct).

tsackton commented 3 weeks ago

Either should be fine. If you already have a vcf with 20 individuals, it makes sense to use that. However, you need to make sure that it is mapped to the same reference genome as the human data (e.g. hg19)