szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
109 stars 33 forks source link

iHS ancestral/derived and EHH cutoff #62

Closed gabyrech closed 3 years ago

gabyrech commented 3 years ago

Dear @szpiech,

First of all, thanks so much for developing and maintaining selscan!

I have two questions regarding the --ihs module:

1) How does selscan consider the "0" & "1" encoding in terms of ancestral/derived alleles?. In the manual, page 6, it says "...the ancestral (0) and derived (1) haplotypes...", but later (page 14) it states "...to write out left and right iHH scores for ancestral (1) and derived (0) in addition..." More importantly, what is the meaning of positive/negative iHS values? I found this thread which kind of addresses this question but, does it also apply to iHS? Original iHS introduced by Voight et al. (2006) was supposed to do it in the other way around (negative iHS scores indicate that a derived allele (1) has swept up in frequency), right?  

2) How  modifying the EHH cutoff (e.g. --cutoff 0.01) might affect the biological meaning of the results? I mean, I run iHS with default EHH cutoff (0.05) and also decreasing to 0.01. As expected, I found selscan was hable to estimate iHS values for many more SNPs with the 0.01 cutoff. However, after normalization, I found some SNP that were iHS > |2| when using 0.05, were not longer in the 0.01 run. Could you please help me to understand why this could happen?

Thanks so much for your time reading (and hopefully addressing) this! All the best, Gabriel

https://github.com/szpiech/selscan/issues/35#issuecomment-475255416

szpiech commented 3 years ago

Thanks for your kind words!

First, thanks for catching that typo in the manual! So, to be clear selscan doesn't even know the polarization of the alleles. It just (for iHS) computes the iHH for the "1" haplotype vs the iHH for the "0" haplotype. Polarizing your alleles helps with power a bit (for iHS and nSL) due to the normalization by frequency bin step, but it still performs very well with arbitrary coding (e.g., ref/alt). I did, unintentionally, reverse the ratio from Voight et al 2006 when I implemented it. So in general selscan's ihs scores are the negative of Voight et al's.

Regarding interpretation, if there are a lot of long and low diversity haplotypes (characteristics of an ongoing sweep) attached to the "1" allele at a locus then iHH1 will tend to be large. If iHH1 is large relative to to iHH0, then iHS will be positive. The same idea applies to long and low diversity haplotypes on "0" alleles, but iHS will be negative. If |iHS| is an extreme outlier, this is consistent with an ongoing sweep. |iHS| > 2 individually are not particularly interesting, as this is a relatively low bar to achieve even under neutrality. More extreme individual scores are less likely to be false positives, but Voight et al also showed that look if you for clusters of extreme scores >2, this can be a more powerful way to find putatively selected regions.

Regarding --cutoff, off the top of my head, I don't see why lowering this would produce more sites with scores (if anything I would expect fewer sites with scores unless --trunc-ok was set). In general, setting this lower will incorporate more of the signal into the integration step, although with diminishing returns. If you're finding sites with |iHS| ~2 falling below 2, they probably weren't very meaningful to begin with. If you are finding e.g. |iHS| > 7 suddenly dropping below 2, then something very strange is happening.

It occurs to me that you might be referring to --maf instead of --cutoff, as this too has a default of 0.05 and if you were to lower it to 0.01 you would certainly gain more sites with scores. When you lower the MAF cutoff you're incorporating more low frequency variants into haplotype identity calculations and potentially "swamping" the signal a bit, which probably accounts for the changes you're seeing. It is conceivable that modifying the --maf cutoff could slightly change the time-scale at which selection is best identified (in large sample sizes), but I haven't tested this at all.

Hope this helps!

gabyrech commented 3 years ago

Dear @szpiech, Thanks sooo much for your absolutely clear and detailed response! It helps a lot and clarifies most of my doubts!

About the --cutoff parameter, you are right, I did also include the --trunc-ok (sorry for not mention it before), so that is the reason I get more sites with scores. I will check whether |iHS| values falling below 2 show a big difference as you suggest. Thanks also for the --maf suggestion. I`ll give it a try as well.

Thanks again for developing and maintaining such a great piece of software! All the best, Gabriel