Ancestral / derived alleles and XP-scans

andreaswallberg commented 3 years ago

Dear @szpiech & Co!

Thanks for such a great tool! It is by far, the most practical and powerful one around for complementing regular FST/PBS divergence scans with haplotype evidence.

I have two questions:

1) In the absence of complete information on the ancestral / derived state to encode 0 / 1 variants for analyses, what approach do you recommend? Does it matter at all, when doing XP-EHH or XP-nSL if the 0/1 encoding is consistent across variants from different positions along a chromosome? I can think of at least four ways to encode the data:

Take reference as ancestral / 0 and alt as derived / 1
Take the global major allele as ancestral / 0 and the minor allele as derived / 1
Use any information one has on ancestral / derived states and only include SNPs with such evidence (may lead to drastic filtering of variants in the case of fragmented evidence, for example mostly over conserved exons across species).
Use any information on has (as above) and complement it with either the first or second approach (essentially a hybrid approach, with the drawback of making it less clear what evidence was used on a snp-by-snp basis)

2) Does it make sense to filter candidate outliers for local adaptation using both FST and XP-nSL independently (or use a composite score of both) to extract SNPs with the strongest evidence?

szpiech commented 3 years ago

Thanks for the kind words, and I'm glad you find the software useful.

As for 1., it turns out that ancestral state doesn't really matter at all for XP-nSL (this is not the case for nSL, where you have to normalize the results by allele frequency bins), so I just use ref/alt coding for convenience. Perhaps in your downstream analyses, it would be useful or interesting to know the ancestral state of mutations in a putatively selected region, but for the well-functioning of the statistic it does not matter.

Regarding 2., I'm certainly a fan of using multiple lines of evidence for identifying regions undergoing selection. I suppose the only thing to consider is that e.g. XP-nSL, Fst, and PBS are powered differently across the theoretical parameter space, so one stat might not show super strong evidence where the others do in certain cases. But in general, I think it is a nice approach to line up multiple lines of evidence, and you could even go as far as using something like CMS or SWIF(r) if you have a good idea what your demographic history is.

andreaswallberg commented 3 years ago

Many thanks!

A quick followup question, the "--max-extend-nsl" default value of 100 represents 100 SNPs with MAF > cutoff, right?

Sweeps can vary quite a bit in length in the same dataset. How should this default value be adjusted to fit both wide or narrow sweeps?

szpiech commented 3 years ago

A couple things:

nSL and iHS type stats should be filtered for MAF as they perform better this way (and selscan should filter those by default I think for these stats). I've tested via simulations various variant filtering schemes for two-pop stats (i.e., XP-EHH and XP-nSL), and they work best with all variants included.

--max-extend-nsl value represents 100 SNPs regardless of MAF. By default the nSL implementation filters MAF < cutoff, it is is effectively 100 SNPs with MAF > cutoff. For XP-nSL, it is just 100 SNPs. Incidentally, I believe this is actually 100 SNPs "to each side" of the core SNP (the one to which the statistic value is assigned), so it really represents 201 SNPs centered on the core SNP (100 upstream + 100 downstream + core).

But there is a subtle meaning to this flag, it's fundamental purpose is to increase runtime efficiency. It represents only the maximum haplotype identity value "possible" for a pair of haplotypes. A moderate sweep may not have pairwise haplotype identities that stretch as far as 100 SNPs flanking the core, so those will be counted "in full". But if you have a super strong sweep, one could conceivably encounter a pair that has identity at maybe 300 SNPs, but it would only be counted as 200 SNPs. It truncates the signal a bit, but since the computation effectively considers the mean of all pairs of haplotypes, you'd have to encounter an absolutely enormous selection coefficient to "max out" the sL component of the ratio. In practice this truncation doesn't have much of an effect, and even a maxed out sL is so unusual you're sure to notice it in your results.

That said, you do want to choose a value "high enough" that most haplotypes under neutrality are much smaller. I suspect 100 is sufficient for most purposes, but possibly not in an organism with very low recombination/mutation. You can always raise this value arbitrarily high, and you'll be unlikely to degrade runtime performance too much.

Hope this helps!

szpiech / selscan

Ancestral / derived alleles and XP-scans #60