szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
111 stars 33 forks source link

XPEHH P-hacking? #83

Open Niraj-Rayamajhi opened 2 years ago

Niraj-Rayamajhi commented 2 years ago

Hi szpiech,

I have a question.

To give you a context for a question, I would like provide some information.

There are two SNP-data sets derived from two species diverged more than 11 million years ago and these two species inhabit different environments such that there is no gene flow between them.

For XPEHH analysis, I used all the phased SNP data from RADseq loci without pruning for LD.

For XPEHH analysis, first I estimated the statistics for gaps (mean, median, and maximum gap) between RAD loci as well as between SNPs.

Then, while performing the analysis, I used a) median gap size (that I estimated for RAD loci) value to scale the gap, and b) gap size below which there are 99% of the gaps as a maximum gap size to tolerate

If I do the math using formula: XPEHH = ln(iHHpop1/iHHpop2) XPEHH will be equal to log(iHHpop1) - log(iHHpop2)

If large gaps are not accounted for, should not iHHpop1 and iHHpop2 inflate by the same amount, thus XPEHH remains same?

How does XPEHH is affected by large gaps? Is one of the values log(iHHpop1) or log(iHHpop2) is only affected with gaps?

After pondering more, I think large gaps can affect iHH of any one or both populations and thus can affects xpehh. This is because the distance between neighboring SNPs and core SNP can vary between populations. And, this affects the area size under the curve.

Then, I also thought does not XPEHH use the core SNPs and the neighboring SNPs that are at the same position in both populations?

Another question is: can lower values for scale-gap parameter and maxgap parameters can generate false positive signals that are not present while employing higher values for the parameters.

I am asking this because when I lowered the parameters (as opposite to higher values), I am getting significant q-values (after adjusted p values) and I am wondering it can be a situation of p-/q-hacking.

Also, I think there is no way to know whether the maxgap value is more than gaps near centromeres as some species chromosomes can be metacentric or acrocentric. By this I also mean there is no way to know about centromere positions for atleast non-model species. So, can't know which maxgap size is larger than gaps formed in centromeric regions. That is why I used gap size as maxgap size above which there is top 1% largest gap size.

Please let me know your opinion. It will be helpful!

Regards, Niraj

szpiech commented 2 years ago

Hi Niraj,

I don't really have a good intuition about how this might affect the distribution of scores, but I think it would be worth investigating via simulations so that you can come to a rigorous conclusion. You might also consider the XP-nSL version of the statistic, which, effectively, measures distance in number of snps instead of physical or genetic distance.

-Zachary

On Thu, May 5, 2022 at 1:41 PM Niraj-Rayamajhi @.***> wrote:

Hi szpiech,

I have a question.

To give you a context for a question, I would like provide some information.

There are two SNP-data sets derived from two species diverged more than 11 million years ago.

For XPEHH analysis, I used all the SNP data from RADseq loci without pruning for LD.

Then, I estimated the statistics for gaps (mean, median, and maximum gap) between RAD loci as well as between SNPs.

For XPEHH anaysis, I used a) median gap size (that I estimated for RAD loci) value to scale the gap, and b) gap size below which there are 99% of the gaps as maximum gap size.

If I do the math using formula: XPEHH = ln(pop1ies/pop2ies) where pop1ies is integrated EHHS for population 1 and pop2ies is integrated EHHS for population 2

XPEHH = log(pop1ies) - log(pop2ies)

If large gaps are not accounted for, should not pop1ies and pop2ies inflate by the same amount such that XPEHH remains same?

How does XPEHH is affected by large gaps? Is one of the values log(pop1ies) or log(pop2ies) is only affected with gaps?

I am asking this to know if using lowering values for scale-gap parameter and maxgap parameters can generate false positive. When I lowered the parameters, I am getting significant q-values (after adjusted p values) and I am wondering it can be a situation of p-/q-hacking.

Please let me know your opinion. It will be helpful!

Regards, Niraj

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/83, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQXP7KFCMFEOENOFXTDVIQB4NANCNFSM5VFY6U4A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Niraj-Rayamajhi commented 2 years ago

Hi Szpiech,

Thank you for the suggestion.

From output of XP-nSL's normalization analysis performed with norm module, I used values with the header normxpehh (should have been normxpnsl in the header though) to convert them into p-values using two-tailed test with pnorm function in R and then converted to p-values to q-values. I did not get any significant hits (q values < 0.05) across the genome.

Even in case of XP-nSL analysis, are the values affects by large gaps? I am asking this because if gaps does not affect XP-nSL values, then I should be using --max-gap value greater than the largest gap I see in my dataset and --gap-scale value as the average distance between loci.

Thank you for your feedback, Niraj

szpiech commented 2 years ago

Hello,

I think the best approach would be to look for clusters of extreme scores (either >2 or <-2) in a window, as this is more powerful than looking for single extreme values. I do this in Szpiech et al 2021 (doi:10.1002/evl3.232) if you're curious.

-Zachary

On Sun, May 15, 2022 at 2:23 AM Niraj-Rayamajhi @.***> wrote:

Hi Szpiech,

Thank you for the suggestion.

From output of XP-nSL's normalization analysis performed with norm module, I used values from the header normxpehh (should have been normxpnsl in the header though) to convert them into p-values using two-tailed test with pnorm function in R and then converted to p-values to q-values. I did not get any significant hits (q values < 0.05) across the genome.

Even in case of XP-nSL analysis, are the values affects by large gaps? I am asking this because if gaps does not affect XP-nSL values, then I should be using --max-gap value greater than the largest gap I see in my dataset and --gap-scale value as the average distance between loci.

Thank you for your feedback, Niraj

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/83#issuecomment-1126869755, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQR7YLMH6PEBFVOFJWTVKCJ4LANCNFSM5VFY6U4A . You are receiving this because you commented.Message ID: @.***>