szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

Interpretation of normalized XP-EHH #99

Open catferna opened 1 year ago

catferna commented 1 year ago

Dear @szpiech, thank you for developing this tool and making our lives easier. I have a question about what seem to be extremely high XP-EHH scores. I ran this analysis on array data of ~500,000 SNPs from two human populations. As can be seen in the plots (links below), there are "too many" peaks and the values seem "too extreme". I have never seen anything like this, at least not in other articles comparing human populations. I have checked all the steps of the analysis several times, but I wonder whether I am missing something, for instance while cleaning the data. Do you have any insight into what process or error could be generating this pattern in the XP-EHH scores? This is the first time I am running this analysis.

By position: [https://www.dropbox.com/s/atdvmshpc558efn/xp-ehhh%20norm.png?dl=0]

By 200 kb windows: [https://www.dropbox.com/s/yw4qrf2w728lvhf/xp-ehh_window.png?dl=0]

Thank you very much! Catalina.

szpiech commented 1 year ago

Hi Catalina,

So, I've only really seen this sort of pattern when comparing two samples from the same population. If the raw xpehh statistic has very low variance, the normalization step could pop out very extreme scores. You might check how much variance there was in the raw statistic (this would be reported by norm in the normalization log file) just to see if these extreme scores are the result of dividing by a small variance.
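To make the variance check above concrete, here is a minimal sketch of a plain mean/standard-deviation standardization applied to raw scores. It assumes the normalization is a simple genome-wide z-score; the function name `normalize_scores` and the toy values are hypothetical and are not taken from selscan or norm.

```python
import statistics


def normalize_scores(raw_scores):
    """Standardize raw scores to z-scores (hypothetical helper, not norm's code).

    Returns the mean and variance of the raw scores along with the z-scores,
    so the spread of the raw statistic can be inspected directly.
    """
    mean = statistics.fmean(raw_scores)
    variance = statistics.pvariance(raw_scores, mu=mean)
    if variance == 0:
        raise ValueError("raw scores have zero variance; cannot normalize")
    std = variance ** 0.5
    normalized = [(x - mean) / std for x in raw_scores]
    return mean, variance, normalized


# Toy raw values (made up): most sites sit near zero, one site is an outlier.
raw = [0.0, 0.1, -0.1, 0.05, -0.05, 1.0]
mean, variance, z = normalize_scores(raw)
print(f"raw mean={mean:.3f}, raw variance={variance:.3f}")
print("largest normalized score:", round(max(z), 2))
```

When the raw variance is small, even a modest raw deviation from the mean is divided by a small standard deviation, which is one way very large normalized scores can arise.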

-Zachary

catferna commented 11 months ago

Hi Zachary, thank you so much for your reply! I would never have guessed this was the problem, and nobody else I talked to had noticed it. The comparison I was (am!) trying to make is between individuals from two distinct indigenous populations in South America, and according to the log file the variance is indeed quite low (0.094). I reran XP-EHH comparing these samples against Chinese individuals (CHB; 1000 Genomes), and the variance is now higher (0.157), but I still get "too many" and "too extreme" values, so the results are hard to interpret.

In your experience, would you interpret the first result (very low variance between two populations) as an indication that this test may not be suitable for detecting selection in populations that probably split not too long ago? If so, is there another tool you would recommend?

My second question is which data filters (MAF or others) could be added or removed for the XP-EHH analysis itself in order to capture the 'true' estimate of this statistic. Or would it make sense to change some of the default parameters of the normalization step? I would really appreciate any insight in this regard. Thanks a lot!

szpiech commented 11 months ago

Hi Catalina,

Hmm, well if these are two populations that actually cluster separately (e.g. on PCA or STRUCTURE analysis), then I might not necessarily expect these strange results. On the other hand I had someone report similar patterns (actually somewhat more extreme) when comparing two sets of data that were actually quite far diverged, which, given your description, also doesn't sound like your situation.

So, given these apparently inflated scores, I think you may want to adjust your critical value for the windowing analysis. I think if you looked at the empirical distribution of normalized scores and picked the +/-Z that contains 95% of the mass, this might be a better choice. You can also analyze both populations with respect to CHB and examine the overlap/differences. I wonder if these populations have fairly small effective population sizes and if this might affect the statistic at all. If you have a guess at their joint demographic history, you could try simulating and testing the statistic.
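One way to read the suggestion of picking the +/-Z that contains 95% of the mass is sketched below. This is an illustration under assumptions, not part of selscan: the function name `empirical_critical_value` and the toy scores are hypothetical, and the symmetric cutoff is taken as the smallest observed |score| that covers at least the requested fraction of sites.

```python
import math


def empirical_critical_value(normalized_scores, mass=0.95):
    """Return Z such that at least `mass` of the scores satisfy |score| <= Z.

    Hypothetical helper: Z is chosen from the observed absolute scores.
    """
    abs_sorted = sorted(abs(s) for s in normalized_scores)
    if not abs_sorted:
        raise ValueError("no scores given")
    n = len(abs_sorted)
    # Smallest index k with (k + 1) / n >= mass, clamped to a valid position.
    k = min(max(math.ceil(mass * n) - 1, 0), n - 1)
    return abs_sorted[k]


# Toy normalized scores (made up): mostly small values plus a few extremes.
scores = [-6.0, -1.2, -0.8, -0.3, 0.0, 0.1, 0.4, 0.9, 1.5, 7.5]
z_crit = empirical_critical_value(scores, mass=0.8)
extreme = [s for s in scores if abs(s) > z_crit]
print("critical value:", z_crit)
print("sites beyond +/-Z:", extreme)
```

The resulting Z could then replace a fixed cutoff when deciding which sites count as extreme in each window.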

I assume you've filtered close relatives?

Zachary
