Normalize XPEHH with overlap window

biozzq commented 8 years ago

Hi, I have calculated the unstandardized XPEHH for each sites, I want to normalize the XPEHH value with overlapped windows, but the norm can just do non-overlap normalize. How can I do this with overlap windows?

best

szpiech commented 8 years ago

I think you may be confusing two different calculations that norm can do. The first is the normalization of scores, which does not involve windows. The second is the search for clusters of extreme scores, which is implemented in norm by analyzing non-overlapping windows. Unfortunately, if you want a sliding window for the second case, you'll have to either modify the code or write a separate script to analyze the normalized scores. I'll make a note that this is a desired feature for norm, but I'm not sure when it will get implemented.

biozzq commented 8 years ago

Thanks, I think it's difficult for me to modify the code. Can I just calculate the average of the normxpehh reported by the normalization of scores for each window?

best

szpiech commented 8 years ago

Surely you could do that, but I don't know if it is appropriate for the analysis you have in mind.

biozzq commented 8 years ago

I just want to detect selection across genome. Before this, I have calculated fst and pi using overlapped sliding windows, so I want to have a consistent window size with fst and pi which will be convenient to compare among them. Do you think it is appropriate for me?

Best

szpiech commented 8 years ago

Well, I haven't explored how well the mean XPEHH score performs as a way to identify putatively selected regions of the genome. A more frequent way is to search for clusters of extreme scores, as implemented in norm (unfortunately, for the time being, this uses non-overlapping windows only). If you choose to compute mean scores within a window, you should compare windows with similar densities of scores or define your windows with respect to a fixed number of scores.

biozzq commented 8 years ago

Hi Thanks for your suggestions. I will use the approach to search for clusters of extreme scores with non-overlapping window using norm. But after running the command norm --xpehh --files *xpehh.out --bp-win --winsize 50000, I can not interpret the results about the last column and second last column. Does the number 1 and 5 in the last column mean significance level and represent undergoing selection in any of population? Because I don't find the minus scores in the second last column.

The second question is that how to find the core SNPs which lead to the difference between two populations in LD pattern. I can't see apparent difference in the same region between two populations, while the last two columns are both "1". I believe that you can help me to understand the results after normalization by non-overlapping window such as which style graph I should use to detail the regions being selected.

Thanks

szpiech commented 8 years ago

Sorry about this, the output file generated from the --bp-win option is formatted as

<window start position> <window end position> <number of SNPs with iHS scores in win> <fraction of scores |iHS|>2> <approximate percentile of window>

So, the windows with a final column <= 1 are in the top 1 percent genome wide, and could be treated as putative regions under selection. I believe windows with no SNPs have a reported percentile of -1, you can treat those as missing data. The strongest evidence for selection will be found in regions with clusters of these windows.

It is a bit difficult to localize the exact core SNP that is causative (if it exists at all), but often one proposes the position with the most extreme score in the window. Once the windows putatively under selection are identified, I would extract the xpehh scores from each region and plot them as a function of physical position, identify the most extreme scores, and then examine the underlying haplotypes. If your organism is in the UCSC Genome Browser, it's worth examining what annotations in the vicinity of these windows.

Hope that helps!

ColyFu commented 7 years ago

Dear Szpiech,

I have normed my iHS result with parameters --ihs --files Pp0*.out --bp-win --winsize 20000. But I found that even a smaller fraction of extreme ihs score has a higher percentile: As you see, the output in the screen shot below was sorted by fraction of scores |iHS|>2.

Could you please tell me how selscan calculate the percentile, maybe I have missed something. Thanks!

Fu Jun

szpiech commented 7 years ago

When doing the --bp-win analysis, the windows are first binned into quantiles based on the number of SNPs within the window, and then the percentiles are determined within each quantile bin. If we don't do this, we would likely bias "significant" windows towards those with fewer SNPs. I believe the current default number of quantiles is set to 20, but you can change the number with --qbins flag. I frequently use a value of 10 when my windows are 100kbp. The percentile cutoffs per bin are output to the logfile for your inspection and would look something like this

nSNPs 1.0 5.0
85 0.641842 0.329203
115 0.578672 0.287208
138 0.5227 0.270483
158 0.493293 0.227586
177 0.426161 0.194228
197 0.401928 0.19898
218 0.388357 0.202249
245 0.374065 0.185765
286 0.344427 0.178946
1700 0.338572 0.16341

In this case, the first column gives the maximum number of SNPs for a window to be included in a given quantile, e.g. the second quantile bin contains all windows that contain between (85,115] SNPs. The second quantile gives the fraction cutoff for the approximate 1% percentile and the third column gives the approximate 5% percentile cutoff.

Hope this helps!

ColyFu commented 7 years ago

Thanks for your clear explanation. I found the 5%(or 1%) percentile outlier windows are equal assigned to each quantile bin. I don't know if it‘s reasonable for giving an equal weight for each quantile bin. To my knowledge, if a region have beed selected, it should have a lower diversity, in other words less SNPs?

szpiech commented 7 years ago

Sure, you would expect fewer segregating sites after a sweep, although iHS has best power when the sweep is still incomplete. With genotyping array data, I would worry about the non-uniform distribution of sites, and with WGS this might be less of an issue, although I would still consider the problem carefully.

If you don't want to bin the windows by number of SNPs, you can use --qbins 1.

ColyFu commented 7 years ago

Thank you so much. It's useful for me, I'll try it.

szpiech / selscan

Normalize XPEHH with overlap window #23