szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

How to understand the norm ihh12 output #56

Closed willright28 closed 3 years ago

willright28 commented 3 years ago

Hi @szpiech: I'm new to your very useful software selscan; it's great. I have some questions about the results of ihh12 normalization. I ran 'selscan' and 'norm' per scaffold, and the 'norm' output has 4 columns: "win_start", "win_end", "min-max normalization normalized-ihh12", and "top percentile", right? Because the values in the 3rd column are between -1 and 1, which makes the scatter plot look weird, I'm wondering whether I could get an "unnormalized" normalized-ihh12 using 'norm', or whether there is another way to do it. Second question: I counted the number of "TOP1" and "TOP5" windows in my results, but the former is many times larger than the latter. How should I understand this? Please correct me if I'm wrong, and thanks in advance for your time and help!

szpiech commented 3 years ago

Hello. First, I recommend normalizing all scaffolds jointly. I'm not entirely sure what you mean by an unnormalized normalized ihh12, though. I seem to recall that ihh12 is correlated with allele frequency at the test SNP, which is why normalization is a good idea.
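
As a toy illustration of why joint normalization matters, here is a minimal Python sketch. It uses plain z-scores as a stand-in, which is an assumption for illustration and not a description of what 'norm' does internally; the scaffold names and score values are made up.

```python
from statistics import mean, pstdev

def zscores(values, mu, sigma):
    """Standardize values with a given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical unnormalized scores on two scaffolds (made-up numbers).
scaffolds = {
    "scaffold_1": [0.1, 0.2, 0.3, 0.4],
    "scaffold_2": [1.0, 2.0, 3.0, 10.0],
}

# Per-scaffold: each scaffold is standardized against its own mean and sd,
# so every scaffold produces some "high" scores regardless of its scale.
per_scaffold = {
    name: zscores(vals, mean(vals), pstdev(vals))
    for name, vals in scaffolds.items()
}

# Joint: all scaffolds share one mean and sd, so scores stay comparable
# across scaffolds.
pooled = [v for vals in scaffolds.values() for v in vals]
mu, sigma = mean(pooled), pstdev(pooled)
joint = {name: zscores(vals, mu, sigma) for name, vals in scaffolds.items()}

print(max(per_scaffold["scaffold_1"]))  # above 1 when normalized alone
print(max(joint["scaffold_1"]))         # below 0 when normalized jointly
```

In this toy case the largest score on scaffold_1 looks like an outlier when the scaffold is normalized alone, but it falls below the genome-wide mean once both scaffolds are pooled.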

Regarding the TOP1 and TOP5 results, the thresholds are computed for windows binned by number of SNPs, and the TOP1 and TOP5 windows are estimated based on the quantiles within each bin. When you have very few windows in a bin, the total numbers can look a little wacky (e.g. in a pathological case of all 0s, the top 1% quantile score will be 0 and all windows will be called as TOP1). If you normalize all scaffolds together, this will help with that problem, and you might also consider changing the number of bins with the --qbins flag. Hope this helps!
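
The pathological case described above can be reproduced with a small Python sketch. This is not the 'norm' source code: the function names, the bin labels, the rank-based quantile, and the use of `>=` when comparing a score to its bin threshold are all assumptions chosen to match the behavior described in this reply.

```python
from collections import defaultdict

def quantile_threshold(scores, top_percent):
    """Score of the lowest-ranked window still inside the top `top_percent`%."""
    ordered = sorted(scores)
    # Always keep at least one window in the "top" set (assumption).
    n_top = max(1, (len(ordered) * top_percent) // 100)
    return ordered[-n_top]

def call_top_windows(windows, top_percent=1):
    """windows: list of (bin_id, score). Returns one boolean flag per window."""
    by_bin = defaultdict(list)
    for bin_id, score in windows:
        by_bin[bin_id].append(score)
    thresholds = {
        bin_id: quantile_threshold(scores, top_percent)
        for bin_id, scores in by_bin.items()
    }
    # A window is flagged when its score reaches its own bin's threshold.
    return [score >= thresholds[bin_id] for bin_id, score in windows]

# A well-populated bin: 200 windows with distinct scores.
healthy = [("bin_a", i / 200) for i in range(200)]
# A sparse, degenerate bin: 5 windows that all score 0.
degenerate = [("bin_b", 0.0) for _ in range(5)]

flags = call_top_windows(healthy + degenerate, top_percent=1)
n_healthy = sum(flags[:200])
n_degenerate = sum(flags[200:])
print(n_healthy)     # 2
print(n_degenerate)  # 5
```

In the well-populated bin only 2 of 200 windows are flagged, while in the all-zero bin the threshold is 0 and every window is flagged. Pooling scaffolds puts more windows in each bin, which makes this kind of degenerate threshold less likely.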