szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
109 stars 33 forks source link

The question about 'selscan' and 'norm' result #36

Closed Wangzezhao closed 5 years ago

Wangzezhao commented 5 years ago

Dear Prof. Szpiech ZA,

My name is Wang Zezhao, a graduate student in Institute of animal sciences of Chinese academy of agricultural sciences, Beijing China. I used ‘selscan’ software to calculate the iHS value across the cattle genome for selection signal.

We obtained a total 478,903 SNPs with estimated iHS scores (selscan --ihs --hap chr1.hap --map chr1.map --out chr1_result). All of SNP sites were normalized using ‘norm’software and then were used for the identification of candidate regions (norm --ihs --files chr1_result.ihs.out --bp-win).

In the output file from ‘norm’, 24,820 regions were detected (Attachment 1). We found the last column of the output file contains four indicating variables (-1, 1, 5, 100). Could you please tell me how to define indicator variables (1, 5, 100)? Indicator variable ‘1’, does it represents the threshold of top 1%?

In our analysis, only 24,820 non-overlapping windows (100 kb) were examined, but based on the indicator variable ‘1’, we totally identified 710 regions. So we feel confused the results, does the 710 of those were in the top 1%? top 1% would be 248 windows according the total of 24,820 non-overlapping windows, so I'm not sure what the top 1% criteria means.

Could you please help me to explain this, or do we need any other concern for our results.

We are very appreciate any help from you.

Thank you for your time, and I look forward to hearing from you.

Best wishes !

Zezhao Wang
result.ihs.100bins.norm.100kb.merge.windows.txt

szpiech commented 5 years ago

Hello,

Thanks for your message. You are correct that the (1,5,100) column is supposed to represent the percentile of the window for highest fraction of extreme scores (-1 is there to indicate there were too few snps in the window to consider). You are likely seeing way more than 1% for a few reasons. I notice there are a lot of ties in the fraction column. If the 1% threshold is at a tie you could easily get many windows marked as in the 1% tail. Second, the windows are binned by number of snps and the percentiles are computed within each bin, so the above problem can happen multiple times. Although I have to admit I've never seen such a large proportion marked as 1.

To correct this problem, I think you can do two things. First it seems you may have relatively low density snps along your genome, so I would recommend making your window bigger, perhaps 200kb to start (using --winsize). I would also suggest lowing the number of bins used for the quantile estimation, perhaps down to 10 from the default of 20, using --qbins.

Hope this helps,

Zachary

Wangzezhao commented 5 years ago

Dear Prof., I would like to express my thanks for your help. Your answer is very helpful in extending my understanding of the output and I will try the way you mentioned. Again, I really appreciate your kindness. Thank you, Zezhao

On 5/6/2019 21:50,Zachary A Szpiechnotifications@github.com wrote:

Hello,

Thanks for your message. You are correct that the (1,5,100) column is supposed to represent the percentile of the window for highest fraction of extreme scores (-1 is there to indicate there were too few snps in the window to consider). You are likely seeing way more than 1% for a few reasons. I notice there are a lot of ties in the fraction column. If the 1% threshold is at a tie you could easily get many windows marked as in the 1% tail. Second, the windows are binned by number of snps and the percentiles are computed within each bin, so the above problem can happen multiple times. Although I have to admit I've never seen such a large proportion marked as 1.

To correct this problem, I think you can do two things. First it seems you may have relatively low density snps along your genome, so I would recommend making your window bigger, perhaps 200kb to start (using --winsize). I would also suggest lowing the number of bins used for the quantile estimation, perhaps down to 10 from the default of 20, using --qbins.

Hope this helps,

Zachary

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.