szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

Some questions about Selscan #34

Closed Yun-HongWu closed 5 years ago

Yun-HongWu commented 5 years ago

Dear Szpiech,

Recently, I have been running a genome-wide scan for positive selection on my data sets with your software "Selscan". In my analysis, I used --nSL to identify candidate selected regions (the input files are phased VCF files), and following your manual I obtained the output files (selscan + norm pipeline). However, I have some questions and would appreciate your help.

First, when I use the command "norm --nsl --files <file1..out> ... <fileN..out>" to normalize selscan output across frequency bins, the output files generated by norm have no header describing each column. For example:

. 180387 0.0571429 100.917 25.766 1.36524 1.54632 0
. 181142 0.385714 28.9231 49.4042 -0.535396 -0.191878 0
. 182751 0.228571 55.5 33.1852 0.514279 1.15106 0
. 183083 0.385714 29.5385 49.5183 -0.516649 -0.16627 0
. 184456 0.371429 30.8031 47.1786 -0.426327 -0.048742 0

What does each column mean, and in particular the last column (for example, the value "0")?

Second, if I use the command "norm --nsl --files <file1..out>...<fileN..out> --bp-win --winsize 20000" to normalize my selscan results, the normalized files look like this:

1 20001 0 -1 -1
20001 40001 0 -1 -1
40001 60001 0 -1 -1
60001 80001 0 -1 -1
80001 100001 0 -1 -1
100001 120001 0 -1 -1
120001 140001 0 -1 -1
140001 160001 0 -1 -1
160001 180001 0 -1 -1
180001 200001 19 0 100
200001 220001 8 0 -1
220001 240001 89 0 100
240001 260001 22 0.0454545 100
260001 280001 30 0 100

I know the first two columns are the start and end positions of each window, but what do the remaining three columns mean?

My last question is how to identify candidate regions of positive selection from the Selscan output. For example, if I use the normalized output generated by norm (without the --bp-win parameter), should I simply pick as candidates the loci whose normalized values fall in the extreme 1% or 99% tails? And if I use the normalized results generated by norm with --bp-win and --winsize, what is the best way to pick candidate loci/regions from the genome-wide windowed values?

Looking forward to your reply! Thank you in advance.

szpiech commented 5 years ago

For the first output you give, the format is <snp id> <position> <allele frequency of '1' allele> <SL1> <SL0> <raw nSL> <normalized nSL> <is nSL greater than critical value? 1 yes; 0 no>
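To make the column order concrete, here is a minimal Python sketch that attaches names to the eight whitespace-separated fields of one per-SNP line. The names `NslRecord` and `parse_norm_line` are hypothetical and not part of selscan; the sample rows are copied from the question above.

```python
from typing import NamedTuple


class NslRecord(NamedTuple):
    """One line of per-SNP norm output (field names are illustrative)."""
    snp_id: str
    position: int
    freq_1: float      # allele frequency of the '1' allele
    sl1: float
    sl0: float
    raw_nsl: float
    norm_nsl: float
    is_extreme: bool   # True if the last column is 1


def parse_norm_line(line: str) -> NslRecord:
    """Split one whitespace-separated line into named fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return NslRecord(
        snp_id=fields[0],
        position=int(fields[1]),
        freq_1=float(fields[2]),
        sl1=float(fields[3]),
        sl0=float(fields[4]),
        raw_nsl=float(fields[5]),
        norm_nsl=float(fields[6]),
        is_extreme=(fields[7] == "1"),
    )


# Two rows taken from the output posted in the question.
sample = """\
. 180387 0.0571429 100.917 25.766 1.36524 1.54632 0
. 181142 0.385714 28.9231 49.4042 -0.535396 -0.191878 0
"""
records = [parse_norm_line(l) for l in sample.splitlines() if l.strip()]
for rec in records:
    print(rec.position, rec.norm_nsl, rec.is_extreme)
```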

For the second output, the format is <start window> <end window> <nScores in window> <fraction greater than critical value> <percentile relative to other windows in nScores bin>
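The windowed format can be read the same way. The sketch below is only an illustration: `Window` and `parse_window_line` are hypothetical names, and treating -1 as "no value" is an assumption based on the sample rows in the question (windows with zero or very few scores show -1), not something stated in this thread.

```python
from typing import NamedTuple, Optional


class Window(NamedTuple):
    """One line of windowed norm output (field names are illustrative)."""
    start: int
    end: int
    n_scores: int
    frac_extreme: Optional[float]  # fraction of scores above the critical value
    percentile: Optional[float]    # percentile relative to windows in the same nScores bin


def _optional(value: str) -> Optional[float]:
    # Assumption: -1 is a placeholder meaning "not computed for this window".
    number = float(value)
    return None if number == -1 else number


def parse_window_line(line: str) -> Window:
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return Window(
        start=int(fields[0]),
        end=int(fields[1]),
        n_scores=int(fields[2]),
        frac_extreme=_optional(fields[3]),
        percentile=_optional(fields[4]),
    )


# Rows taken from the output posted in the question.
empty = parse_window_line("1 20001 0 -1 -1")
sparse = parse_window_line("200001 220001 8 0 -1")
scored = parse_window_line("240001 260001 22 0.0454545 100")
print(empty, sparse, scored, sep="\n")
```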

You would typically take the windows in the top 0.1 or 1 percent as putative regions under ongoing positive selection.
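As a rough sketch of that filtering step, the Python below keeps windows whose last column is at or below a cutoff. It rests on two assumptions that this thread does not confirm: that smaller values in the percentile column mean more extreme windows (for example, 1 for the top 1 percent), and that -1 marks windows with no percentile. The function name `candidate_windows` is hypothetical, and the last sample row is invented purely to show a hit, since none of the rows posted above would pass the filter.

```python
from typing import Iterable, List, Tuple


def candidate_windows(
    lines: Iterable[str], max_percentile: float = 1.0
) -> List[Tuple[int, int, float]]:
    """Return (start, end, percentile) for windows at or below the cutoff.

    Assumes a smaller percentile value means a more extreme window and
    that negative values (-1) mean no percentile was assigned.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        start, end = int(fields[0]), int(fields[1])
        percentile = float(fields[4])
        if percentile < 0:
            continue  # assumed placeholder for windows without a percentile
        if percentile <= max_percentile:
            hits.append((start, end, percentile))
    return hits


sample = [
    "160001 180001 0 -1 -1",           # from the question
    "180001 200001 19 0 100",          # from the question
    "240001 260001 22 0.0454545 100",  # from the question
    "300001 320001 25 0.4 1",          # hypothetical row, for illustration only
]
top = candidate_windows(sample, max_percentile=1.0)
print(top)
```

Passing `max_percentile=0.1` would apply the stricter of the two cutoffs mentioned above, under the same assumptions.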

Hope this helps.