selective sweep tests in inbred lines

fersall commented 1 year ago

Hello, I am doing a whole genome scan for signals of positive selection in two populations of inbred lines. I have performed the xp-ehh and xp-nsl tests. The jobs did finish without any error except these two types of warning messages in the log file.

WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at position 5666 id: .

WARNING: Reached a gap of 516027bp > 200000bp. Skipping calculation at position 1680607 id: .

Since these samples are inbred lines the long haplotypes distances may be very large. Currently I am using default values for --cutoff and for --max-gap. So, my question is whether should I increase the value for those parameters in order to solve that issue?

Thanks

szpiech commented 1 year ago

Hello,

You could certainly increase both of these parameters, and you should encounter fewer of these warnings. I usually hesitate to increase the --max-gap parameter unless you have generally sparse data. Perhaps this is the case for you, since you mention you're using highly inbred lines I imagine there are fewer variants called (and they are probably farther apart).

You could theoretically also use --trunc-ok, which would have the effect of integrating the scores at the sites that otherwise encounter those warnings, but I typically don't recommend this option unless you are evaluating short simulated regions or you absolutely need a score at an otherwise filtered site.

Hope this helps,

Zachary

On Mon, Oct 30, 2023 at 12:47 PM fersall @.***> wrote:

Hello, I am using doing a whole genome scan for signals of positive selection in two populations of inbred lines. I have performed the xp-ehh and xp-nsl tests. The jobs did finish without any error except these two types of warning messages in the log file.

WARNING: Reached chromosome edge before EHH decayed below 0.05. Skipping calculation at position 5666 id: .

WARNING: Reached a gap of 516027bp > 200000bp. Skipping calculation at position 1680607 id: .

Since these samples are inbred lines the long haplotypes distances may be very large. Currently I am using default values for --cutoff and for --max-gap. So, my question is whether should I increase the value for those parameters in order to solve that issue?

Thanks

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/105, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQRB3UXPBI4XPUV5Y3LYB7K3FAVCNFSM6AAAAAA6WKXI56VHI2DSMVQWIX3LMV43ASLTON2WKOZRHE3DQOBQHAYTAOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

fersall commented 1 year ago

Thanks a lot for your comments Zachary! I did increase the --cutoff to 0.10 and certainly I reduced the number of those warnings from 432 to 431. I did not change --max-gap parameter since my genome wide SNP data set is quite dense, ~800K filtered by position to keep a minimal distance of 1K between each marker. Also, my XPEHH values range from -0.67 to 0.50 across the genome (please see my attached plot) fem_allChr_xp-ehh_4r2_ed . In the literature it is considered as signature of positive selection those values higher than 2. So, this short range (of xpehh values). Is it normal? Or could it be due to lack of power of the test to detect signal of selection on my data?

Best

Fernando

fersall commented 1 year ago

Sorry, closed the post my mistake!

fersall commented 1 year ago

Hello again, I just realized that the built in program norm has to be executed after running --xpehh. I understood from the manual that the results were already standardized with that program. So, I just ran it and my results looks much better now! Is it a general rule to consider a positive signature of selection those values higher than 2? Or is it mostly arbitrary?

Also, I understand that it is much powerful to look for windows of consecutive SNPs with higher EHH scores rather than consider each SNP individually. Is there a way to achieve that in Selscan?

Thanks a lot for you help and I am sorry for asking many newbie questions!

szpiech commented 1 year ago

Hi,

Glad you found norm! So with norm you can also use the --bp-wins flag to examine clusters of extreme scores either <-2 or >2. I wouldn't take a score ~2 alone very seriously, but if you found 90% of scores in a region

2 this is more compelling. Since you're using an XP stat, there will be two "directions" to consider, putative selection in the "ref" population (negative scores) and putative selection in the "non-ref" population (positive scores).

Let me know if you have any issues.

Zachary

On Wed, Nov 1, 2023 at 5:39 PM fersall @.***> wrote:

Hello again, I just realized that the built in program norm has to be executed after running --xpehh. I understood from the manual that the results were already standardized with that program. So, I just ran it and my results looks much better now! Is it a general rule to consider a positive signature of selection those values higher than 2? Or is it mostly arbitrary?

Also, I understand that it is much powerful to look for windows of consecutive SNPs with higher EHH scores rather than consider each SNP individually. Is there a way to achieve that in Selscan?

Thanks a lot for you help and I am sorry for asking many newbie questions!

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/105#issuecomment-1789728299, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQWRXGZGRIKDL7ZMWLLYCK6SNAVCNFSM6AAAAAA6WKXI56VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBZG4ZDQMRZHE . You are receiving this because you commented.Message ID: @.***>

fersall commented 1 year ago

Hi Zachary, I do have a question regarding the 2 directions to consider that you mentioned. First I want to give you a little more context. My "ref" population includes old lines (lets say founders lines), while the "non-ref" one includes modern lines. In case of putative selection in the "ref" population (negative scores). One would expect that the focal allele would increase in frequency in the "non-ref" population because of they were selected for their beneficial effect on that population. While in the case of putative selection in the "non-ref" population (positive scores) it would be the opposite. Am I correct?

szpiech commented 1 year ago

Hello,

I think you made a typo, so I will fix it for clarity below.

In case of putative selection in the "ref" population (negative scores). One would expect that the focal allele would increase in frequency in the "ref" population because they were selected for their beneficial effect on that population. While in the case of putative selection in the "non-ref" population (positive scores) it would be the opposite.

So, with this typo fixed, yes this is correct.

-Zachary

On Tue, Nov 7, 2023 at 9:33 AM fersall @.***> wrote:

Hi Zachary, I do have a question regarding the 2 directions to consider that you mentioned. First I want to give you a little more context. My "ref" population include old lines (lets say founders lines), while the "non-ref" one include modern lines. In case of putative selection in the "ref" population (negative scores). One would expect that the focal allele would increase in frequency in the "non-ref" population because of they were selected for their beneficial effect on that population. While in the case of putative selection in the "non-ref" population (positive scores) it would be the opposite. Am I correct?

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/105#issuecomment-1798665426, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQVRHUEDZ24ZF2VQG7LYDJBEVAVCNFSM6AAAAAA6WKXI56VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJYGY3DKNBSGY . You are receiving this because you commented.Message ID: @.***>

fersall commented 12 months ago

Hi Zachary, I run this command norm --xpehh --bp-win --winsize 100000 --files *.xpehh.out and I got the output

1   100001  0   -1  -1  -1  -1  NA  NA
100001  200001  0   -1  -1  -1  -1  NA  NA
200001  300001  34  0   0.0588235   100 5   -0.544703   -2.06226
300001  400001  32  0   0.53125 100 5   -0.573806   -2.86754
400001  500001  32  0   0   100 100 0.00109769  -0.850957
500001  600001  40  0   0   100 100 -0.287803   -1.1771

I found out in one of the questions you answered previously that the headers of each column correspond to:

<win start> <win end> <# scores in win> <frac scores gt threshold> <frac scores lt threshold> <approx percentile for gt threshold wins> <approx percentile for lt threshold wins> <max score> <min score>

And also, that <frac scores gt threshold> corresponds to "the fraction of XP-EHH scores =>2". But I am not clear what is <frac scores lt threshold>, Do they correspond to the reference and non-ref pop, respectively? I am sorry if that is a dumb question! I just want to make sure I am interpreting the results correctly.

Thanks

szpiech commented 12 months ago

Hi,

No worries, this one is <=-2.

Zachary

Le lun. 20 nov. 2023 à 5:27 PM, fersall @.***> a écrit :

Hi Zachary, I run this command norm --xpehh --bp-win --winsize 100000 --files *.xpehh.out and I got the output

1 100001 0 -1 -1 -1 -1 NA NA 100001 200001 0 -1 -1 -1 -1 NA NA 200001 300001 34 0 0.0588235 100 5 -0.544703 -2.06226 300001 400001 32 0 0.53125 100 5 -0.573806 -2.86754 400001 500001 32 0 0 100 100 0.00109769 -0.850957 500001 600001 40 0 0 100 100 -0.287803 -1.1771

I found out in one of your questions answered previously that the headers of each column correspond to:
<# scores in win> And also, that corresponds to "the fraction of XP-EHH scores =>2". But I am not clear what is , Do they correspond to the reference and non-ref pop, respectively? I am sorry if that is a dumb question! I just want to make sure I am interpreting the results correctly. Thanks — Reply to this email directly, view it on GitHub , or unsubscribe . You are receiving this because you commented.Message ID: ***@***.***>

szpiech / selscan

selective sweep tests in inbred lines #105