szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

Working with large number of SNPs and few different haplotypes #77

Closed RamGonzalez closed 2 years ago

RamGonzalez commented 2 years ago

Hi Professor Szpiech,

Thank you for such an awesome and simple-to-use tool for computing selection statistics! I have a couple of questions about how I should go about using selscan when I have a large number of SNPs but a low number of distinct haplotypes:

I'm calculating iHS, nSL and XPEHH on different whole-genome data sets (around 15-25 samples per data set). I noticed a flag in the manual for rehh, called freqbin, that addresses a similar issue for iHS:

"freqbin: Size of the bins to standardize log(iHH_A/iHH_D). (...) If set to 0, standardization is performed considering each observed frequency as a discrete frequency class (useful in case of a large number of markers and few different haplotypes)"

  1. Is there a similar option or strategy like this in selscan?

  2. So far, for iHS for example, I'm normalizing with norm --ihs --bins 20 --files [all .ihs.out files per chr], but I'm hesitant to use 20 bins since almost half of them come out empty. Should I lower the number of bins, maybe down to 1?

Best, Ram

szpiech commented 2 years ago

Hello,

Thanks for the kind words. I haven’t implemented this precise feature in norm, but you can arrive at the same thing by setting qbins to a large number. There will be a lot of empty bins but since they are empty they won’t affect anything.

Hope this helps,

Zachary
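To illustrate why empty bins are harmless, here is a minimal sketch of per-frequency-bin standardization (illustrative Python only, assumed behavior rather than selscan's actual implementation; `standardize_by_bin` is a hypothetical helper, not a selscan function):

```python
# Illustrative sketch only (assumed behavior, not selscan's source):
# norm standardizes unstandardized iHS within derived-allele-frequency
# bins. An empty bin contributes no scores, so it has no effect at all.
import numpy as np

def standardize_by_bin(freqs, scores, n_bins):
    """Standardize scores within equal-width frequency bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(freqs, edges) - 1, 0, n_bins - 1)
    out = np.zeros_like(scores, dtype=float)
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty bin: skipped, affects nothing
        mu, sd = scores[mask].mean(), scores[mask].std()
        out[mask] = (scores[mask] - mu) / sd if sd > 0 else 0.0
    return out

# Few distinct haplotypes -> few distinct frequencies: with 100 bins
# most bins are empty, yet every SNP is still standardized within its bin.
rng = np.random.default_rng(0)
freqs = rng.choice([0.2, 0.4, 0.6, 0.8], size=1000)
scores = rng.normal(size=1000)
z = standardize_by_bin(freqs, scores, 100)
```

With only four frequency classes present, 96 of the 100 bins are empty, but the standardized scores come out identical to what a 4-bin scheme aligned to those frequencies would give.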


szpiech commented 2 years ago

Hello,

I meant you should set --bins to a high number, not qbins. Sorry for any confusion!

Zachary


RamGonzalez commented 2 years ago

Hello,

Thanks for your reply! I'll run some tests with different and bigger bin sizes as per your suggestion.

Best, Ram

szpiech commented 2 years ago

Hi Ram,

To be clear, setting --bins higher creates more bins, not larger bins. So if you set it high enough you will necessarily force one bin per observed allele frequency.

Zach
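The effect described above can be demonstrated with a quick sketch (illustrative Python under assumed equal-width binning, not selscan code; `occupied_bins` is a hypothetical helper): once the bin count far exceeds the number of distinct observed frequencies, every frequency occupies its own bin, mimicking rehh's freqbin = 0 behavior.

```python
# Sketch (assumed behavior, not selscan source): with --bins far above
# the number of distinct observed frequencies, every distinct frequency
# lands alone in its own bin.
import numpy as np

n_hap = 50  # e.g. 25 diploid samples -> 50 haplotypes
# Observable derived-allele frequencies are multiples of 1/50.
observed = np.arange(1, n_hap) / n_hap  # 0.02, 0.04, ..., 0.98

def occupied_bins(freqs, n_bins):
    """Count how many equal-width frequency bins actually receive SNPs."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(freqs, edges) - 1, 0, n_bins - 1)
    return len(np.unique(idx))

print(occupied_bins(observed, 20))    # 20: several frequencies share each bin
print(occupied_bins(observed, 1000))  # 49: one bin per distinct frequency
```

With 20 bins the 49 possible frequency classes get lumped two or three per bin; at 1000 bins each class is isolated, which is the one-bin-per-observed-frequency regime.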


RamGonzalez commented 2 years ago

Hi professor Szpiech!

So far I've looked at iHS results with --bins set to 20, 100, 500 and 1000. The biggest difference I've seen is between 20 and all of the others: slightly higher |iHS| values when using more bins, and somewhat better-defined signal peaks, though I was expecting a more noticeable reduction in noise and perhaps a decrease in the value of the statistic.

With 25 individuals, normalizing with all previously tested bin values, I get |iHS| values around 15, which I suspect is due to the low number of samples or the nature of the statistic itself. Good normalization practices like the one you suggested are essential nonetheless; I'm thinking of going above 1000 bins to see how that changes the results.

Thanks again for your follow-up. Best, Ram