szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

The results are closely related with the monomorphic sites #89

Open biozzq opened 2 years ago

biozzq commented 2 years ago

Hi @szpiech

I ran selscan (v2.0.0, XP-nSL) on the same samples after two different variant-filtering procedures. However, over 50% of the windows enriched for extreme scores do not overlap between the two runs.

I have a jointly called VCF that records variants across many different breeds and subspecies. To subset it to the samples I am focusing on, I used two different methods to generate a combined VCF containing only those samples:

1. bcftools view -S focus_sample_id
2. bcftools view -S focus_sample_id -Ou | bcftools view -i 'MAC > 0'

The main difference between the two commands is that the first keeps sites that are monomorphic in the subset, while the second removes them. In #59 (https://github.com/szpiech/selscan/issues/59), you suggested that I remove the monomorphic sites from the combined VCF (as in the second command). However, I found that you kept those sites in your paper (https://www.biorxiv.org/content/10.1101/2020.05.19.104380v2.full).
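To make the difference between the two subsets concrete, here is a minimal Python sketch of what a 'MAC > 0' filter does to a toy set of sites. It assumes biallelic sites, haplotypes coded as 0/1, and MAC defined as the count of the less frequent allele among the retained samples. The function names and the toy data are hypothetical and are not part of bcftools or selscan.

```python
def minor_allele_count(alleles):
    """Count of the less frequent allele at a biallelic site (alleles coded 0/1)."""
    alt = sum(alleles)
    return min(alt, len(alleles) - alt)


def drop_monomorphic(sites):
    """Keep only sites that are polymorphic among the retained samples (MAC > 0)."""
    return [(pos, alleles) for pos, alleles in sites if minor_allele_count(alleles) > 0]


# Toy subset after sample selection: (position, haplotype alleles).
sites = [
    (100, [0, 1, 0, 1]),
    (200, [0, 0, 0, 0]),  # variable only in samples that were dropped
    (300, [1, 1, 1, 1]),  # fixed for the alternate allele in the subset
    (400, [0, 0, 1, 0]),
]

kept = drop_monomorphic(sites)
print([pos for pos, _ in kept])  # [100, 400]
```

In this toy case the first command would pass all four sites to selscan, while the second would pass only the sites at positions 100 and 400.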

Looking forward to your reply, thank you in advance. Best wishes, Zheng zhuqing

szpiech commented 2 years ago

Hi Zheng Zhuqing,

On reflection, the choice to filter monomorphic sites actually has different consequences for XP-EHH vs XP-nSL. For XP-EHH it should not matter, as the distance is measured in either bp or cM. For XP-nSL it can actually change the distances, since we count in number of sites. It would seem to me, then, that if we know a site is polymorphic in another population (beyond the two being analyzed), this may provide important information for the statistic.
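The point about distances can be illustrated with a small Python sketch. It is not selscan's implementation; it only contrasts a physical distance (bp) with a distance counted in number of sites, using hypothetical positions where two of the five sites are monomorphic in the populations being compared.

```python
def distances_from_core(positions, core_index):
    """Return (bp_distance, site_count_distance) from the core site to each site."""
    core_pos = positions[core_index]
    return [(abs(pos - core_pos), abs(i - core_index)) for i, pos in enumerate(positions)]


# Hypothetical positions: 150 and 250 are monomorphic in the two populations
# being compared but polymorphic elsewhere in the joint call set.
all_sites = [100, 150, 200, 250, 300]
polymorphic_only = [100, 200, 300]

with_mono = dict(zip(all_sites, distances_from_core(all_sites, 0)))
without_mono = dict(zip(polymorphic_only, distances_from_core(polymorphic_only, 0)))

print(with_mono[300])     # (200, 4)
print(without_mono[300])  # (200, 2)
```

The bp distance from position 100 to position 300 is 200 in both cases, but the site-count distance drops from 4 to 2 once the monomorphic sites are filtered, which is why the filtering choice can change a site-counting statistic.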

Another thing to consider is that these statistics ultimately rely on an outlier approach for identifying interesting signals. If the results change that much between two successive runs with slightly different inputs, then I would advise treating only the intersecting regions as putative sweep locations. You could go further and explore how many of the top 5% of windows remain in the top 5%, etc.
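One way to carry out this comparison is sketched below in Python. It assumes each run has already been summarized as a mapping from a window ID to a single score where higher means more extreme; the window IDs, scores, and function names are made up for illustration and do not reflect selscan's output format.

```python
import math


def top_windows(scores, fraction=0.05):
    """Return the IDs of the windows whose scores fall in the top `fraction`."""
    n_top = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_top])


def shared_top_windows(run_a, run_b, fraction=0.05):
    """Return the windows in the top `fraction` of both runs, and the share of run A's top set they represent."""
    top_a = top_windows(run_a, fraction)
    top_b = top_windows(run_b, fraction)
    shared = top_a & top_b
    return shared, len(shared) / len(top_a)


# Hypothetical per-window scores from two runs with different site filtering.
run_a = {f"w{i}": float(i) for i in range(8)}
run_b = dict(run_a, w6=0.5, w2=9.0)

shared, share = shared_top_windows(run_a, run_b, fraction=0.25)
print(sorted(shared), share)  # ['w7'] 0.5
```

Only the windows in the shared set would then be treated as putative sweep locations, and repeating the call with different values of `fraction` shows how stable the ranking is between the two inputs.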

Hope this helps,

Zachary


biozzq commented 2 years ago

Thank you. I think it is better for me to generate an all-sites VCF (containing genome-wide invariant sites in addition to variant sites) when performing joint variant calling across populations. When running XP-nSL, these monomorphic sites can provide information for the statistic. These sites are also essential for correctly computing π and dxy.

However, when running XP-EHH and iHS, it is better to filter out sites that are monomorphic across the populations being analyzed.

Is my understanding correct?

Best, Zheng zhuqing

szpiech commented 2 years ago

For XP-nSL, if you know there is a polymorphism at the locus in that species, it may be worth keeping the site (even if it is monomorphic in the two populations you are comparing), as it may provide information (since it influences the distance measure). However, I have not tested this in detail, so I do not know whether it ultimately helps or hurts. Retaining them seems reasonable, though.

For XP-EHH, since the distance is measured in bp or cM, these sites should make no difference (other than making the software run longer), so you may as well filter them.

-Zachary
