szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

sample size for XP-EHH #104

Open cgdmkns opened 1 year ago

cgdmkns commented 1 year ago

Dear @szpiech,

Thank you for developing this software. I'm new to this and am trying to use XP-EHH on my sister taxa, and I have some questions.

For the VCF files we use, is it enough to remove variants with a minor allele count below 2 (--mac 2) and filter out anything that isn't biallelic?

I also wonder how sample size affects the XP-EHH scores. If there is a large difference in sample size between the reference population and the focal population, is this a problem?

Another question: what do large values of the scores mean? Do they reflect the strength of the selection signature, since a larger genomic region is influenced by the sweep?

My last question concerns the search for clusters of extreme scores implemented in norm. As far as I understand, the "approx percentile for gt threshold wins" column represents the approximate percentile of scores greater than 2 in that window. To plot these extreme-score regions along the genome, after identifying the windows with the norm non-overlapping window option, is it correct to take the normxpehh values of the sites in these windows (from the norm output produced without windows) and plot them as a function of position? Sorry if this is something silly to ask : )

Thank you! Cigdem

szpiech commented 1 year ago

Hello,

Thanks for your kind words. Let me try to answer below.

For the vcf files we use, is it enough to remove variants with minor allele count less than 2 (--mac 2) and filter out anything but biallelic?

In my testing I've found that not filtering any sites for MAF works best for the XP statistics, but yes you should filter out anything that isn't biallelic.
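As a rough illustration of the biallelic filter, here is a minimal Python sketch. It assumes an uncompressed, tab-delimited VCF read as text lines, and the function names `is_biallelic` and `keep_biallelic` are hypothetical helpers, not part of selscan. In practice a dedicated VCF tool would normally do this step.

```python
def is_biallelic(vcf_line: str) -> bool:
    """True if a VCF data line lists exactly one concrete ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # column 5 of a VCF record holds the ALT allele(s)
    # Multiallelic sites list several ALT alleles separated by commas;
    # "." means no ALT allele and "*" marks a spanning deletion.
    return "," not in alt and alt not in (".", "*")


def keep_biallelic(lines):
    """Yield header lines unchanged and only the biallelic data lines."""
    for line in lines:
        if line.startswith("#") or is_biallelic(line):
            yield line


# Tiny made-up example: two header lines and three records.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tA\tG,T\t.\tPASS\t.",  # multiallelic, dropped
    "1\t300\t.\tC\t.\t.\tPASS\t.",    # no ALT allele, dropped
]
kept = list(keep_biallelic(example))
```

Note that no minor allele count or frequency filter is applied here, in line with the advice above for the XP statistics.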

Also I wonder how the sample size affects the XPEHH scores. If there is a large sample size difference between the reference population and the focal population, is this a problem?

I haven't explored this question carefully, but my hunch is that if one sample is much smaller than the other, the raw XP statistic would be slightly biased in favor of the smaller population. This should be a genome-wide mean effect, though, so normalizing ought to at least take care of that shift. I'm not sure how much it might influence power.
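To see why a genome-wide shift disappears after normalization, here is a small Python sketch. It assumes normalization is a plain z-score over all sites (subtract the genome-wide mean, divide by the standard deviation); the exact estimator used by norm may differ, and the scores and the 0.5 offset are made up.

```python
from statistics import mean, pstdev


def normalize(raw_scores):
    """Standardize scores to mean 0 and standard deviation 1."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    return [(s - mu) / sigma for s in raw_scores]


# Made-up raw XP-EHH scores, and the same scores with a constant bias
# added to every site (a stand-in for a sample-size effect on the mean).
raw = [0.4, -1.2, 0.1, 2.3, -0.6]
shifted = [s + 0.5 for s in raw]

norm_raw = normalize(raw)
norm_shifted = normalize(shifted)

# The two normalized lists agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(norm_raw, norm_shifted))
```

This only covers a uniform shift in the mean. It says nothing about how unequal sample sizes affect variance or power, which remains an open point above.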

The other question is: What do large values of the scores mean? Do they reflect the strength of the selection signature since a larger genomic region is influenced by the sweep?

An extreme negative score means that there are relatively long high-frequency haplotypes in that region of the genome in your reference population compared to your other population. Extreme positive scores mean the same thing but in the other population compared to the reference. While I think selection coefficient is likely related to the score, the precise relationship is not entirely clear. I would hesitate to draw too strong a conclusion based on the magnitude of the scores alone.

My last question: For the search of clusters with extreme scores implemented in norm, as far as I understood you state "approx percentile for gt threshold wins" column represents approximate percentile of scores greater than 2 in that window. While plotting these extreme score regions in genome, after identifying these windows with extreme scores with norm nonoverlapping command, is it correct to get the normxpehh values of these windows (from the norm result without window) and plot them as a function of position?

Sure, you can do this. See figure 6 in https://academic.oup.com/evlett/article/5/4/408/6697682 for an example.
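A minimal Python sketch of the selection step before plotting follows. It assumes the per-site results have already been loaded as `(position, normalized score)` pairs and the extreme windows as `(start, end)` pairs treated as half-open intervals; the helper name `sites_in_windows` and all values are hypothetical, and parsing the actual norm output files is left out.

```python
def sites_in_windows(sites, windows):
    """Return the (pos, score) pairs whose position falls in any window.

    sites   -- iterable of (pos, score) tuples
    windows -- iterable of (start, end) tuples, treated as start <= pos < end
    """
    selected = []
    for pos, score in sites:
        if any(start <= pos < end for start, end in windows):
            selected.append((pos, score))
    return selected


# Made-up per-site normalized XP-EHH scores and one extreme window.
sites = [(1000, 0.3), (150500, 2.8), (180000, 3.1), (420000, -0.2)]
extreme_windows = [(100001, 200001)]

to_plot = sites_in_windows(sites, extreme_windows)
positions = [pos for pos, _ in to_plot]
scores = [score for _, score in to_plot]
```

The `positions` and `scores` lists can then be passed to any plotting library as the x and y values.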

Hope this helps,

Zachary

cgdmkns commented 1 year ago

That's a clear answer, I appreciate your help. I'll also run iHS on the same data to identify sweeps that are still ongoing. Is your comment on MAF valid for that as well? Do you recommend taking the absolute values of all iHS scores (derived and ancestral) after norm and using them for interpretation, or is it more reliable to evaluate the derived and ancestral scores separately? Thanks!

szpiech commented 1 year ago

Hello,

For the single-population statistics, like iHS, it is usually best to filter low-MAF sites. selscan ought to do this by default for you. Yes, the absolute value of the scores is the best to use. This is because you may have linked ancestral sites on your selected haplotype, which would then give an iHS score with the opposite sign.
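As a small illustration of working with absolute values, the Python sketch below computes the fraction of sites whose normalized iHS exceeds a cutoff in absolute value. The cutoff of 2 follows the threshold mentioned earlier in this thread, and the function name `fraction_extreme` and the scores are made up.

```python
def fraction_extreme(norm_ihs, threshold=2.0):
    """Fraction of sites with |normalized iHS| above the threshold."""
    n_extreme = sum(1 for s in norm_ihs if abs(s) > threshold)
    return n_extreme / len(norm_ihs)


# Made-up normalized iHS scores for one region. The negative and the
# positive extreme both count, regardless of sign.
region_scores = [-2.6, 0.4, 2.1, -0.3, 1.9]
frac = fraction_extreme(region_scores)
```

Using the absolute value means a site with a strongly negative score and a linked site with a strongly positive score contribute to the same signal rather than cancelling out.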

Zachary

cgdmkns commented 1 year ago

Thank you for all the answers!

Best wishes! Cigdem