szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0

Problem calculating iHS with 1000G 30x #102

Open pclavell opened 1 year ago

pclavell commented 1 year ago

Hello, I am using selscan to compute iHS with the 1000G 30x data.

From the 1000G 30x data I used only:

  • Unrelated individuals (from CEU population)
  • Phased and polarized biallelic variants

To compute the unstandardized iHS I used a genetic map that I expanded to include all my variants (using plink).
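One common way to expand a genetic map to every variant is linear interpolation between the flanking map markers. The thread does not say exactly what plink did here, so the Python sketch below only illustrates that general technique; the function name and the toy map are hypothetical.

```python
import numpy as np

def interpolate_genetic_map(map_bp, map_cm, variant_bp):
    """Linearly interpolate genetic positions (cM) at variant physical positions (bp)."""
    map_bp = np.asarray(map_bp, dtype=float)
    map_cm = np.asarray(map_cm, dtype=float)
    variant_bp = np.asarray(variant_bp, dtype=float)
    if np.any(np.diff(map_bp) <= 0):
        raise ValueError("map physical positions must be strictly increasing")
    if np.any(np.diff(map_cm) < 0):
        raise ValueError("map genetic positions must be non-decreasing")
    # np.interp clamps variants outside the map to the first/last cM value
    return np.interp(variant_bp, map_bp, map_cm)

# Toy map (hypothetical values): three markers on one chromosome
toy_map_bp = [1000, 2000, 4000]
toy_map_cm = [0.0, 0.1, 0.5]
toy_cm = interpolate_genetic_map(toy_map_bp, toy_map_cm, [1500, 3000, 5000])
```

A quick sanity check on real data is that the interpolated genetic positions never decrease along the chromosome and that variants beyond the map ends are handled the way you intend.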

Then I normalized all the autosomes together by 200 bins of derived allele frequency.
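For reference, frequency-bin normalization means standardizing each unstandardized score against the mean and standard deviation of all sites in the same derived allele frequency bin. The Python sketch below assumes equal-width bins on [0, 1] and is not selscan's own normalization code; the function name and the simulated data are hypothetical.

```python
import numpy as np

def normalize_by_freq_bin(freq, score, n_bins=200):
    """Z-score each site's score within its derived allele frequency bin."""
    freq = np.asarray(freq, dtype=float)
    score = np.asarray(score, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per site; the clip keeps freq == 1.0 in the last bin
    idx = np.clip(np.digitize(freq, edges) - 1, 0, n_bins - 1)
    out = np.full(score.shape, np.nan)
    for b in np.unique(idx):
        in_bin = idx == b
        sd = score[in_bin].std()
        # Bins with a single site or zero variance are left as NaN
        if in_bin.sum() > 1 and sd > 0:
            out[in_bin] = (score[in_bin] - score[in_bin].mean()) / sd
    return out

# Simulated scores whose mean drifts with frequency (illustration only)
rng = np.random.default_rng(0)
freq = rng.uniform(0.05, 0.95, 5000)
raw = rng.normal(loc=freq, scale=1.0)
norm = normalize_by_freq_bin(freq, raw, n_bins=20)
```

With many bins and few sites, some bins become sparse and their mean and standard deviation get noisy, so it is worth checking the number of sites per bin.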

I wanted to compare my results with the 1000G Selection Browser data (based on phase I). Although Fst was highly correlated (Pearson's r > 0.97), iHS is essentially uncorrelated (Pearson's r = 0.06). I didn't understand why, so I checked well-known positively selected loci in Europeans: LCT, SLC24A5 and SLC45A2. I attach a plot of each chromosome containing these loci; they show that not even LCT has the highest |iHS| on its own chromosome, and the other genes don't seem especially high compared to the background.
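A comparison like this depends on matching sites by position on the same genome build. Below is a minimal Python sketch of that matching step for one chromosome; the function name and toy values are hypothetical. It also reports the correlation of absolute scores. That is an extra diagnostic I am assuming could be informative, since a disagreement in ancestral/derived polarization at some sites would lower the signed correlation but not the absolute one. Nothing in the thread establishes that this is the cause.

```python
import numpy as np

def pearson_on_shared_sites(pos_a, score_a, pos_b, score_b):
    """Pearson r of signed and absolute scores at positions present in both sets."""
    pos_a, pos_b = np.asarray(pos_a), np.asarray(pos_b)
    score_a = np.asarray(score_a, dtype=float)
    score_b = np.asarray(score_b, dtype=float)
    # Indices of positions shared by both datasets (same build, same chromosome)
    _, ia, ib = np.intersect1d(pos_a, pos_b, return_indices=True)
    a, b = score_a[ia], score_b[ib]
    keep = np.isfinite(a) & np.isfinite(b)
    a, b = a[keep], b[keep]
    r_signed = np.corrcoef(a, b)[0, 1]
    r_abs = np.corrcoef(np.abs(a), np.abs(b))[0, 1]
    return int(keep.sum()), r_signed, r_abs

# Toy data: same magnitudes, but the sign is flipped at some shared sites
pos_a = [100, 200, 300, 400, 500]
score_a = [1.0, -2.0, 0.5, 3.0, -1.0]
pos_b = [200, 300, 400, 500, 600]
score_b = [2.0, 0.5, -3.0, 1.0, 0.0]
n_shared, r_signed, r_abs = pearson_on_shared_sites(pos_a, score_a, pos_b, score_b)
```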

I hope you can spot an error or give a recommendation. Meanwhile, I'm going to compute XP-EHH (CEU-YRI) to see what happens. Thanks a lot.

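[image] https://user-images.githubusercontent.com/99911796/268658763-28150847-e585-4e8a-a439-12a2f21adbee.png
[image] https://user-images.githubusercontent.com/99911796/268659002-cf75c259-c0a6-42a7-9927-9603216c86d1.png
[image] https://user-images.githubusercontent.com/99911796/268659082-627540ee-bb93-4684-8e71-d805d8679020.png

PS: I am aware that the common threshold is 2, but the one used here is just to remove more background.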

szpiech commented 1 year ago

Hello,

I'm not sure why you have such a low correlation. I seem to recall that the low-coverage and high-coverage calls used different genome builds. I assume you accounted for that? Your LCT signal looks good. The height of the peak isn't as important as the enrichment of extreme scores in the region for iHS. I'm not sure what version of the original XP-EHH software the selection browser people used. I will note that I found an error in the computation in that original code, which the author did fix, but I think it was around the time the selection browser paper was being worked on. So that's something to keep in mind (although even with the old incorrect scores they should be reasonably correlated).
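The point about enrichment can be made concrete: rather than looking at the single highest |iHS|, count what fraction of sites in each window exceeds a threshold. The Python sketch below assumes non-overlapping 100 kb windows and a threshold of 2; both are illustrative choices, and the function name and toy data are hypothetical.

```python
import numpy as np

def extreme_fraction_by_window(pos, norm_score, window_bp=100_000, threshold=2.0):
    """Return (window_start, n_sites, fraction with |score| > threshold) per window."""
    pos = np.asarray(pos)
    score = np.asarray(norm_score, dtype=float)
    keep = np.isfinite(score)          # drop sites without a normalized score
    pos, score = pos[keep], score[keep]
    win = pos // window_bp             # non-overlapping window index per site
    rows = []
    for w in np.unique(win):
        in_win = win == w
        frac = float((np.abs(score[in_win]) > threshold).mean())
        rows.append((int(w) * window_bp, int(in_win.sum()), frac))
    return rows

# Toy example: six sites on one chromosome
toy_pos = np.array([10_000, 20_000, 90_000, 110_000, 150_000, 250_000])
toy_score = np.array([2.5, -3.0, 0.1, 0.2, -0.3, 2.2])
windows = extreme_fraction_by_window(toy_pos, toy_score)
```

Windows with very few sites give unstable fractions (the last toy window has one site and a fraction of 1.0), so in practice it helps to compare windows with similar site counts.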

-Zachary

pclavell commented 1 year ago

Hello. Indeed, phase I used hg19 and the 30x data uses hg38. I accounted for that using LiftOver, which worked, as Fst correlated successfully. Regarding LCT, it is true that there is a cluster of around 200 SNPs with |iHS| > 2; however, for the other two genes it is not so clear. I am not convinced by my results. What else could I look at to confirm that my analysis worked? Thank you very much for the quick answer!

pclavell commented 1 year ago

I've just finished computing XP-EHH for YRI vs CEU. Using only chr22 (and therefore normalizing on this chromosome alone, because of computing time), I get a Pearson correlation of 0.686, which is surprising considering that both iHS and XP-EHH are based on EHH. This suggests that there is truly a problem either in selscan's iHS calculation or in the 1000G Selection Browser results.
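A side note on normalizing with chr22 alone: if XP-EHH is standardized with a single mean and standard deviation (no frequency bins), then normalizing on one chromosome instead of genome-wide only shifts and rescales the scores, and Pearson correlation is unchanged by that. The Python sketch below illustrates this with simulated values; the "genome-wide" mean and standard deviation are hypothetical numbers. It does not carry over to frequency-binned iHS, where each bin gets its own shift and scale.

```python
import numpy as np

rng = np.random.default_rng(1)
raw_a = rng.normal(size=1000)                             # unstandardized scores, dataset A
raw_b = 0.7 * raw_a + rng.normal(scale=0.7, size=1000)    # correlated scores, dataset B

def zscore(x, mean, sd):
    """Standardize x with a given mean and standard deviation."""
    return (x - mean) / sd

# Normalize A with its own (single-chromosome) mean and sd
r_chr = np.corrcoef(zscore(raw_a, raw_a.mean(), raw_a.std()), raw_b)[0, 1]
# Normalize A with a different, hypothetical genome-wide mean and sd
r_gw = np.corrcoef(zscore(raw_a, 0.3, 1.4), raw_b)[0, 1]
```

Here `r_chr` and `r_gw` agree up to floating-point error, so under this assumption the single-chromosome normalization by itself would not change the XP-EHH correlation.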

szpiech commented 1 year ago

Well, this is certainly weird. I’m assuming the selection browser scores have been frequency-bin normalized.

Zachary
