statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

Check that reference allele is really the reference allele on the reference genome #158

Closed ttbek closed 3 years ago

ttbek commented 3 years ago

For some locations LD seems to display fine, e.g. near HLA, while in other areas we mostly aren't getting it (but will if we change to some other SNP as the lead LD SNP), e.g. around TMEM108.

HLA: https://pheweb-tcga.qcri.org/region/MHC2_21978456/6:32378449-32778449
TMEM108: https://pheweb-tcga.qcri.org/region/Interferon_19272155/3:132815467-133215467

The live site may come and go as we make some more changes, so here are some screenshots:

[Screenshots: HLA, TMEM108]

TMEM108 seems to have LD in the PheWeb for the SAIGE UKBB data: http://pheweb.sph.umich.edu/SAIGE-UKB/gene/TMEM108

[Screenshot: TMEM108_SAIGE]

I think the SAIGE PheWeb is b37 like ours, and I thought the displayed LD comes just from 1000 Genomes, so I expected it to be the same, no?

Is the LD downloaded once while preparing the data, or is it fetched from the Michigan server (https://portaldev.sph.umich.edu/playground ?) on pageloads?

Of course the LD will vary based on what is selected as the lead SNP for LD, but what are the criteria for a SNP being shown grey (no LD)? For a few of the grey SNPs in the TMEM108 region, we can get LD colors by changing the lead SNP.

abought commented 3 years ago

PheWeb uses the Michigan LD server. Here is a write-up of reasons LD can be missing, from another of our sites that uses the same infrastructure: https://my.locuszoom.org/about/#missing-ld

-Andy Boughton abought@umich.edu


pjvandehaar commented 3 years ago

One of the variants that doesn't work as a refvar is 3:133071425_T/C. PheWeb and the LD server assume that the T must be the reference allele and C the alternate, but the hg19 reference genome has a C at that position.

To make this work, you'll need to modify your input files to have one column that is always the hg19 reference allele and one that is always the alternate. So, you need to swap the C with the T.

Sorry, I need to add a step to pheweb that checks the build, so that people won't keep running into this problem.

I don't have a script that converts your input files, but this bit of python will detect which allele is the reference allele:

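```python
from pheweb.load.detect_ref import get_default_builds

# Index 1 of the default builds is expected to be hg19 (GRCh37); the assert guards that.
hg19 = get_default_builds()[1]
assert hg19.grch_name == 'GRCh37'
# Print the reference base at 3:133071425, upper-cased.
print(hg19.get_bases('3', 133071425).upper())
```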

If you write a script to do the conversion, I'd appreciate it if you'd post it here to share it with anybody else who needs it.

ttbek commented 3 years ago

Thanks, this makes sense and is most likely the only problem. I will fix them.
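
For anyone landing here with the same problem, the fix described above (making the first allele column always the hg19 reference base) could be sketched roughly as follows. This is only an illustration, not a script from PheWeb: `align_alleles` and `get_ref_base` are hypothetical names, and the lookup is passed in as a plain callable so the example runs without PheWeb installed. In practice it might wrap `hg19.get_bases` from the snippet above. The sketch also assumes that the effect size and allele frequency refer to the second (alternate) allele, so it flips them when the alleles are swapped; check that against your own input files. Only single-base variants are handled.

```python
from typing import Callable, Optional, Tuple

Result = Tuple[str, str, Optional[float], Optional[float], str]


def align_alleles(
    chrom: str,
    pos: int,
    a1: str,
    a2: str,
    get_ref_base: Callable[[str, int], str],
    beta: Optional[float] = None,
    af: Optional[float] = None,
) -> Result:
    """Return (ref, alt, beta, af, status) with ref matching the genome.

    status is 'ok', 'swapped', 'mismatch', or 'skipped' (non-SNV input).
    """
    a1, a2 = a1.upper(), a2.upper()
    # This sketch only handles single-base alleles; indels are left untouched.
    if len(a1) != 1 or len(a2) != 1:
        return a1, a2, beta, af, "skipped"

    ref_base = get_ref_base(chrom, pos).upper()
    if a1 == ref_base:
        return a1, a2, beta, af, "ok"
    if a2 == ref_base:
        # ASSUMPTION: beta and af describe the second (alternate) allele,
        # so swapping the alleles negates beta and complements af.
        new_beta = None if beta is None else -beta
        new_af = None if af is None else 1.0 - af
        return a2, a1, new_beta, new_af, "swapped"
    # Neither allele matches: possibly a different build or strand.
    return a1, a2, beta, af, "mismatch"


# Stand-in lookup for illustration; a real one could wrap hg19.get_bases.
fake_genome = {("3", 133071425): "c"}
result = align_alleles(
    "3", 133071425, "T", "C",
    lambda chrom, pos: fake_genome[(chrom, pos)],
    beta=0.5, af=0.25,
)
print(result)
```

With the stand-in lookup, the example variant 3:133071425_T/C comes back as C/T with status `swapped`. Rows reported as `mismatch` are worth inspecting by hand, since they may indicate a different genome build or strand rather than a simple allele swap.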