statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License
154 stars 65 forks source link

rsid change or missing #143

Closed stat-yyang closed 3 years ago

stat-yyang commented 3 years ago

Nice work getting your site up and running– it looks great and runs fast.

If this is still an issue, could you explain more about it? Which files are missing variants, and where had you seen those variants that made you expect them? Do you mean that the variants were in your input files but aren't in generated-by-pheweb/parsed/*.tsv? Could you copy-paste or screenshot the lines (and headers and a couple lines before/after) for one example of a missing variant?


Thanks a lot for your help! I forgot to reply to this issue last month. For example, in the original input file, Screen Shot 2020-08-16 at 11 14 20 AM and in *.tsv file, Screen Shot 2020-08-16 at 11 03 09 AM Take a look at the first a couple of lines, they are matched. There are missing rsid for variants in .tsv file. And some variants have multiple rsid (chrom: 1, pos: 749963)

stat-yyang commented 3 years ago

For example, a very famous variant APOE variant rs429358. It can be searched by the chrom:pos, but can not be searched by its rsid. http://67.205.180.40:443/variant/19:45411941-C-T Screen Shot 2020-09-08 at 10 39 42 AM Is it related to the lookup table we are using? I suppose that some rsids are missing in that lookup table.

We found a large portion of unannotated SNPs in /generated-by-pheweb/sites/sites-rsids.tsv

Screen Shot 2020-09-11 at 2 43 14 PM

Do you know if we can replace the file and make the SNPs matched to its rsids?

Thanks a lot!

stat-yyang commented 3 years ago

I found a possible source of the problem.

The unmatched SNPs could be the mismatch of reference and alternate alleles. For example, 1:693731 G>A can not be matched to rs12238997 because the reference allele in our data is G, not A.

I modified a little bit in add_rsids.py to accommodate the issue.

add rsids = rsids + [rsid['rsid'] for rsid in rsid_group if cpra['ref'] == rsid['alt'] and are_match(cpra['alt'], rsid['ref'])] after line 135

Please correct me if I did something wrong.

pjvandehaar commented 3 years ago

PheWeb expects you to have one column of reference alleles (ref), which are exactly what is on the reference genome, whether that's hg19 or hg38.

For you, is the allele seen on the hg19 reference genome sometimes A1 and sometimes A2? Or is it always A2?

You can check by running cat /tmp/v.vcf | tail -n+2 | cut -f1,3,4,5 | detect-ref chr-pos-a1-a2. detect-ref is included with pheweb, so you probably have it already. Here's a screenshot of me trying it on some of your data that I typed out:

image

Either way, you'll need a script to modify your input files to fix up beta and af. In that script you can change the headers to say ref and alt.