googlegenomics / gcp-variant-transforms

GCP Variant Transforms
Apache License 2.0

off by 1 error? #694

Closed: bradmonk closed this issue 3 years ago

bradmonk commented 3 years ago

This may not be the right place to flag this issue, but I don't know where else to start...

All the variant start positions are off by one in the BigQuery gnomAD exome dataset compared to both the gnomAD browser (their web-based gene/variant lookup portal) and the raw VCF files.

The gnomAD dataset was recently made available on BigQuery (see the Google Cloud Marketplace and blog post announcements). The marketplace page mentions:

Variant Transforms was used to process these VCF files and import them to BigQuery. VEP annotations were parsed into separate columns for easier analysis using Variant Transforms’ annotation support.

This issue may not have anything to do with the Variant Transforms and annotation code in this repo. But as I mentioned, I don't know where else to raise it (ideas?). To quickly reproduce the issue, run this gist code in the BigQuery editor, then navigate to the gnomAD website and look up the gene PAPLN (or just check out the attached screenshot).

Thanks for your time! b

[Screenshot: BigQuery_Gnomad_OffByOne]

bradmonk commented 3 years ago

I see now that the gnomAD website is annotated by end position, not start position. Problem solved, I think.

samanvp commented 3 years ago

Hi @bradmonk, and sorry for the late response. I think the root of this discrepancy is the underlying coordinate systems: Variant Transforms uses 0-based coordinates by default, while the gnomAD browser uses 1-based coordinates. You can confirm this by comparing variants in our BigQuery tables against the VCF files. Here are the top 3 variants on chromosome Y in the 2.1.1 exome VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER
Y       2654979 rs781744002     G       C       4781.45 PASS
Y       2655003 rs750874990     C       G       729.42  PASS
Y       2655009 rs1327055451    A       G       2172.34 PASS
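The relationship between the 1-based VCF POS values above and a 0-based start position can be sketched in a few lines of Python. This is only an illustration, not code from Variant Transforms: the helper name `vcf_to_zero_based` and the `Interval` type are hypothetical, and the half-open end convention (end = start + length of REF) is an assumption made for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical container for a 0-based, half-open interval."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive (assumed convention for this sketch)


def vcf_to_zero_based(pos: int, ref: str) -> Interval:
    """Convert a 1-based VCF POS and REF allele to a 0-based interval."""
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    start = pos - 1
    return Interval(start=start, end=start + len(ref))


# The three chromosome Y records listed above: (CHROM, POS, REF).
records = [
    ("Y", 2654979, "G"),
    ("Y", 2655003, "C"),
    ("Y", 2655009, "A"),
]

for chrom, pos, ref in records:
    interval = vcf_to_zero_based(pos, ref)
    print(chrom, pos, interval.start, interval.end)
```

Under these assumptions, the end of a single-base variant equals its VCF POS, which would be consistent with the earlier observation that the browser positions line up with the end position rather than the start position.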

And here is our BigQuery table:

[Screenshot of the BigQuery table, taken 2020-11-09 at 10:03:19 AM]

Hope this helps.

samanvp commented 3 years ago

As you might know, we have put together a Jupyter notebook with a couple of sample queries for extracting information from the gnomAD tables. If you have queries that you'd like to share with the rest of the community, we would be more than happy to add them to that page.

Thank you!