statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

Support input in gwas-vcf format #180

Closed jielab closed 5 months ago

jielab commented 2 years ago

Hi guys,

These days, GWAS files have up to 20 million rows, which makes them very inefficient to query and process when they are stored as plain TXT files.

I think the VCF format is a really good idea, as explained here: https://github.com/MRCIEU/gwas2vcf

I don't know if there is a way to support the VCF format in PheWeb.

Best regards, Jie

pjvandehaar commented 2 years ago

Do you mean that internally pheweb should store everything in tabixed bgzipped GWAS-VCF instead of the current tabixed bgzipped tsv files? Why? How would that make queries more efficient?

Or do you just want to use GWAS-VCF as input to create a pheweb? It should be easy to write a script that converts GWAS-VCF into the input format pheweb requires. Do you have one file per phenotype, or many phenotypes in a single file?
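A conversion script along those lines could look like the minimal sketch below. It is an illustration, not part of pheweb. It assumes the GWAS-VCF FORMAT keys `ES`, `SE`, `LP` (-log10 p-value) and `AF`, one sample column per phenotype, and pheweb input columns named `chrom`, `pos`, `ref`, `alt`, `pval`, `beta`, `sebeta` and `af`. The function names `gwas_vcf_to_rows` and `write_tsv` are hypothetical, and all of these assumptions should be checked against the versions of both formats you actually use.

```python
import csv
import sys

# Assumed mapping from GWAS-VCF FORMAT keys to pheweb column names.
# Verify against the GWAS-VCF specification and the pheweb documentation.
OPTIONAL_FIELDS = {"ES": "beta", "SE": "sebeta", "AF": "af"}
COLUMNS = ["chrom", "pos", "ref", "alt", "pval", "beta", "sebeta", "af"]


def gwas_vcf_to_rows(lines, sample_index=0):
    """Yield one dict per biallelic variant for one sample (phenotype) column.

    `lines` is any iterable of uncompressed VCF text lines.
    `sample_index` selects the sample column (0 = first column after FORMAT).
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the #CHROM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # no FORMAT/sample data for the requested column
        chrom, pos, _variant_id, ref, alt = fields[:5]
        if "," in alt:
            continue  # this sketch does not handle multiallelic records
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        lp = sample.get("LP", ".")
        if lp in (".", ""):
            continue  # no p-value recorded for this variant
        row = {
            "chrom": chrom,
            "pos": pos,
            "ref": ref,
            "alt": alt,
            "pval": 10 ** (-float(lp)),  # LP is assumed to be -log10(p)
        }
        for vcf_key, column in OPTIONAL_FIELDS.items():
            value = sample.get(vcf_key, ".")
            row[column] = "" if value == "." else value  # missing left empty
        yield row


def write_tsv(rows, out):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory example with two made-up variants and one phenotype column.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttrait1\n",
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\tES:SE:LP:AF\t0.5:0.1:2:0.3\n",
    "1\t2000\trs2\tC\tT\t.\tPASS\t.\tES:SE:LP:AF\t-0.1:0.02:.:0.1\n",
]
write_tsv(gwas_vcf_to_rows(example), sys.stdout)
```

For a real file, the lines could come from `gzip.open(path, "rt")`, since bgzipped files are gzip-compatible. If a file holds many phenotypes, calling the function once per `sample_index` would give one table per phenotype.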

jielab commented 2 years ago

Dear Peter:

I mean the latter: PheWeb should accept GWAS-VCF as input. As you know, these GWAS files with millions of rows are huge. It is confusing, and a headache, that each piece of software needs different columns and column names. I think we should use VCF's capacity for fast queries, which comes with the accompanying vcf.tbi index file.

I hope that you have a few minutes to read this paper, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02248-0 , and that you agree that supporting the VCF format is a good idea.

Best regards, Jie

bschilder commented 2 years ago

Just wanted to chime in that MungeSumstats might be helpful here:

jielab commented 2 years ago

Thank you very much!

Best regards, Jie