sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Summary statistics IO and methods #440

Open hammer opened 3 years ago

hammer commented 3 years ago

The MRC IEU at Bristol has a specification for storing GWAS summary statistics in a VCF file.

While I certainly have mixed feelings about using VCF files as a container format, they have done the hard work of providing tens of thousands of GWAS summary statistics VCFs at the OpenGWAS project.

There are more details in

It would be great to figure out how to map the data in these GWAS VCF files to the sgkit data model and to write some methods on top of them.

hammer commented 3 years ago

Some additional resources on other approaches to file formats for summary stats

hammer commented 3 years ago

https://github.com/MRCIEU/pygwasvcf is Python code to parse GWAS-VCF files, but unfortunately it's built with pysam rather than cyvcf2.

hammer commented 3 years ago

So, looking at a few example GWAS-VCF files, they're just putting per-variant sumstats into the SAMPLE fields. It appears some files use the INFO field for variant-specific metadata like minor allele frequency that we might want to pick up as well, but otherwise, I don't think parsing is going to be too challenging.
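As a rough illustration of the layout described above, here is a minimal standard-library sketch of pulling per-variant sumstats out of the single sample column. The FORMAT keys used (`ES`, `SE`, `LP`, `AF`, `ID`) are assumed to follow the GWAS-VCF specification, the two records are made-up values rather than real OpenGWAS data, and `parse_gwas_vcf` is a hypothetical helper, not something that exists in sgkit or pygwasvcf.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical excerpt of a GWAS-VCF file; the values are invented for illustration.
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\texample-study
1\t49298\trs10399793\tT\tC\t.\tPASS\tAF=0.62\tES:SE:LP:AF:ID\t0.0045:0.0046:0.49:0.62:rs10399793
1\t54676\trs2462492\tC\tT\t.\tPASS\tAF=0.40\tES:SE:LP:AF:ID\t-0.0036:0.0047:0.35:0.40:rs2462492
"""


def _to_number(value: str) -> Any:
    """Return a float when the text is numeric, otherwise the original string."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_gwas_vcf(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse single-sample GWAS-VCF text lines into one dict per variant."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info, fmt, sample = fields[:10]
        record: Dict[str, Any] = {
            "contig": chrom,
            "position": int(pos),
            "ref": ref,
            "alt": alt,
        }
        # Variant-level metadata from INFO, prefixed to avoid clashing with FORMAT keys.
        for entry in info.split(";"):
            if "=" in entry:
                key, value = entry.split("=", 1)
                record["INFO_" + key] = _to_number(value)
        # Per-variant summary statistics stored in the sample column.
        for key, value in zip(fmt.split(":"), sample.split(":")):
            record[key] = None if value == "." else _to_number(value)
        # Assumption: LP holds -log10(p), so convert it back to a p-value.
        if isinstance(record.get("LP"), float):
            record["p_value"] = 10 ** -record["LP"]
        records.append(record)
    return records


records = parse_gwas_vcf(EXAMPLE_VCF.splitlines())
for rec in records:
    print(rec["ID"], rec["ES"], rec["SE"], rec["p_value"])
```

A real implementation would presumably go through cyvcf2 or the existing sgkit VCF reader and handle multi-allelic sites and header-declared types; this only shows where the values live.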

The hard part for us is figuring out if we want to define a blessed data model for sumstats and start adding operations that operate upon it.

eric-czech commented 3 years ago

There is this HumanBase tool from Olga Troyanskaya's lab which runs a NetWAS for you if you provide it sumstats. The docs describe the 3 formats it will let you provide them in: VEGAS, FORGE, and PLINK. I don't know anything more about them but they may be worth considering.

hammer commented 3 years ago

Interesting, NetWAS seems to operate on per-gene summary statistics, rather than per-variant. It would be interesting to hear from the Bristol team if they've considered computing per-gene summary statistics as part of their OpenGWAS work.

hammer commented 3 years ago

Another entry in the sumstats library and formats space:

hammer commented 2 years ago

New standard for summary statistics: https://ebispot.github.io/gwas-blog/new-standard-for-gwas-summary-statistics