Open hammer opened 3 years ago
Some additional resources on other approaches to file formats for summary stats
https://github.com/MRCIEU/pygwasvcf is Python code to parse GWAS-VCF files, but unfortunately it's built with pysam rather than cyvcf2.
So, looking at a few example GWAS-VCF files, they're just putting per-variant sumstats into the SAMPLE fields. It appears some files use the INFO field for variant-specific metadata like minor allele frequency that we might want to pick up as well, but otherwise, I don't think parsing is going to be too challenging.
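To make the layout concrete, here is a minimal stdlib-only sketch of pulling the sumstats out of one GWAS-VCF data line. It assumes a single sample column, one ALT allele per row, and the FORMAT keys I believe the GWAS-VCF spec uses (`ES` for effect size, `SE` for standard error, `LP` for -log10 p-value, `AF` for allele frequency); the function name and the output keys are made up for illustration. A real reader would go through cyvcf2 or pysam instead of splitting strings.

```python
def parse_gwas_vcf_line(line):
    """Parse one tab-separated GWAS-VCF data line into a plain dict.

    Assumes a single sample column and a biallelic record (hypothetical helper).
    """
    cols = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt, _qual, _filt, info, fmt = cols[:9]
    sample = cols[9]

    # INFO holds ';'-separated key=value pairs; flags have no '='.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    # FORMAT names the ':'-separated values stored in the sample column.
    stats = dict(zip(fmt.split(":"), sample.split(":")))

    def num(key):
        value = stats.get(key, ".")
        return None if value in (".", "") else float(value)

    lp = num("LP")  # assumed to be -log10(p-value)
    return {
        "contig": chrom,
        "position": int(pos),
        "id": rsid,
        "ref": ref,
        "alt": alt,
        "beta": num("ES"),
        "se": num("SE"),
        "p_value": None if lp is None else 10 ** -lp,
        "af": num("AF"),
        "info": info_dict,
    }
```

Calling it on a made-up line such as `"1\t49298\trs10399793\tT\tC\t.\tPASS\tAF=0.6\tES:SE:LP:AF\t0.01:0.02:2.0:0.6"` gives a dict with `beta`, `se`, `p_value`, and `af` as floats, plus the raw INFO entries under `info`.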
The hard part for us is figuring out if we want to define a blessed data model for sumstats and start adding operations that operate upon it.
There is the HumanBase tool from Olga Troyanskaya's lab, which runs NetWAS for you if you provide it with sumstats. The docs describe the three formats it accepts: VEGAS, FORGE, and PLINK. I don't know anything more about them, but they may be worth considering.
Interesting, NetWAS seems to operate on per-gene summary statistics, rather than per-variant. It would be interesting to hear from the Bristol team if they've considered computing per-gene summary statistics as part of their OpenGWAS work.
Another entry in the sumstats library and formats space:
New standard for summary statistics https://ebispot.github.io/gwas-blog/new-standard-for-gwas-summary-statistics
The MRC IEU at Bristol has a specification for storing GWAS summary statistics in a VCF file.
While I certainly have mixed feelings about using VCF files as a container format, they have done the hard work of providing tens of thousands of GWAS summary statistics VCFs at the OpenGWAS project.
There are more details in
It would be great to figure out how to map the data in these GWAS VCF files to the sgkit data model and to write some methods on top of them.
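As a rough sketch of what such a mapping could look like: sgkit keeps its data in arrays indexed along a `variants` dimension, so per-variant records would need to be pivoted into one column per field. The snippet below does that with plain lists. The `variant_sumstat_*` names are hypothetical rather than existing sgkit variables, the input record keys (`contig`, `position`, `beta`, `se`, `p_value`) are assumed, and a real implementation would build an xarray dataset instead of a dict.

```python
def to_columnar(records, fields=("beta", "se", "p_value")):
    """Pivot per-variant dicts into columns aligned on a 'variants' axis.

    Output names under 'variant_sumstat_' are hypothetical, not sgkit's.
    """
    columns = {"variant_contig": [], "variant_position": []}
    for field in fields:
        columns["variant_sumstat_" + field] = []

    for record in records:
        columns["variant_contig"].append(record["contig"])
        columns["variant_position"].append(record["position"])
        for field in fields:
            # Missing statistics stay as None so every column has equal length.
            columns["variant_sumstat_" + field].append(record.get(field))
    return columns
```

Every column has one entry per variant, so a study with several traits could be stacked along an extra axis later without changing the per-variant layout.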