sgkit-dev / vcf-zarr-publication

Manuscript and associated scripts for vcf-zarr publication

Real VCF example: Genomics England 100,000 genomes project data #10

Closed jeromekelleher closed 3 months ago

jeromekelleher commented 6 months ago

To demonstrate how well vcf-zarr works on real-world data, we are going to demo it on the 100,000 Genomes Project VCFs (Genomics England project RR1062).

We cannot include data here before it has been "airlocked" out, but we can discuss the basic results and keep placeholders while waiting for this process to complete.

The basic idea is to show a table of information about the VCF chunks for a chromosome, and the corresponding information post conversion. Reporting time and memory usage for conversion would also be useful.
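One way to collect the time and memory figures is sketched below. This is only an assumption about how the measurement might be done, not the method used for the paper: the hypothetical helper `run_and_measure` runs a command as a child process and reads its peak resident memory from the standard-library `resource` module (Unix only). The command shown is a stand-in and would be replaced by the real conversion invocation.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd as a child process; return (wall seconds, peak RSS of children)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest peak RSS over all waited-for children so far
    # (kibibytes on Linux, bytes on macOS), so measure one command per process.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


# Stand-in command; replace with the real conversion invocation.
elapsed, peak = run_and_measure([sys.executable, "-c", "x = bytearray(10**7)"])
print(f"wall time: {elapsed:.2f}s, peak RSS (platform units): {peak}")
```

Because `RUSAGE_CHILDREN` reports a running maximum over all children, each conversion would need to be measured from a fresh Python process to get an independent figure.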

A key thing that we want to demonstrate is the performance of basic QC steps, using bcftools and Zarr (it's probably simplest for the narrative here if we use the Python Zarr API rather than sgkit; we don't want to confuse people or have to explain too much).

I guess we want to do two things:

  1. A filtering query that identifies variants without needing to look at the genotypes;
  2. A filtering query that does look at the genotypes.
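As a rough illustration of the two kinds of query, here is a minimal sketch using synthetic NumPy arrays as stand-ins for the real data. Everything in it is an assumption made for the example: the function names `site_filter` and `genotype_filter` are hypothetical, the region and missingness criteria are placeholders rather than the QC steps we will actually choose, and missing alleles are assumed to be coded as -1 in a (variants, samples, ploidy) genotype array. Arrays opened through the Python Zarr API support the same slicing, so the same functions should work on them unchanged.

```python
import numpy as np


def site_filter(position, start, end):
    """Query 1: select variants by a per-variant field only (no genotypes read)."""
    pos = position[:]  # one small 1-D array; cheap to load in full
    return (pos >= start) & (pos < end)


def genotype_filter(genotypes, max_missing, chunk=1000):
    """Query 2: select variants whose fraction of missing calls is <= max_missing.

    Reads the (variants, samples, ploidy) array in blocks of `chunk` variants so
    that only one block is in memory at a time. Missing alleles are assumed to
    be coded as -1.
    """
    n = genotypes.shape[0]
    keep = np.zeros(n, dtype=bool)
    for lo in range(0, n, chunk):
        block = np.asarray(genotypes[lo:lo + chunk])
        # A call counts as missing if any of its alleles is missing.
        missing = (block < 0).any(axis=2).mean(axis=1)
        keep[lo:lo + chunk] = missing <= max_missing
    return keep


# Synthetic stand-in data: 5 variants, 4 diploid samples.
position = np.array([100, 250, 300, 420, 900])
genotypes = np.zeros((5, 4, 2), dtype=np.int8)
genotypes[1, :3] = -1    # variant 1: 3 of 4 calls missing
genotypes[3, 0, 1] = -1  # variant 3: 1 of 4 calls missing

in_region = site_filter(position, 200, 500)
low_missing = genotype_filter(genotypes, max_missing=0.25, chunk=2)
print(position[in_region & low_missing])  # prints [300 420]
```

The first query touches only a one-dimensional per-variant array, while the second has to stream through the full genotype matrix, which is the contrast the two benchmarks are meant to expose.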

These don't need to be 100% realistic (there's no point in spending ages coding up some complex pipeline), but they should be simplified versions of things that people have actually done (ideally with references).

cc @stallmanGEL @benjeffery