sgkit-dev / vcf-zarr-publication

Manuscript and associated scripts for vcf-zarr publication

Real VCF example: Genomics England 100,000 genomes project data #10

Closed: jeromekelleher closed this issue 3 months ago

jeromekelleher commented 6 months ago

To demonstrate how well vcf-zarr works on real-world data we are going to demo on the 100,000 genomes VCFs (Genomics England project RR1062).

We cannot include data here before it has been "airlocked" out, but we can discuss the basic results and keep placeholders while waiting for this process to complete.

The basic idea is to show a table of information about the VCF chunks for a chromosome, and the corresponding information post conversion. Reporting time and memory usage for conversion would also be useful.

A key thing that we want to demonstrate is the performance of basic QC steps, using bcftools and Zarr (it's probably simplest for the narrative here if we use the python Zarr API rather than sgkit. We don't want to confuse people and have to explain too much).

I guess we want to do two things:

  1. A filtering query that identifies variants without needing to look at the genotypes;
  2. A filtering query that does look at the genotypes.

These don't need to be 100% realistic (there's no point in spending ages coding up some complex pipeline), but they should be simplified versions of things that people have actually done (ideally with references).

cc @stallmanGEL @benjeffery

stallmanGEL commented 5 months ago

I think a really common thing to do with filtering is removing variants that don't pass quality thresholds.

The most standard and simple of these is looking at something like read depth (DP) and genotype quality (GQ), e.g. this recent paper using GEL's aggregate VCF files, Sinan's filtering protocol, etc.; basically everybody does this, e.g. gnomAD v4 used a hard filter including DP and GQ.

This means the DP/GQ information is contained in the FORMAT field, so the filter does not interrogate genotypes directly but does look across all samples. A bcftools command would look something like:

bcftools view --include 'FORMAT/DP>10 & FORMAT/GQ>20' example.vcf

This would hard filter out any site where no sample has both DP > 10 and GQ > 20 (i.e. a site is kept if at least one sample passes both thresholds).
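
For reference, here is a minimal sketch of the equivalent mask computation on a vcf-zarr dataset (untested; the sgkit-style array names call_DP and call_GQ and the path are assumptions here):

    import numpy as np
    import zarr

    root = zarr.open("example.zarr")  # hypothetical path
    dp = root["call_DP"][:]           # shape (variants, samples)
    gq = root["call_GQ"][:]

    # A site passes if any sample has DP > 10 and GQ > 20, matching
    # bcftools' OR-across-samples semantics for FORMAT filters
    variant_pass = np.any((dp > 10) & (gq > 20), axis=1)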

Another simple thing would be looking at a specific region of interest, like gene coordinates. So something like:

bcftools view --include 'FORMAT/DP>10 & FORMAT/GQ>20' --targets chr11:5225464-5229395 example.vcf

Would do the same but just for variants overlapping with the HBB gene locus.

jeromekelleher commented 5 months ago

Perfect, let's do exactly that and compare timing with bcftools.

jeromekelleher commented 5 months ago

I've made a start on this in #22, and it looks like there's a pretty startling performance difference even on 1000 Genomes data. The pipeline I'm running is this:

bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20" -Ou tmp/1kg_chr20_all.vcf.gz \
    | bcftools +fill-tags -Ou \
    | bcftools +af-dist

Which is still running after about three quarters of an hour. The zarr equivalent takes less than a minute.

However, this raises an important point which it would be good to get your thoughts on @stallmanGEL. The processing paradigm that has built up around VCF is around Unix pipes that consume VCF and produce more VCF. So, the bcftools command bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20" produces some VCF, which you then pipe into other tools that consume VCF (which is all well and good).

However, that is not how we imagine Zarr datasets being used. We don't write out copies of the whole thing, we do our calculations directly on the thing itself. Because the computation paradigm is different, it's hard to do apples-to-apples comparisons. Forcing ourselves to emit VCF from the Zarr as a result of filtering here seems pointless.

In actual practice, I would imagine there are two strategies:

  1. Compute a standard genotype filtering mask (or several), store it, and use in downstream calculations
  2. Compute the mask on the fly as you are doing calculations (which is equivalent to the full bcftools pipeline above)

This is what I'm doing here - filtering the calls based on thresholds before classifying the remaining genotypes into hets, homs etc (which is the input for afdist).
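
As a rough sketch (not the actual benchmark code; the sgkit-style array names, diploid calls, and path are assumptions), strategy 2 amounts to something like:

    import numpy as np
    import zarr

    root = zarr.open("example.zarr")  # hypothetical path
    gt = root["call_genotype"]        # (variants, samples, ploidy), -1 = missing
    dp = root["call_DP"]
    gq = root["call_GQ"]

    het_count = 0
    hom_alt_count = 0
    chunk_len = gt.chunks[0]
    # Process one variant-chunk at a time so memory stays bounded
    for start in range(0, gt.shape[0], chunk_len):
        stop = start + chunk_len
        g = gt[start:stop]
        keep = (dp[start:stop] > 10) & (gq[start:stop] > 20) & (g >= 0).all(axis=-1)
        het_count += int(((g[..., 0] != g[..., 1]) & keep).sum())
        hom_alt_count += int(((g[..., 0] == g[..., 1]) & (g[..., 0] > 0) & keep).sum())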

It's important that computing afdist isn't seen as a straw man though. The only reason I chose this is because it was the only thing I could find in bcftools that needed to look at all genotypes, and computed one reasonably simple thing (not a ton of different stats).

It's currently looking like we're using afdist throughout the paper here, which is handy if we can just use this one example computation, but I am worried that this will be misunderstood (as in, they think we're actually interested in afdist in itself, and we're somehow just talking about optimising this specific operation).

What do you think @stallmanGEL?

stallmanGEL commented 5 months ago

I think the example works well (although see below re. bcftools VCF-to-BCF conversion!).

A few questions though (apologies for the naivety!). Is the scope of this comparison to highlight the benefit of Zarr vs VCF for any kind of commonly used query, or specifically for summary-statistic-style, afdist-like commands that perform calculations using genotypes? (Plus the necessity of the VCF-to-VCF piping you mentioned, which appears to be another big performance bottleneck.)

I do admit that afdist is not a utility I've used myself, but if the scope of the comparison is the latter, then I think it works + is sufficient to highlight this if posed as a basic but indicative example of the issue you want to highlight.

If it is the former, then perhaps a second example using a completely different and more commonly used utility would be useful to allay concerns that Zarr is only useful for afdist (or afdist-style commands) alone or, conversely, that bcftools / the VCF format is only bottlenecked by this specific plugin.

Perhaps of use, I stumbled upon something from the Genomics England RE documentation, which is a list of common questions/queries with respect to utilising aggV2.

There are 4 examples there, the top 3 use a single bcftools query command and output some basic information regarding samples + genotypes or AFs. They are really simple, but obviously this kind of thing is widely used by researchers at GEL, so perhaps replicating one of these (maybe alongside the same --include flag?) would be useful? Or maybe these are too simple / easy a task to highlight the benefits of Zarr.


Also, as a second point I thought was worth highlighting from the bcftools developers' own recent retrospective:

The “bcftools view” command provides conversion between the text VCF and the binary BCF format, where both formats can be either plain (uncompressed) or block-compressed with BGZF for random access and compact size. The plain text VCF output is useful for visual inspection, for processing with custom scripts, and as a data exchange format. It should not be used when performance is critical because BCFtools internally uses the binary BCF representation and the conversion between the text VCF format and the binary BCF format is costly. Also compression and decompression is CPU intensive, and therefore when streaming between multiple commands in a pipeline it is recommended to stream uncompressed BCF by appending the option “-Ou.” (I noticed you did already use the -Ou flag).

So perhaps comparing performance of bcftools (step 1 in your pipe) directly on the block compressed VCF is also not a fair comparison as it requires costly file conversion to BCF? Perhaps running directly on a pre-converted BCF file may also be worth doing (or at least highlighting this as a performance bottleneck specific to bcftools?).

jeromekelleher commented 5 months ago

Thanks @stallmanGEL, this is super helpful.

The goal is to convince people that zarr is better at basically everything, at the 100,000 genomes scale. Just about the only thing that Zarr won't be better at is getting all of the information for a single variant, and writing it out as text. That's what VCF is good at.

A bcftools query example would probably be a good thing to do all right, to emphasise quite how inefficient VCF is for accessing subsets of the data. Here's one example I just did on 1000 genomes (2020) data (I'm arbitrarily pulling out one INFO field here, it doesn't really matter which one).

bcftools query -i 'FILTER="PASS"' -f "%CHROM,%POS,%INFO/AN_EUR\n" tmp/1kg_chr20_all.vcf.gz > tmp1.csv

This takes about 9 minutes to run on the 39G vcf.gz file.

Here's the corresponding zarr python code:

    import zarr
    import pandas as pd

    # prefix is the path to the converted dataset, e.g. "tmp/1kg_chr20_all"
    zarr_ds = zarr.open(prefix + ".zarr")
    # The first column of variant_filter corresponds to the PASS filter
    pass_filter = zarr_ds["variant_filter"][:, 0]
    contig_id = zarr_ds["contig_id"][:]
    df = pd.DataFrame(
        {
            "CHROM": contig_id[zarr_ds["variant_contig"].vindex[pass_filter]],
            "POS": zarr_ds["variant_position"].vindex[pass_filter],
            "INFO/AN_EUR": zarr_ds["variant_AN_EUR"].vindex[pass_filter],
        }
    )
    df.to_csv("tmp2.csv", header=False, index=False)

This takes about 7 seconds to run, of which around 3 seconds is outputting the CSV.

The CSV contains 2592958 rows and is identical to the output of bcftools.

jeromekelleher commented 5 months ago

Your point about running this on VCF is well taken, I'll convert to BCF and check. I'd bet there won't be a major difference though.

jeromekelleher commented 5 months ago

Update: 7m 15s on the converted 43G BCF file. So BCF is faster, as you'd expect, but relatively speaking it's still only a small improvement.

stallmanGEL commented 5 months ago

Awesome - happy to help! This is incredibly exciting stuff.

I'm looking forward to getting hours of my life back!

jeromekelleher commented 5 months ago

I just added an outline narrative for the Genomics England case study to the latex document (see the section "Case study: Genomics England 100,000 genomes").

We're currently waiting on some analysis from @benjeffery to airlock out the data for the table and the discussion in the text (hopefully we can fill this in soon).

@stallmanGEL - would you mind filling in some details in the first paragraph about the dataset, with some references etc when you get a chance please?

stallmanGEL commented 5 months ago

Sure! Will take some time out tomorrow (Wednesday 3rd) to fill in some details :)

jeromekelleher commented 4 months ago

@stallmanGEL you mention that two refs use simple queries on single gene regions:

simple queries involving single gene regions~\cite{leggatt2023genotype}~\cite{lam2023repeat}.

I wonder if we could reimplement one of those as one of our query examples? It would be nice to say "these are the actual commands run in these papers, and we make them X times faster".

stallmanGEL commented 4 months ago

From Leggatt et al. 2023 (~\cite{leggatt2023genotype}).

"[...] was extracted from the aggregated gVCF from which the NPHP1 gene locus (defined using NCBI GRCh38/hg38 (chr2:110,123,335–110,205,062)) was selected using bcftools 1.6. Variants meeting minimum quality standards (GQ > 20, mean GQ of > 35, DP > 10 and maximum missingness of 70%) were identified using vcftools 0.1.16 and retained for downstream analysis (n = 12,780)."

They (for reasons unknown to me) switch from bcftools to vcftools when performing the QC stuff. To make the comparison somewhat easier, I suppose it makes sense to just do this in a single bcftools command? This would be...

bcftools view \
    --regions chr2:110123335-110205062 \
    --include 'MIN(FMT/GQ)>20 & MIN(FMT/DP)>10 & MEAN(FMT/GQ)>35 & F_MISSING<=0.7' \
    /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_109880143_110333248.vcf.gz

That is the aggV2 chr2 chunk containing the relevant gene region. The same can be achieved with the --targets chr2:110123335-110205062 flag I believe, but there are some differences with respect to positional overlaps etc., and they can be used in conjunction (--regions jumps using the .vcf.csi index, --targets streams the whole VCF). So I'm not sure which is more efficient, or if that will really matter!
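
For comparison, a rough sketch of the same filter on a vcf-zarr chunk might look like this (untested; the sgkit-style array names and path are assumptions, and bcftools' exact missing-data handling is glossed over):

    import numpy as np
    import zarr

    root = zarr.open("aggv2_chr2_chunk.zarr")  # hypothetical path
    pos = root["variant_position"][:]
    # Positions are sorted, so the NPHP1 locus is a contiguous slice
    lo, hi = np.searchsorted(pos, [110123335, 110205063])

    gq = root["call_GQ"][lo:hi]
    dp = root["call_DP"][lo:hi]
    missing = (root["call_genotype"][lo:hi] < 0).any(axis=-1)

    keep = (
        (gq.min(axis=1) > 20)
        & (dp.min(axis=1) > 10)
        & (gq.mean(axis=1) > 35)
        & (missing.mean(axis=1) <= 0.7)
    )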

@benjeffery - is this simple enough to adapt a 1-1 comparison using the aggV2 VCFs vs Zarr? Is there anything else you'd like me to do to help with this? :)

jeromekelleher commented 4 months ago

This isn't a great example @stallmanGEL because it's essentially just extracting a slice of the full VCF data and writing to a file for later analysis. This is a symptom of the disease really, in a way - the thing itself is so cumbersome that the only tolerable way of working with the region you're interested in is to pull out a subset and make a copy. We can't really compare with this as we don't currently have an easy way to output VCF text --- and anyway, it's something we specifically don't want people to do.

Do they document some further analysis that they do on the subset? We could show how we just do that analysis directly on the data?

stallmanGEL commented 4 months ago

Yes - that makes sense! Much of what people do is take a VCF, filter it, then use a software / pipeline that takes another VCF as input etc. etc.

They do perform some further analysis on the VCF subset, which is the application of an algorithm (GenePy) for calculating pathogenicity scores for the gene region for each individual. The VCF + annotation manipulation pipeline and associated Python code for the GenePy algorithm can be found here: https://github.com/UoS-HGIG/GenePy-1.4.

The algorithm is simple enough, but would require us to pull some additional annotations for all variants in the region, namely CADD scores and gnomADv3 allele frequencies. We have flat files containing these (but they are pretty massive); alternatively, VEP-annotated VCFs (for each chunk of aggV2) with CADD/gnomADv3 AFs are already available to researchers.

Their pipeline does quite a lot of jigging around with VCFs using grep + awk etc before running the algorithm. Seems like this could be sped up immensely - but I wonder if it is a bit of a hassle to replicate as a comparison in the paper?

stallmanGEL commented 4 months ago

Alternatively, a simpler example from a very recent (but high-impact) pre-print: https://www.medrxiv.org/content/10.1101/2024.04.07.24305438v1.full-text#T1

They basically extract variants in aggV2 in an 18 bp region of an snRNA on chromosome 12 (chr12:120,291,825-120,291,842) and then simply tabulate allele counts of each variant in this region. I think they do this on a subset of the cohort with Neurodevelopmental Disorders.

I have no idea how exactly they performed this (e.g. using bcftools etc.), other than that it was done on aggV2 - but this could be a super simple alternative if the above is too cumbersome?

jeromekelleher commented 4 months ago

> Their pipeline does quite a lot of jigging around with VCFs using grep + awk etc before running the algorithm. Seems like this could be sped up immensely - but I wonder if it is a bit of a hassle to replicate as a comparison in the paper?

I had a quick look and it's pretty hairy - replicating it (and being sure we're doing the same thing) would take a few weeks of development in practice I think. I have no doubt that if everything was done properly it would be a matter of seconds to do this fairly simple merging of different information streams. I don't really want to do this much work though, and it also probably wouldn't be fair to highlight their pipeline as an example of something that's super slow.

benjeffery commented 4 months ago

In practice, in the near term at least, I think that the most common workflow is going to be extracting VCF text from the zarr for passing to downstream tools. Demonstrating the efficiency of this might be a good thing as it shows how zarr-vcf can be a drop-in replacement in existing workflows? We can have some initial strawman zarrvcftools to demonstrate this, whilst noting that going to VCF text is not what you actually want to do in the long term? We're already saying in the paper that zarrvcftools is needed.

jeromekelleher commented 4 months ago

Yeah - vcfzarrtools is needed, but outputting VCF text at scale is nontrivial. We have some code in sgkit that does this, but in practice we'd need to pull it out into its own repo and add masking support for it to work here. It would take several weeks to make something work well enough for the example here. We shouldn't need to do this to make the points we make in this paper.

Is there any bcftools query example that we could use @stallmanGEL? That's easier to mock up a local implementation of because it involves outputting much less text.

stallmanGEL commented 4 months ago

From this paper: https://www.medrxiv.org/content/10.1101/2024.04.07.24305438v1.full-text#T1 - we could easily extract the basic information in their snRNA region of interest (e.g. that they tabulate in Table 1) with a bcftools query command. For just looking at allele counts across the region, it would be something like this:

bcftools query --regions chr12:120291825-120291842 -f '%CHROM:%POS:%REF:%ALT %AC\n' chunk.vcf.gz

Which leverages the INFO/AC tag already present in the VCF file.

They also do this on specific subsets of participants (e.g. those with undiagnosed developmental disorders). For a query command, you'd have to pipe from bcftools view (which recalculates + updates the AC tag on the fly).

bcftools view --regions chr12:120291825-120291842 --samples-file subset1.txt chunk.vcf.gz | bcftools query -f '%CHROM:%POS:%REF:%ALT %AC\n'

Note I haven't tested this, nor do I have the exact subset they used! But I could probably get it (with a bit of effort looking through meta-data).
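
For the first (full-cohort) query, the Zarr side might be sketched like this (untested; variant_allele and variant_AC follow the sgkit naming convention and are assumptions, as is the path; biallelic sites assumed for the printing):

    import numpy as np
    import zarr

    root = zarr.open("chunk.zarr")  # hypothetical path
    pos = root["variant_position"][:]
    lo, hi = np.searchsorted(pos, [120291825, 120291843])

    alleles = root["variant_allele"][lo:hi]  # column 0 = REF, 1+ = ALTs
    ac = root["variant_AC"][lo:hi]           # INFO/AC, one column per ALT

    for p, a, c in zip(pos[lo:hi], alleles, ac):
        print(f"chr12:{p}:{a[0]}:{a[1]} {c[0]}")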

jeromekelleher commented 4 months ago

That sounds perfect!

We could just do it for the full pre-computed AC tag - we've already illustrated the problem with needing to recompute these tags in the subset matrix compute figure.

jeromekelleher commented 4 months ago

We would have to convert chr12 though, annoyingly... We could just say we pulled out the same size chunk of AC values from the middle of chr2, it would be fine.

stallmanGEL commented 4 months ago

Sounds good to me!

jeromekelleher commented 4 months ago

They don't actually say they used bcftools as such... Presumably the cohort-level counts are present in other interfaces in GeL, but you'd have to go to the VCFs if you want the specific samples? I know some of the authors, I could ask them how they did this bit and how painful it was.

stallmanGEL commented 4 months ago

They do say they used aggV2 specifically in the Methods, so they must have interrogated the VCFs somehow. But they also used some other resources it seems, and it also seems (from the tabulated number of people they looked at vs those in aggV2) that they counted alleles among individuals that weren't included in aggV2 as well (perhaps older probands not aligned to GRCh38).

I was going to say - Nicky is just down the corridor in the BDI, so could quickly ask!

stallmanGEL commented 4 months ago

So it looks like Yuyang's query takes the following form:

bcftools view -Ou \
    -r chr12:120293097-120293237 \
    -S /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/main_programme_v16_samples.txt \
    /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr12_119126010_121569846.vcf.gz \
    | bcftools query \
        --exclude 'GT="0" | GT="0/0" | GT="./." | GT="0/." | GT="./0" | GT="." | GQ<15 | DP<10' \
        --format '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%OLD_CLUMPED\t[\|%SAMPLE\:%GT:%AD:%missingness:%DP:%GQ]\n' \
        --output output.tsv

Pretty simple (as expected). Just grab everything among a subset of individuals (those included in the v16 release) in a particular region of chromosome 12, then create a TSV file noting the individuals with quality genotyped variants in that region, along with some quality metrics.
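
A first sketch of what Zarrifying this might look like (untested; the path and sgkit-style names are assumptions, and it covers only the genotype exclusion logic, not the TSV formatting):

    import numpy as np
    import zarr

    root = zarr.open("aggv2_chr12_chunk.zarr")  # hypothetical path
    pos = root["variant_position"][:]
    lo, hi = np.searchsorted(pos, [120293097, 120293238])

    samples = root["sample_id"][:]
    v16 = np.loadtxt("main_programme_v16_samples.txt", dtype=str)
    subset = np.isin(samples, v16)

    gt = root["call_genotype"].oindex[lo:hi, subset, :]
    gq = root["call_GQ"].oindex[lo:hi, subset]
    dp = root["call_DP"].oindex[lo:hi, subset]

    # Invert the bcftools --exclude: keep called, non-hom-ref genotypes
    # that pass the GQ/DP thresholds
    carrier = (gt >= 0).all(axis=-1) & (gt > 0).any(axis=-1) & (gq >= 15) & (dp >= 10)
    variant_idx, sample_idx = np.nonzero(carrier)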

jeromekelleher commented 4 months ago

Great, let's implement this then. @benjeffery, are you happy to pick up Zarrifying this?

jeromekelleher commented 4 months ago

This should provide a good starting point btw: https://github.com/sgkit-dev/vcf-zarr-publication/blob/main/src/zarr_afdist.py#L185

benjeffery commented 4 months ago

Yep - happy to pick this up; looks like the encode of chr2 is complete, so we can exactly recreate it.

benjeffery commented 3 months ago

@stallmanGEL As written above, considering only a small region of chr12 (8749 lines, 545MB), the bcftools command takes only 22s; 18s of that is in the first part if the pipe is left out.

This makes it not a great example I think - unless the region of inquiry was much larger.

jeromekelleher commented 3 months ago

This isn't a great example then, so let's rethink.

Maybe we should stick to the existing benchmarks for simplicity.

We do this just on one chunk of chr2, and do the same on the corresponding slice of the Zarr using a variant of this code copied into a Jupyter notebook.

So, this way we hopefully get numbers that we can run in an interactive session for a single VCF chunk, and can extrapolate the total CPU time. We can also then note the inherent difficulties with splitting things like af-dist over different chunks, and how we have a much better solution in using Xarray and Dask.

That seems like a reasonable narrative?

benjeffery commented 3 months ago

I did a quick search of papers that had filtering criteria in them. For example, this Japanese whole genome study has:

"Variants were filtered under more stringent criteria for this purpose. Individual’s genotypes were considered no calls if they had a genotype quality (GQ) of less than 20, a depth outside the range of 11 to 64, or if less than 25% of the reads supported the minor allele for het-erozygous calls. Then, sites with SNPs that had a VQSR filter of PASS, a minor allele frequency of >1%, and a call rate of >95% were retained."

Doing this with bcftools and zarr would be possible?
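
On the Zarr side, a crude sketch of those criteria might look like this (assumptions: biallelic sites, sgkit-style array names including a call_AD array for allele depths, and a hypothetical path):

    import numpy as np
    import zarr

    root = zarr.open("example.zarr")  # hypothetical path
    gt = root["call_genotype"][:]     # (variants, samples, 2), -1 = missing
    gq = root["call_GQ"][:]
    dp = root["call_DP"][:]
    ad = root["call_AD"][:]           # per-allele read depths

    het = gt[..., 0] != gt[..., 1]
    minor_frac = ad.min(axis=-1) / np.maximum(ad.sum(axis=-1), 1)
    no_call = (gq < 20) | (dp < 11) | (dp > 64) | (het & (minor_frac < 0.25))

    called = ~no_call & (gt >= 0).all(axis=-1)
    call_rate = called.mean(axis=1)
    # ...then retain sites with FILTER=PASS, MAF > 1% and call_rate > 0.95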

jeromekelleher commented 3 months ago

The trouble with these standard filtering tasks is that they output VCF, and so if we're going to do apples-to-apples we have to also output VCF. That's why we've focused on af-dist so much.

I guess we could make this point - we compute a variant mask and store it based on this or similar criteria, and compare that with the overall time to do the filtering + store the copy? And compare the size of the copy with the size of the mask (which is nothing, basically)?

That would be quite a nice point to make. Actually, I've already done something like this further up (https://github.com/sgkit-dev/vcf-zarr-publication/issues/10#issuecomment-1994194864). So shall we do that, as well as the af-dist + POS thing mentioned above? That would be a strong narrative I think.

jeromekelleher commented 3 months ago

We can do all this on one VCF file from chr2 initially btw. Pick one in the middle in case people think we can only efficiently access things linearly in the Zarr.

benjeffery commented 3 months ago

Yes, it would be good to clearly make the point that not only is it a smaller on-disk representation, but because you can store masks you don't have to keep copying the data.

jeromekelleher commented 3 months ago

Can adapt this code for the single-file Zarr analysis.

We can finish the section saying that you can do filtering stuff like this really easily with Xarray and Dask, and give a timing on how long it takes to compute the mask across the whole dataset.
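
Something like the following Dask sketch would do it (assumptions: array names as elsewhere in this thread, a hypothetical path, and dask installed; not tested at GEL scale):

    import dask.array as da
    import zarr

    root = zarr.open("gel_chr2.zarr")  # hypothetical path
    dp = da.from_zarr(root["call_DP"])
    gq = da.from_zarr(root["call_GQ"])

    # Lazy per-call mask, reduced per variant: a site passes if any sample passes
    variant_mask = ((dp > 10) & (gq > 20)).any(axis=1)

    # Store the mask alongside the data; it is tiny compared to a filtered copy
    da.to_zarr(variant_mask, "gel_chr2.zarr", component="variant_pass_mask")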

jeromekelleher commented 3 months ago

So, to recap we want to run and time:

  1. bcftools query --format "%POS\n" chunk_x.vcf.gz > pos_result.txt
  2. bcftools +af-dist chunk_x.vcf.gz
  3. bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20" chunk_x.vcf.gz -Oz > filtered.vcf.gz

jeromekelleher commented 3 months ago

@stallmanGEL - are these simple FORMAT filters fairly sane and representative? There's no point in doing anything too elaborate, it's just illustrative.

stallmanGEL commented 3 months ago

Definitely - the DP/GQ include filter with those parameters is probably the most standard filtration criteria I can think of, and is applied almost universally in my experience!

benjeffery commented 3 months ago

@stallmanGEL I've been reading the bcftools docs, but something isn't clear. For bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20" is a site included if any sample matches the criteria, or all samples?

benjeffery commented 3 months ago

Think I've found it: https://samtools.github.io/bcftools/howtos/filtering.html says "When filtering FORMAT tags, the OR logic is applied with multiple samples."

benjeffery commented 3 months ago

Getting all zeros for the bcftools afdist. This isn't the case for the zarr one. [screenshot attached]

BTW, vcfzarr does this in 4min...

jeromekelleher commented 3 months ago

Probably missing the AF tag? It very helpfully doesn't error when this is missing and keeps going. It's so disappointing when commands fail, so nice to be spared that bad news.

benjeffery commented 3 months ago

> Probably missing the AF tag? It very helpfully doesn't error when this is missing and keeps going. It's so disappointing when commands fail, so nice to be spared that bad news.

Yeah, no AF, and nothing that looks like an AF in the file. Not sure where to get one!

jeromekelleher commented 3 months ago

Yerch. I guess we just have to run bcftools +fill-tags then. Best to store the result so we don't have to do it again, and can report the timing of both steps.

stallmanGEL commented 3 months ago

> @stallmanGEL I've been reading the bcftools docs, but something isn't clear. For bcftools view -I --include "FORMAT/DP>10 & FORMAT/GQ>20" is a site included if any sample matches the criteria, or all samples?

Apologies for the delay, I was away! I think you worked it out, but this command specifies to include sites only where the criteria are met in at least one sample. & means both criteria need to be met in the same sample. So yes, if any sample matches both criteria, the site will be included in the output.

bcftools +fill-tags in.vcf.gz -Oz -o out.vcf.gz -- -t AF will do the trick for the missing AF annotation. I'm surprised af-dist somehow runs without the tag and deposits all zeroes though...

benjeffery commented 3 months ago

Thanks @stallmanGEL I figured out the fill-tags command and set it running last night. Not sure how long it will take!

benjeffery commented 3 months ago

This zip file contains the notebooks with timings for both bcftools and vcf-zarr for the 3 use cases: Comparison.zip

jeromekelleher commented 3 months ago

That's great @benjeffery. Can you open a PR with these files (and CSVs mentioned elsewhere), say in a new directory "gel-example"?

jeromekelleher commented 3 months ago

I'm going to close this now as completed, as we have a rough version in #123. Will follow up with more specific issues.

benjeffery commented 2 months ago

GeL wanted us to add dates that these analyses were run on - it was 20th May 2024.