sgkit-dev / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0

Query format #64

Closed Will-Tyler closed 2 months ago

Will-Tyler commented 2 months ago

Overview

This pull request partially implements the query --format functionality from bcftools.

This pull request closes #50.

Approach

The approach consists of two components: a parser and a generator. The parser processes the query format string and produces a list of format specifiers. The generator takes the root VCF Zarr group and yields the result of the query one line at a time. The generator's initializer composes the generator according to the structure of the format specifier list.

Parser

I implement the parser using pyparsing, which we also used to implement a parser in #49.
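To illustrate the kind of output the parser produces, here is a minimal sketch that turns a format string into a list of format specifiers. It is not the code in this pull request: it uses the standard-library `re` module as a stand-in for pyparsing, and the names `parse_format`, `_TOKEN`, and `_ESCAPES`, as well as the `("tag", ...)` / `("text", ...)` tuple shape, are assumptions made for illustration only.

```python
import re

# Hypothetical token pattern: a %TAG specifier, or a run of literal text.
_TOKEN = re.compile(r"%(?P<tag>[A-Za-z_][A-Za-z0-9_]*)|(?P<text>[^%]+)")

# Escape sequences that arrive as two literal characters from a quoted shell argument.
_ESCAPES = {r"\t": "\t", r"\n": "\n"}


def parse_format(fmt: str) -> list[tuple[str, str]]:
    """Split a query format string into ("tag", name) and ("text", literal) pairs."""
    specifiers = []
    pos = 0
    while pos < len(fmt):
        match = _TOKEN.match(fmt, pos)
        if match is None:
            # For example, a lone "%" with no tag name after it.
            raise ValueError(f"cannot parse format string at offset {pos}: {fmt[pos:]!r}")
        if match.group("tag") is not None:
            specifiers.append(("tag", match.group("tag")))
        else:
            text = match.group("text")
            for escaped, real in _ESCAPES.items():
                text = text.replace(escaped, real)
            specifiers.append(("text", text))
        pos = match.end()
    return specifiers
```

With this sketch, `parse_format(r"%REF\t%ALT\n")` returns `[("tag", "REF"), ("text", "\t"), ("tag", "ALT"), ("text", "\n")]`. The sketch deliberately ignores the rest of the bcftools format language, such as subscripts and sample loops.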

Generator

The generator uses Python generators to yield query results one variant position at a time. This approach allows Python to iterate over each Zarr array's chunks independently. The high-level generator zips generators for each of the format specifiers and joins the results to produce a line for each variant position.
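The zip-and-join idea can be sketched as follows. This is a simplified illustration, not the implementation in this pull request: plain lists of chunks stand in for chunked Zarr arrays, format specifiers are assumed to be `("tag", name)` or `("text", literal)` tuples, and the function names `field_values`, `format_value`, and `query_lines` are hypothetical.

```python
from itertools import chain, repeat


def format_value(value):
    """Render a scalar, or a list of alleles as a comma-separated string."""
    if isinstance(value, (list, tuple)):
        return ",".join(value) if value else "."
    return str(value)


def field_values(spec, columns):
    """Return an iterator of strings for one format specifier."""
    kind, value = spec
    if kind == "text":
        # Literal text is the same on every line.
        return repeat(value)
    # Each column is a list of chunks (a stand-in for a chunked Zarr array);
    # chaining the chunks walks this column one chunk at a time.
    return (format_value(v) for v in chain.from_iterable(columns[value]))


def query_lines(specifiers, columns):
    """Zip one iterator per specifier and join each row into a line."""
    if not any(kind == "tag" for kind, _ in specifiers):
        # Without a finite column, zip over repeat() would never stop.
        raise ValueError("format needs at least one tag specifier")
    iterators = [field_values(spec, columns) for spec in specifiers]
    for parts in zip(*iterators):
        yield "".join(parts)


# Two chunks per column: the first holds two variants, the second holds one.
columns = {
    "REF": [["A", "G"], ["T"]],
    "ALT": [[["C"], ["G", "T"]], [[]]],
}
specifiers = [("tag", "REF"), ("text", "\t"), ("tag", "ALT"), ("text", "\n")]
lines = list(query_lines(specifiers, columns))
# lines == ["A\tC\n", "G\tG,T\n", "T\t.\n"]
```

Because each column has its own iterator, the chunk boundaries of one column do not need to line up with those of another; `zip` only requires that every column yields one value per variant.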

Query format language

This implementation does not support the full query format language that bcftools supports.

Here is what this implementation should support:

This implementation does not support looping over samples at a variant site. Additionally, some format specifiers supported by bcftools are recognized by this implementation's parser but lead to an error in the generator (e.g. %END0).

Testing

I added unit tests and validation tests along with my changes, and I ran the test suite to check that the changes are well covered.

Example usage

vcztools query vcz_test_cache/sample.vcf.vcz -f "%REF\t%ALT\n"
A       C
A       G
G       A
T       A
A       G,T
T       .
G       GA,GAC
T       .
AC      A,ATG,C


tomwhite commented 2 months ago

I'm going to merge this now. Another follow-up would be to add expression, region, and sample filtering to query.

Will-Tyler commented 2 months ago

I'll be interested to see how this approach performs. Thanks all for reviewing!