This pull request partially implements the query --format functionality from bcftools.
This pull request closes #50.
Approach
The approach consists of two components: a parser and a generator. The parser processes the query format string and produces a format specifiers list. The generator is a function that takes the root VCF Zarr group and generates the result of the query one line at a time. The generator's initializer composes the generator according to the structure of the format specifiers list.
Parser
I implemented the parser using PyParsing, which we also used to implement the parser in #49.
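To illustrate what "format string in, format specifiers list out" could look like, here is a minimal sketch. It is not the parser in this pull request: it deliberately swaps PyParsing for the standard-library `re` module so that it runs without extra dependencies, and the names `TOKEN` and `parse_format`, as well as the tuple shapes in the returned list, are hypothetical.

```python
import re

# Hypothetical token pattern (an assumption, not the grammar used in this PR):
# a %TAG with an optional {index}, a literal "\t" or "\n" escape as typed on
# the command line, or a run of any other literal text.
TOKEN = re.compile(
    r"%(?P<tag>[A-Za-z_][\w/]*)(?:\{(?P<index>\d+)\})?"
    r"|(?P<tab>\\t)"
    r"|(?P<newline>\\n)"
    r"|(?P<text>[^%\\]+)"
)


def parse_format(fmt):
    """Turn a query format string into a flat list of specifier tuples."""
    specs = []
    pos = 0
    while pos < len(fmt):
        match = TOKEN.match(fmt, pos)
        if match is None:
            raise ValueError(f"cannot parse format string at offset {pos}")
        if match.group("tag"):
            index = match.group("index")
            # ("field", tag name, optional subfield index)
            specs.append(
                ("field", match.group("tag"), None if index is None else int(index))
            )
        elif match.group("tab"):
            specs.append(("literal", "\t"))
        elif match.group("newline"):
            specs.append(("literal", "\n"))
        else:
            specs.append(("literal", match.group("text")))
        pos = match.end()
    return specs
```

For example, `parse_format(r"%REF\t%ALT{0}\n")` yields a field specifier for `REF`, a tab literal, a field specifier for `ALT` with index 0, and a newline literal.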
Generator
The generator uses Python generators to yield query results one variant position at a time. This approach allows Python to iterate over each Zarr array's chunks independently. The high-level generator zips generators for each of the format specifiers and joins the results to produce a line for each variant position.
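The zip-and-join idea described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this pull request: plain Python lists stand in for the Zarr arrays, the format specifiers are assumed to be simple tuples, empty strings are assumed to mark padding in multi-valued fields, and the names `field_generator` and `query_lines` are hypothetical.

```python
import itertools


def field_generator(values, index=None):
    """Yield one formatted string per variant for a single field."""
    for value in values:
        if index is not None:
            # Subfield indexing, e.g. %ALT{0}.
            value = value[index]
        if isinstance(value, (list, tuple)):
            # Assumption: "" marks padding; an all-padding value prints ".".
            yield ",".join(str(v) for v in value if v != "") or "."
        else:
            yield str(value)


def query_lines(columns, specs):
    """Zip one generator per format specifier and join their outputs."""
    generators = []
    for spec in specs:
        if spec[0] == "field":
            _, tag, index = spec
            generators.append(field_generator(columns[tag], index))
        else:
            # Literals repeat forever; zip stops when a field is exhausted.
            # Note: a format with no field specifiers would never terminate
            # in this sketch.
            generators.append(itertools.repeat(spec[1]))
    for parts in zip(*generators):
        yield "".join(parts)


# Stand-in data: each key plays the role of one per-variant Zarr array.
columns = {"REF": ["A", "T"], "ALT": [["G", "T"], ["", ""]]}
specs = [
    ("field", "REF", None),
    ("literal", "\t"),
    ("field", "ALT", None),
    ("literal", "\n"),
]
lines = list(query_lines(columns, specs))
```

Because each field has its own generator, each underlying array can be read at its own pace, and only the top-level `zip` needs to advance them in lockstep.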
Query format language
This implementation does not support the full query format language that bcftools supports.
Here is what this implementation should support:
any variant-site-level field except the full INFO field,
newline characters,
tab characters,
subfield indexing with curly brackets.
This implementation does not support looping over samples at a variant site. Additionally, some format specifiers supported by bcftools are recognized by this implementation's parser but lead to an error in the generator (e.g. %END0).
Testing
I added unit tests and validation tests for these changes, and I ran the test suite to check that the changes are well covered.
Example usage
vcztools query vcz_test_cache/sample.vcf.vcz -f "%REF\t%ALT\n"
A C
A G
G A
T A
A G,T
T .
G GA,GAC
T .
AC A,ATG,C