sgkit-dev / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0

vcztools view: INFO field computation performance #79

Open Will-Tyler opened 2 months ago

Will-Tyler commented 2 months ago

Description

When the user specifies a sample selection in vcztools view, vcztools recalculates the AC and AN INFO fields. This is consistent with bcftools' behavior. vcztools calculates these INFO fields using all of the samples in a variant-wise chunk of genotype data. The current implementation, written in pure Python with NumPy, may be slow and may add significant memory overhead. This issue tracks improving the computational and memory efficiency of that calculation. The solution may require calculating AC and AN in a C extension module.
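For context, here is a minimal NumPy sketch of the kind of calculation being discussed. It is not the code from #77. The function name compute_ac_an, the (variants, samples, ploidy) array layout, and the convention that negative values mark missing or padded alleles are all assumptions made for illustration.

```python
import numpy as np


def compute_ac_an(genotypes, num_alt):
    """Return (AC, AN) for one chunk of genotype calls.

    Hypothetical helper for illustration only.
    genotypes: int array of shape (variants, samples, ploidy).
    Assumption: negative values mark missing or padded alleles.
    num_alt: number of ALT alleles to count per variant.
    """
    num_variants = genotypes.shape[0]
    # Flatten samples and ploidy so each row holds every allele call of a variant.
    flat = genotypes.reshape(num_variants, -1)
    # AN: number of called (non-negative) alleles per variant.
    an = (flat >= 0).sum(axis=1)
    # AC: one count per ALT allele (allele indices 1..num_alt).
    ac = np.zeros((num_variants, num_alt), dtype=np.int64)
    for allele in range(1, num_alt + 1):
        # Each comparison allocates a temporary boolean array the size of
        # the chunk, which is one source of memory overhead in this approach.
        ac[:, allele - 1] = (flat == allele).sum(axis=1)
    return ac, an


# Two variants, three diploid samples; -1 marks a missing call.
gt = np.array(
    [
        [[0, 1], [1, 1], [-1, -1]],
        [[0, 2], [0, 0], [1, 2]],
    ]
)
ac, an = compute_ac_an(gt, num_alt=2)

# Recompute after selecting samples 0 and 2, as a sample selection would require.
sub_ac, sub_an = compute_ac_an(gt[:, [0, 2]], num_alt=2)
```

A version in C could make a single pass over each variant's calls and update the counts in place, avoiding the per-allele temporary arrays.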

The original code was added in #77.

jeromekelleher commented 2 months ago

We already have a C extension module, so it wouldn't be that hard to update it to include computing AC and AN.

jeromekelleher commented 2 months ago

See https://github.com/sgkit-dev/vcztools/pull/77#issuecomment-2334553173 for details on the slowdown.

jeromekelleher commented 7 hours ago

Putting this in the initial release milestone for now, can triage out later if it's not critical.