jeromekelleher / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0

Performance problems #4

Open jeromekelleher opened 1 week ago

jeromekelleher commented 1 week ago

While performance on large genotype-only VCFs is excellent (better than bcftools in terms of throughput in my tests), it is quite poor on complicated VCFs. For chromosome 2 of the recent 1000 Genomes data I'm getting less than 1 MB per second, which is 50X less than we need (bcftools view is doing around 60 MB/s).

My sense is that it's probably not worth chasing perf here using numba. JAX also doesn't seem like a good fit. I'm actually inclined to write a C extension that follows the logic of the current buffer-based numba approach, as I think it would be less work in the long run and would get rid of the nasty latency issues involved in JIT compiling. For something like this, I think a well-written C extension is less maintenance work than fancy Python-based stuff. Once you write a C extension, it really doesn't need much maintenance.

We should do some profiling first to see where the bottlenecks are, though.
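
As a starting point for the profiling, something like the following could work. It is only a rough sketch using the standard library's `cProfile` and `pstats`; `profile_call` is a hypothetical helper and `encode_records` is a made-up stand-in workload, not anything in vcztools. The real target would be whatever function drives the VCF text output.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def encode_records(n):
    # Stand-in workload (assumption): format n tab-separated text lines
    # and return the total number of characters produced.
    return sum(len(f"{i}\t{i * 2}\n") for i in range(n))


if __name__ == "__main__":
    total, report = profile_call(encode_records, 100_000)
    print(f"encoded {total} characters")
    print(report)
```

Note that cProfile only sees Python-level calls, so time spent inside numba-compiled functions will show up as a single opaque entry. That is still enough to tell whether the time is going into the JIT-compiled encoding, into reading from Zarr, or into Python glue.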