Will-Tyler opened this issue 1 month ago
Reducing the number of `astype` calls might help. Most of the `astype` calls convert Python object or Unicode string arrays to fixed-length character arrays. I wonder if it is possible to avoid converting these arrays and instead access the elements in C code as Python objects to get the string data.
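For context, the conversion being discussed looks roughly like this (an illustrative sketch, not the actual vcztools code; the field name is made up):

```python
import numpy as np

# Variable-length Python strings stored as an object array,
# as you might get back for a string-typed VCF field.
ids = np.array(["rs123", "rs4567890", "."], dtype=object)

# astype copies every element into a fixed-length byte-string array,
# encoding and padding each Python string along the way.
fixed = ids.astype("S9")

print(fixed.dtype)  # |S9
```

Each such conversion walks the whole array and allocates a new buffer, which is why these calls show up prominently in the profile.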
gperftools shows a lot of activity in the `all_missing` and `write_entry` methods.
Finally, I checked how often Python was doubling the encoding buffer (see here). I recorded 27 doublings, which is probably not significant given that there are over 21k variants.
I don't think there's a lot we can do here, as I had a close eye on write performance when I was developing the C code. Basically, this is about as fast as it will get without making the C code a lot more complicated.
The "good enough" metric that I was going for here is to produce VCF text at a rate that's greater than bcftools can consume it in a pipeline. I think we're probably within that limit here?
42 MB/s is disappointing, though; I wonder if this is mostly due to PL fields or something similar.
I tried deleting the PL field from the VCZ version:
```
bcftools view data/chr22.vcf.gz
1.67GiB 0:00:10 [ 163MiB/s] [ <=> ]

real    0m10.485s
user    0m10.116s
sys     0m0.462s

vcztools view data/chr22.vcz
1.01GiB 0:00:34 [29.6MiB/s] [ <=> ]

real    0m34.843s
user    0m28.462s
sys     0m5.968s
```
There is a slight improvement.
When I run this, vcztools writes the output in bursts: it appears to spend time decoding a set of chunks and then writing them out. vcztools may benefit from parallelism here, reading chunks from multiple arrays simultaneously and fetching the next set of chunks while writing output.
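One way to overlap the per-array chunk reads is a thread pool, since Zarr releases the GIL during decompression so the threads genuinely run in parallel. A sketch (`decode_chunk` and the array names are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(name, start):
    # Placeholder for reading and decompressing one chunk of one Zarr array;
    # the real work would release the GIL inside the codec.
    return name, list(range(start, start + 3))

array_names = ["variant_position", "call_genotype", "call_PL"]  # hypothetical

with ThreadPoolExecutor(max_workers=len(array_names)) as pool:
    futures = [pool.submit(decode_chunk, name, 0) for name in array_names]
    chunks = dict(f.result() for f in futures)

print(sorted(chunks))  # ['call_PL', 'call_genotype', 'variant_position']
```

The same pool could also prefetch the next variant-chunk index while the current one is being written.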
Yes, some sort of double-buffering approach where we decode the next variant chunk in the background while the current chunk is being written to output would definitely improve things a lot. The initial latency is still pretty horrible, but I think that's a function of our current chunk-size defaults which are too big in the variants dimension.
Just collecting some notes here...
PEP-703 explains the challenges in achieving true parallelism in Python. However, Zarr supports multi-threaded parallelism by releasing the global interpreter lock whenever possible during compression and decompression operations (source). Therefore, we should be able to achieve the desired parallelism by using multiple threads to perform tasks. Ideally, we do not want to use multiple processes due to the additional overhead in starting a new Python process (~50 ms) and the need to share memory.
I think a decode thread operating on a double buffer system (i.e. we decode into one buffer while the main thread writes out vcf from the other) would work well here, as we are dominated by decompression time and this does thread well.
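The double-buffer idea above can be sketched with a bounded queue: a background thread decodes chunks and the main thread drains them, so decode and write overlap with at most one chunk buffered ahead (the `decode` function is a stand-in for Zarr decompression):

```python
import queue
import threading

def decode(chunk_index):
    # Stand-in for decompressing one variant chunk from Zarr.
    return [chunk_index * 10 + i for i in range(3)]

def decoder(n_chunks, buf):
    # Background thread: keeps at most one decoded chunk ahead of the writer.
    for i in range(n_chunks):
        buf.put(decode(i))
    buf.put(None)  # sentinel: no more chunks

buf = queue.Queue(maxsize=1)  # one chunk in flight while another is being written
threading.Thread(target=decoder, args=(4, buf), daemon=True).start()

written = []
while (chunk := buf.get()) is not None:
    written.extend(chunk)  # stand-in for writing VCF text to stdout

print(len(written))  # 12
```

The `maxsize=1` bound is what makes this double buffering rather than unbounded prefetch: memory stays capped at two decoded chunks.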
> The initial latency is still pretty horrible, but I think that's a function of our current chunk-size defaults which are too big in the variants dimension.
I think you are right. I noticed that Python's memory consumption was exceeding the physical RAM available on my device, so I changed the variant chunk size to 1,000. With this smaller chunk size, Python's memory consumption stays within the amount of RAM available on my device. This improves the performance a lot and makes the output less bursty.
```
bcftools view data/chr22.vcf.gz
1.67GiB 0:00:10 [ 162MiB/s] [ <=> ]

real    0m10.530s
user    0m10.166s
sys     0m0.461s

vcztools view data/chr22.vcz
1.67GiB 0:00:24 [68.5MiB/s] [ <=> ]

real    0m24.951s
user    0m23.761s
sys     0m3.242s
```
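For intuition on why shrinking the variant chunk helps, a back-of-the-envelope estimate of the decoded working set for a single field (every figure below is an assumption for illustration, not a measurement):

```python
# Hypothetical sizing: decoded chunks for all fields are held in memory at once.
n_samples = 2504        # assumption: a 1000 Genomes-scale cohort
pl_values = 3           # assumption: PL entries per genotype at biallelic sites
bytes_per_value = 4     # assumption: int32 after decode

for chunk_variants in (10_000, 1_000):
    pl_bytes = chunk_variants * n_samples * pl_values * bytes_per_value
    print(f"{chunk_variants:>6} variants/chunk -> {pl_bytes / 1e6:.0f} MB just for PL")
```

At 10,000 variants per chunk this one field alone needs roughly 300 MB decoded; multiply by the number of FORMAT fields and buffers and it is easy to exceed physical RAM, whereas 1,000-variant chunks keep the working set an order of magnitude smaller.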
The profiler still shows the most activity in encoding, decoding, and converting types:
```
1341181 function calls (1320786 primitive calls) in 24.297 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 21706    9.049    0.000    9.049    0.000 {method 'encode' of '_vcztools.VcfEncoder' objects}
   981    5.739    0.006    5.755    0.006 core.py:2343(_decode_chunk)
   183    5.702    0.031    5.702    0.031 {method 'astype' of 'numpy.ndarray' objects}
  1569    0.979    0.001    6.754    0.004 core.py:2013(_process_chunk)
  3653    0.446    0.000    0.446    0.000 {method 'read' of '_io.BufferedReader' objects}
     1    0.408    0.408   23.472   23.472 vcf_writer.py:80(write_vcf)
```
Description
As documented in #93, vcztools view is not running as fast as bcftools view on real genome data. The performance data is reproduced below.
This issue tracks understanding the performance and implementing optimizations to improve the performance.
Results
Profiling
Using gperftools' CPU profiler: