sgkit-dev / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0

vcztools view: write performance #94

Open Will-Tyler opened 1 month ago

Will-Tyler commented 1 month ago

Description

As documented in #93, vcztools view is not running as fast as bcftools view on real genome data. I reproduce the performance data below.

This issue tracks understanding the performance gap and implementing optimizations to close it.

Results

cd performance
python -m compare 1
bcftools view data/chr22.vcf.gz
1.67GiB 0:00:10 [ 156MiB/s] [<=>]

real    0m10.931s
user    0m10.356s
sys     0m0.522s

vcztools view data/chr22.vcz
1.67GiB 0:00:40 [42.4MiB/s] [<=>]

real    0m40.290s
user    0m31.387s
sys     0m9.024s

Profiling

import cProfile, os
from pstats import SortKey
from vcztools.vcf_writer import write_vcf  # assumed import path

cProfile.run('write_vcf("performance/data/chr22.vcz", os.devnull)', sort=SortKey.TIME)
         182103 function calls in 39.822 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21541   20.241    0.001   20.241    0.001 {method 'encode' of '_vcztools.VcfEncoder' objects}
      164    7.864    0.048    7.870    0.048 core.py:2343(_decode_chunk)
       26    6.919    0.266    6.919    0.266 {method 'astype' of 'numpy.ndarray' objects}
      220    2.659    0.012   10.533    0.048 core.py:2013(_process_chunk)
        1    0.810    0.810   39.822   39.822 vcf_writer.py:80(write_vcf)
      140    0.498    0.004   11.178    0.080 core.py:2108(_chunk_getitems)
      140    0.310    0.002   11.490    0.082 core.py:1316(_get_selection)
        3    0.162    0.054   38.963   12.988 vcf_writer.py:284(c_chunk_to_vcf)
      499    0.159    0.000    0.159    0.000 {method 'read' of '_io.BufferedReader' objects}
    25005    0.115    0.000    0.115    0.000 {built-in method builtins.print}
      500    0.011    0.000    0.011    0.000 {built-in method io.open}

Using gperftools' CPU profiler:

Total: 3665 samples
     707  19.3%  19.3%      707  19.3% _all_missing
     425  11.6%  30.9%      425  11.6% _write_entry
     419  11.4%  42.3%      420  11.5% _int32_write_entry
     381  10.4%  52.7%      396  10.8% _ZSTD_decompressBlock_internal
     262   7.1%  59.9%      262   7.1% _OBJECT_getitem
     203   5.5%  65.4%      221   6.0% _allocate_from_new_pool
      98   2.7%  68.1%       98   2.7% 0x0000000197f55f5c
      75   2.0%  70.1%       77   2.1% _clear_object_strided_loop
      70   1.9%  72.0%       72   2.0% _OBJECT_setitem
      67   1.8%  73.9%       67   1.8% 0x0000000197f55f94
      64   1.7%  75.6%      637  17.4% ___pyx_pw_9numcodecs_4vlen_8VLenUTF8_5decode
      61   1.7%  77.3%       62   1.7% _int16_write_entry
      58   1.6%  78.9%       58   1.6% _PyArray_FillObjectArray
      54   1.5%  80.3%      106   2.9% _prepare_index
      48   1.3%  81.6%       48   1.3% 0x0000000197f561f8
      46   1.3%  82.9%       46   1.3% 0x0000000197f55eb0
      39   1.1%  84.0%       39   1.1% 0x0000000197f56200
      39   1.1%  85.0%      383  10.5% _array_assign_subscript
      34   0.9%  85.9%       34   0.9% 0x0000000197f56360
      30   0.8%  86.8%       30   0.8% 0x0000000197f55f78
      28   0.8%  87.5%       28   0.8% 0x0000000197f55f40
      24   0.7%  88.2%       24   0.7% _blosc_internal_bshuf_shuffle_bit_eightelem_scal
      23   0.6%  88.8%       23   0.6% __strided_to_strided_copy_references
      22   0.6%  89.4%      245   6.7% _unicode_decode_utf8


Will-Tyler commented 1 month ago

Reducing the number of astype calls might help. Most of the astype calls convert Python object or Unicode string arrays to fixed-length character arrays. I wonder whether we could avoid converting these arrays and instead access the elements in C code as Python objects to get the string data.
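For illustration, a minimal sketch of the conversion in question (the array contents are made up):

import numpy as np

# Hypothetical illustration of the conversion being profiled: Zarr
# returns variable-length strings as a Python object array, which is
# then copied into a fixed-width bytes array before the C encoder
# reads it.
alleles = np.array(["A", "T", "CAT", "GATTACA"], dtype=object)

# astype allocates a new array sized for the longest element and
# copies every string; for large string fields this cost shows up
# as the astype entries in the profiles above.
fixed = alleles.astype("S")  # dtype becomes |S7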

gperftools shows a lot of activity in the all_missing and write_entry functions.

Finally, I checked how often Python was doubling the encoding buffer (see here). I recorded 27 doublings, which is probably not significant given that there are over 21k variants.


jeromekelleher commented 1 month ago

I don't think there's a lot we can do here, as I kept a close eye on write performance when developing the C code. Basically, this is about as fast as it will get without making the C code a lot more complicated.

The "good enough" metric that I was going for here is to produce VCF text at a rate that's greater than bcftools can consume it in a pipeline. I think we're probably within that limit here?

42MB/s is disappointing though, I wonder if this is mostly due to PL fields or something.

Will-Tyler commented 1 month ago

I tried deleting the PL field from the VCZ version:

bcftools view data/chr22.vcf.gz
1.67GiB 0:00:10 [ 163MiB/s] [<=>]

real    0m10.485s
user    0m10.116s
sys     0m0.462s

vcztools view data/chr22.vcz
1.01GiB 0:00:34 [29.6MiB/s] [<=>]

real    0m34.843s
user    0m28.462s
sys     0m5.968s

There is a slight improvement in the total runtime.

When I run this, vcztools writes the output in bursts. I think vcztools alternates between decoding the chunks and writing them. vcztools may benefit from parallelism here: reading chunks from multiple arrays simultaneously, and reading the next set of chunks while writing output.

jeromekelleher commented 1 month ago

Yes, some sort of double-buffering approach where we decode the next variant chunk in the background while the current chunk is being written to output would definitely improve things a lot. The initial latency is still pretty horrible, but I think that's a function of our current chunk-size defaults which are too big in the variants dimension.

Will-Tyler commented 1 month ago

Just collecting some notes here...

PEP-703 explains the challenges in achieving true parallelism in Python. However, Zarr supports multi-threaded parallelism by releasing the global interpreter lock whenever possible during compression and decompression operations (source). Therefore, we should be able to achieve the desired parallelism using multiple threads. Ideally, we want to avoid multiple processes because of the overhead of starting a new Python process (~50 ms) and the need to share memory.
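A minimal sketch of the idea, assuming the Zarr v2 API and using example field names from the VCF Zarr spec:

import zarr
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: numcodecs releases the GIL during
# decompression, so decoding chunks from several arrays in threads
# can overlap their decompression work. The path matches the
# benchmark above; the field names are examples.
root = zarr.open("performance/data/chr22.vcz", mode="r")
fields = ["variant_position", "variant_allele", "call_genotype"]

def read_slice(name):
    # Slicing decodes the underlying compressed chunks; the heavy
    # decompression runs with the GIL released.
    return name, root[name][:1000]

with ThreadPoolExecutor(max_workers=len(fields)) as pool:
    decoded = dict(pool.map(read_slice, fields))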


jeromekelleher commented 1 month ago

I think a decode thread operating on a double buffer system (i.e. we decode into one buffer while the main thread writes out vcf from the other) would work well here, as we are dominated by decompression time and this does thread well.
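A hedged sketch of that scheme, with decode_chunk and write_chunk standing in for the existing Zarr-decode and C-encode steps:

import queue
import threading

# Hypothetical sketch of the double-buffer pipeline: a background
# thread decodes variant chunk i+1 while the main thread writes
# chunk i out as VCF text.
def pipelined_write(num_chunks, decode_chunk, write_chunk):
    buffers = queue.Queue(maxsize=1)  # at most one chunk decoded ahead

    def decoder():
        for i in range(num_chunks):
            buffers.put(decode_chunk(i))  # blocks until the writer frees a slot
        buffers.put(None)  # sentinel: no more chunks

    threading.Thread(target=decoder, daemon=True).start()
    while (chunk := buffers.get()) is not None:
        write_chunk(chunk)  # overlaps with decoding of the next chunk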

Will-Tyler commented 3 weeks ago

> The initial latency is still pretty horrible, but I think that's a function of our current chunk-size defaults which are too big in the variants dimension.

I think you are right. I noticed that Python's memory consumption was exceeding the physical RAM available on my device, so I changed the variant chunk size to 1,000. With this smaller chunk size, Python's memory consumption stays within the amount of RAM available on my device. This improves the performance a lot and makes the output less bursty.

bcftools view data/chr22.vcf.gz
1.67GiB 0:00:10 [ 162MiB/s] [<=>]

real    0m10.530s
user    0m10.166s
sys     0m0.461s

vcztools view data/chr22.vcz
1.67GiB 0:00:24 [68.5MiB/s] [<=>]

real    0m24.951s
user    0m23.761s
sys     0m3.242s

The profiler still shows the most activity in encoding, decoding, and converting types:

         1341181 function calls (1320786 primitive calls) in 24.297 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21706    9.049    0.000    9.049    0.000 {method 'encode' of '_vcztools.VcfEncoder' objects}
      981    5.739    0.006    5.755    0.006 core.py:2343(_decode_chunk)
      183    5.702    0.031    5.702    0.031 {method 'astype' of 'numpy.ndarray' objects}
     1569    0.979    0.001    6.754    0.004 core.py:2013(_process_chunk)
     3653    0.446    0.000    0.446    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.408    0.408   23.472   23.472 vcf_writer.py:80(write_vcf)
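For reference, the rechunking described above can be done by rewriting each array, along these lines (a sketch assuming the zarr v2 API; the field name and destination path are placeholders):

import zarr

# Hypothetical sketch of rewriting one field with a smaller variant
# chunk size (1,000 here). The whole field is loaded into memory for
# brevity; a real rewrite would copy chunk by chunk.
src = zarr.open("performance/data/chr22.vcz", mode="r")
dst = zarr.open("performance/data/chr22_rechunked.vcz", mode="a")
arr = src["call_PL"]
out = dst.create_dataset(
    "call_PL",
    shape=arr.shape,
    chunks=(1000,) + arr.chunks[1:],
    dtype=arr.dtype,
    overwrite=True,
)
out[:] = arr[:]  # re-encodes the data with the new chunking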