pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Accessing PL field is slow when number of samples gets large #1280

Open astaric opened 4 months ago

astaric commented 4 months ago

Using the master version of pysam and the following benchmark script: https://github.com/astaric/pysam/blob/speed-up-bcf-genotype-count/tests/VariantRecordPL_bench.py I get the following runtimes for accessing the PL field for all samples in a single record:

---------------------------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                        Max                       Mean                StdDev                     Median                   IQR            Outliers          OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_access_pl_values_10_samples                 38.2920 (1.0)             185.3330 (1.0)              41.2211 (1.0)          8.9669 (inf)              39.9170 (1.0)          2.3340 (inf)         15;33  24,259.4229 (1.0)         733           1
test_access_pl_values_100_samples               136.3750 (3.56)            546.6249 (2.95)            142.9587 (3.47)        12.5355 (inf)             139.7920 (3.50)         3.6876 (inf)       119;289   6,995.0286 (0.29)       3296           1
test_access_pl_values_1000_samples            3,246.3749 (84.78)         8,651.0000 (46.68)         3,340.0859 (81.03)      425.3084 (inf)           3,273.8131 (82.02)       87.7501 (inf)           2;8     299.3935 (0.01)        274           1
test_access_pl_values_10000_samples         257,407.6669 (>1000.0)     270,355.5422 (>1000.0)     261,211.9690 (>1000.0)  6,149.1067 (inf)         258,542.3335 (>1000.0)  7,216.5631 (inf)           1;0       3.8283 (0.00)          4           1
test_access_pl_values_100000_samples     25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)      0.0000 (1.0)      25,141,500.8749 (>1000.0)      0.0000 (1.0)           0;0       0.0398 (0.00)          1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When number of samples gets large, having 10x the samples results in 100x slower execution which indicates quadratic complexity of the operation.

I looked into the codebase and found that most of the time is spent in bcf_get_genotypes. This function is called as part of computing the number of elements in the PL field before each access to the field. bcf_get_genotypes creates an array of GT information for all samples, which is then used to compute max_ploidy. The only part of the array being accessed is the GT information for "current" sample, the rest is discarded.

I am not really sure what the idea behind this max_ploidy check is, if it could be removed one could just access the GT array for the current sample and count the number of values to get the ploidy, which should be much faster than creating the array for all samples.