pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Speed up bcf genotype count #1281

Open astaric opened 4 months ago

astaric commented 4 months ago

Replace bcf_get_genotypes in bcf_genotype_count with implementation that does not require reading GT data for all samples. Code for computing the ploidy is based on the bcf_format_get_allele_indices function for getting the values in GT field but it just counts instead of reading the values.

This is an attempt at addressing #1280, but feel free to decline the PR if there is a better way to address this issue. Also not sure if I am allowed to add the benchmark tests since others are mostly comparisons with other implementations.

Before:

---------------------------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------------------------
Name (time in us)                                    Min                        Max                       Mean                StdDev                     Median                   IQR            Outliers          OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_access_pl_values_10_samples                 38.2920 (1.0)             185.3330 (1.0)              41.2211 (1.0)          8.9669 (inf)              39.9170 (1.0)          2.3340 (inf)         15;33  24,259.4229 (1.0)         733           1
test_access_pl_values_100_samples               136.3750 (3.56)            546.6249 (2.95)            142.9587 (3.47)        12.5355 (inf)             139.7920 (3.50)         3.6876 (inf)       119;289   6,995.0286 (0.29)       3296           1
test_access_pl_values_1000_samples            3,246.3749 (84.78)         8,651.0000 (46.68)         3,340.0859 (81.03)      425.3084 (inf)           3,273.8131 (82.02)       87.7501 (inf)           2;8     299.3935 (0.01)        274           1
test_access_pl_values_10000_samples         257,407.6669 (>1000.0)     270,355.5422 (>1000.0)     261,211.9690 (>1000.0)  6,149.1067 (inf)         258,542.3335 (>1000.0)  7,216.5631 (inf)           1;0       3.8283 (0.00)          4           1
test_access_pl_values_100000_samples     25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)  25,141,500.8749 (>1000.0)      0.0000 (1.0)      25,141,500.8749 (>1000.0)      0.0000 (1.0)           0;0       0.0398 (0.00)          1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------------------------------------------------------- benchmark: 5 tests --------------------------------------------------------------------------------------------------------
Name (time in us)                                Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers          OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_access_pl_values_10_samples             37.2080 (1.0)         579.0000 (1.59)         41.2996 (1.0)         20.2604 (2.46)         38.2080 (1.0)          2.3339 (1.06)        28;85  24,213.2972 (1.0)        1115           1
test_access_pl_values_100_samples            97.6250 (2.62)        364.2500 (1.0)         101.1441 (2.45)         8.2463 (1.0)          99.4590 (2.60)         2.2089 (1.0)       121;508   9,886.8864 (0.41)       5262           1
test_access_pl_values_1000_samples          684.9580 (18.41)     6,552.7500 (17.99)       728.7534 (17.65)      365.4035 (44.31)       693.0421 (18.14)        8.1980 (3.71)        7;107   1,372.2062 (0.06)        905           1
test_access_pl_values_10000_samples       6,489.2501 (174.40)   11,990.0419 (32.92)     7,187.3636 (174.03)   1,384.8602 (167.94)    6,695.7921 (175.25)     236.6672 (107.14)      15;20     139.1331 (0.01)        138           1
test_access_pl_values_100000_samples     77,245.3330 (>1000.0)  83,110.0838 (228.17)   78,858.3527 (>1000.0)  1,470.8695 (178.37)   78,792.9581 (>1000.0)  1,061.1972 (480.43)        2;1      12.6810 (0.00)         13           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------