samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Modified numLikelihoods calculation for performance #1447

Closed KevinCLydon closed 4 years ago

KevinCLydon commented 4 years ago

Fixes #1446

Description

Please explain the changes you made here. Explain the motivation for making this change. What existing problem does the pull request solve?

Things to think about before submitting:

lbergelson commented 4 years ago

@KevinCLydon @droazen This adds a new dependency which isn't ideal to avoid implementing 1 function which already exists in our codebase. The implementation in the apache library is also recursive so it's not clear to me that that is going to be a performance improvement.

Without profiling showing this is a meaningful improvement I don't think we should make this change. If we believe the apache implementation is better than ours maybe we can adopt their function without including the entirety of commons-math3:3.3.

droazen commented 4 years ago

@lbergelson If you examine the Apache implementation, you'll find that it makes many fewer recursive calls. The recursion in our implementation happens within a loop, and can produce a crazy number of recursive calls with realistic values for the number of alleles and ploidy.

I think we should certainly do some benchmarking to illustrate the problems with the current implementation, and to test whether the Apache implementation is actually faster.

I know that HTSJDK is opposed in general to new dependencies, but if we want to maintain a high level of code quality we really shouldn't be implementing our own subpar versions of standard mathematical functions, when high-quality and well-tested implementations exist in standard libraries, and these libraries themselves are reasonably lightweight.

lbergelson commented 4 years ago

@droazen It does look more efficient than our implementation. I would still rather import that 1 function than add all of commons-math3. All of htsjdk and it's dependencies right now are 4.6 mb, with math 3 it's 6.5, so it's a large proportional increase in stuff. If there was a whole bunch of stuff we wanted from it then I would say import the whole thing, but we just want 1 small function, why not just take that 1? It's not like we don't have apache licensed stuff in htsjdk already....

Commons math is essentially abandonware, so it's not like it's getting regular updates anyway.

vruano commented 4 years ago

In gatk we have such calculation effectively cached in a big bidimensional matrix although in GenotypeLikelihoodsCalculator(s) but is a bit obfuscated because that/those classes do more things apart from that.

If this is something that is done repetively I would say that caching is the better alternative in terms of performance. I don't think in practice would use that much memory and time to construct. No apache-common dependency required.

vruano commented 4 years ago

Perhaps you would like to porr that GATK class/classes in a "decaffeinated" form into htsjdk. I created it in the first place in order to accommodate for non diploid large allele-number calculations so a deficiency in htsjdk back in the day and perhaps now as well.

vruano commented 4 years ago

I think it could benefit of some refactoring ... but part of the design idea that you get a calculator for a specific ploidy and number of alleles and then you ask that calculator several things, for example the number of genotypes is that often is the case that ofthen you are working with a specific combination of ploidy/allele-number and you are better of querying this info to a object that is efficient responding to queries related to that combination as supposed to some static object that must handle any combo.

codecov-io commented 4 years ago

Codecov Report

Merging #1447 into master will decrease coverage by 0.027%. The diff coverage is 33.582%.

@@               Coverage Diff               @@
##              master     #1447       +/-   ##
===============================================
- Coverage     68.396%   68.369%   -0.027%     
- Complexity      8497      8625      +128     
===============================================
  Files            583       591        +8     
  Lines          34413     34962      +549     
  Branches        5729      5837      +108     
===============================================
+ Hits           23537     23903      +366     
- Misses          8653      8798      +145     
- Partials        2223      2261       +38
Impacted Files Coverage Δ Complexity Δ
...dk/variant/variantcontext/GenotypeLikelihoods.java 83.851% <100%> (-1.12%) 67 <3> (-7)
.../htsjdk/variant/utils/BinomialCoefficientUtil.java 13.131% <13.131%> (ø) 7 <7> (?)
...nt/variantcontext/GenotypeNumLikelihoodsCache.java 90.625% <90.625%> (ø) 12 <12> (?)
src/main/java/htsjdk/samtools/util/Interval.java 66.667% <0%> (-1.667%) 26% <0%> (-3%)
...rc/main/java/htsjdk/variant/vcf/VCFFileReader.java 66.176% <0%> (-0.988%) 24% <0%> (+2%)
src/main/java/htsjdk/tribble/gff/Gff3Feature.java 58.333% <0%> (ø) 7% <0%> (?)
src/main/java/htsjdk/tribble/MutableFeature.java 100% <0%> (ø) 6% <0%> (?)
src/main/java/htsjdk/tribble/gff/Gff3Codec.java 76.596% <0%> (ø) 48% <0%> (?)
... and 6 more