Closed KevinCLydon closed 4 years ago
@KevinCLydon @droazen This adds a new dependency which isn't ideal to avoid implementing 1 function which already exists in our codebase. The implementation in the apache library is also recursive so it's not clear to me that that is going to be a performance improvement.
Without profiling showing this is a meaningful improvement I don't think we should make this change. If we believe the apache implementation is better than ours maybe we can adopt their function without including the entirety of commons-math3:3.3.
@lbergelson If you examine the Apache implementation, you'll find that it makes many fewer recursive calls. The recursion in our implementation happens within a loop, and can produce a crazy number of recursive calls with realistic values for the number of alleles and ploidy.
I think we should certainly do some benchmarking to illustrate the problems with the current implementation, and to test whether the Apache implementation is actually faster.
I know that HTSJDK is opposed in general to new dependencies, but if we want to maintain a high level of code quality we really shouldn't be implementing our own subpar versions of standard mathematical functions, when high-quality and well-tested implementations exist in standard libraries, and these libraries themselves are reasonably lightweight.
@droazen It does look more efficient than our implementation. I would still rather import that 1 function than add all of commons-math3. All of htsjdk and it's dependencies right now are 4.6 mb, with math 3 it's 6.5, so it's a large proportional increase in stuff. If there was a whole bunch of stuff we wanted from it then I would say import the whole thing, but we just want 1 small function, why not just take that 1? It's not like we don't have apache licensed stuff in htsjdk already....
Commons math is essentially abandonware, so it's not like it's getting regular updates anyway.
In gatk we have such calculation effectively cached in a big bidimensional matrix although in GenotypeLikelihoodsCalculator(s)
but is a bit obfuscated because that/those classes do more things apart from that.
If this is something that is done repetively I would say that caching is the better alternative in terms of performance. I don't think in practice would use that much memory and time to construct. No apache-common dependency required.
Perhaps you would like to porr that GATK class/classes in a "decaffeinated" form into htsjdk. I created it in the first place in order to accommodate for non diploid large allele-number calculations so a deficiency in htsjdk back in the day and perhaps now as well.
I think it could benefit of some refactoring ... but part of the design idea that you get a calculator for a specific ploidy and number of alleles and then you ask that calculator several things, for example the number of genotypes is that often is the case that ofthen you are working with a specific combination of ploidy/allele-number and you are better of querying this info to a object that is efficient responding to queries related to that combination as supposed to some static object that must handle any combo.
Merging #1447 into master will decrease coverage by
0.027%
. The diff coverage is33.582%
.
@@ Coverage Diff @@
## master #1447 +/- ##
===============================================
- Coverage 68.396% 68.369% -0.027%
- Complexity 8497 8625 +128
===============================================
Files 583 591 +8
Lines 34413 34962 +549
Branches 5729 5837 +108
===============================================
+ Hits 23537 23903 +366
- Misses 8653 8798 +145
- Partials 2223 2261 +38
Impacted Files | Coverage Δ | Complexity Δ | |
---|---|---|---|
...dk/variant/variantcontext/GenotypeLikelihoods.java | 83.851% <100%> (-1.12%) |
67 <3> (-7) |
|
.../htsjdk/variant/utils/BinomialCoefficientUtil.java | 13.131% <13.131%> (ø) |
7 <7> (?) |
|
...nt/variantcontext/GenotypeNumLikelihoodsCache.java | 90.625% <90.625%> (ø) |
12 <12> (?) |
|
src/main/java/htsjdk/samtools/util/Interval.java | 66.667% <0%> (-1.667%) |
26% <0%> (-3%) |
|
...rc/main/java/htsjdk/variant/vcf/VCFFileReader.java | 66.176% <0%> (-0.988%) |
24% <0%> (+2%) |
|
src/main/java/htsjdk/tribble/gff/Gff3Feature.java | 58.333% <0%> (ø) |
7% <0%> (?) |
|
src/main/java/htsjdk/tribble/MutableFeature.java | 100% <0%> (ø) |
6% <0%> (?) |
|
src/main/java/htsjdk/tribble/gff/Gff3Codec.java | 76.596% <0%> (ø) |
48% <0%> (?) |
|
... and 6 more |
Fixes #1446
Description
Please explain the changes you made here. Explain the motivation for making this change. What existing problem does the pull request solve?
Things to think about before submitting: