samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Gff3BaseData precompute hash and Gff3Codec decodeShallow method #1480

Closed kachulis closed 4 years ago

kachulis commented 4 years ago

Four basic changes: 1) Since Gff3BaseData hash is based only on final fields, compute hash once when object is initialized. This speeds up any code which uses Gff3Feature or Gff3BaseData as a key in a HashMap.

2) Make Gff3BaseData constructor public, (previously package-private) fields private, and add public getter methods to access those fields.

3) Add decodeShallow method, which returns only a flat Gff3Feature, which is NOT linked to parent/child/cofeatures. While this method should not be used for general analysis, it is necessary to preprocess (specifically to add flush directives to) large gff3 files which do not already have flush directives. Otherwise, the standard decode method will lead to the entire file being held in memory, and GC overhead problems.

4) Adds score field to Gff3Feature. This column is included in the spec, was somehow left out of the original Gff3 commit.

codecov-io commented 4 years ago

Codecov Report

Merging #1480 into master will increase coverage by 0.863%. The diff coverage is 69.149%.

@@               Coverage Diff               @@
##              master     #1480       +/-   ##
===============================================
+ Coverage     68.360%   69.223%   +0.863%     
- Complexity      8624      8717       +93     
===============================================
  Files            591       588        -3     
  Lines          34962     34607      -355     
  Branches        5837      5784       -53     
===============================================
+ Hits           23900     23956       +56     
+ Misses          8800      8368      -432     
- Partials        2262      2283       +21     
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/tribble/gff/Gff3Feature.java 53.846% <53.846%> (-4.487%) 7.000 <7.000> (ø)
src/main/java/htsjdk/tribble/gff/Gff3Codec.java 76.316% <63.636%> (-0.280%) 50.000 <3.000> (+2.000) :arrow_down:
src/main/java/htsjdk/tribble/gff/Gff3BaseData.java 77.941% <72.917%> (+5.025%) 28.000 <17.000> (+13.000)
.../main/java/htsjdk/tribble/gff/Gff3FeatureImpl.java 78.947% <81.818%> (+0.486%) 38.000 <2.000> (ø)
...n/java/htsjdk/samtools/RandomAccessFileBuffer.java 0.000% <0.000%> (-71.667%) 0.000% <0.000%> (-10.000%)
src/main/java/htsjdk/samtools/CRAMBAIIndexer.java 71.711% <0.000%> (-9.727%) 12.000% <0.000%> (-5.000%)
...m/encoding/core/SubexponentialIntegerEncoding.java 85.714% <0.000%> (-8.730%) 5.000% <0.000%> (+1.000%) :arrow_down:
...c/main/java/htsjdk/samtools/cram/build/CramIO.java 67.347% <0.000%> (-7.996%) 8.000% <0.000%> (-12.000%)
src/main/java/htsjdk/samtools/CRAMCRAIIndexer.java 74.510% <0.000%> (-7.308%) 10.000% <0.000%> (ø%)
...tools/cram/encoding/core/GammaIntegerEncoding.java 93.333% <0.000%> (-6.667%) 5.000% <0.000%> (+1.000%) :arrow_down:
... and 159 more
codecov-commenter commented 4 years ago

Codecov Report

Merging #1480 into master will increase coverage by 0.882%. The diff coverage is 74.312%.

@@               Coverage Diff               @@
##              master     #1480       +/-   ##
===============================================
+ Coverage     68.360%   69.242%   +0.882%     
- Complexity      8624      8722       +98     
===============================================
  Files            591       588        -3     
  Lines          34962     34615      -347     
  Branches        5837      5787       -50     
===============================================
+ Hits           23900     23968       +68     
+ Misses          8800      8364      -436     
- Partials        2262      2283       +21     
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/tribble/gff/Gff3Feature.java 53.846% <53.846%> (-4.487%) 7.000 <7.000> (ø)
src/main/java/htsjdk/tribble/gff/Gff3BaseData.java 77.941% <73.469%> (+5.025%) 28.000 <17.000> (+13.000)
src/main/java/htsjdk/tribble/gff/Gff3Codec.java 80.000% <80.556%> (+3.404%) 54.000 <24.000> (+6.000)
.../main/java/htsjdk/tribble/gff/Gff3FeatureImpl.java 78.947% <81.818%> (+0.486%) 38.000 <2.000> (ø)
...n/java/htsjdk/samtools/RandomAccessFileBuffer.java 0.000% <0.000%> (-71.667%) 0.000% <0.000%> (-10.000%)
src/main/java/htsjdk/samtools/CRAMBAIIndexer.java 71.711% <0.000%> (-9.727%) 12.000% <0.000%> (-5.000%)
...m/encoding/core/SubexponentialIntegerEncoding.java 85.714% <0.000%> (-8.730%) 5.000% <0.000%> (+1.000%) :arrow_down:
...c/main/java/htsjdk/samtools/cram/build/CramIO.java 67.347% <0.000%> (-7.996%) 8.000% <0.000%> (-12.000%)
src/main/java/htsjdk/samtools/CRAMCRAIIndexer.java 74.510% <0.000%> (-7.308%) 10.000% <0.000%> (ø%)
...tools/cram/encoding/core/GammaIntegerEncoding.java 93.333% <0.000%> (-6.667%) 5.000% <0.000%> (+1.000%) :arrow_down:
... and 161 more
kachulis commented 4 years ago

@lbergelson thanks for the review! I think I have addressed all of you comments. Also made a bit of a change to how the shallow decoding is implemented, instead of having a 'decodeShallow' method, the class can now be either in shallow or deep (default) mode, and the decode method returns either a shallow feature or a fully linked feature accordingly. The point is that this way I can still use the AbstractFeatureReader.getFeatureReader method for reading in features shallowly, instead of having to duplicate the code from the different FeatureReaders.

Just a heads up, there's actually 2 new commits 38a38bb as well as 4ae8ff0. The first one I just made before your comments posted, so it's gotten buried back at the top of the thread.

back to you!

lbergelson commented 4 years ago

@kachulis I think making it a mode in the decoder is an improvement! Good idea.