samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Gff3 Tribble Codec #1439

Closed kachulis closed 4 years ago

kachulis commented 4 years ago

Adds a Gff3 tribble codec for reading gff3 files. Based on specification defined at https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md. This PR focuses on implementing the capability to read this format; it does not implement any ontological validation of feature types or allowed relationships between features of different types. It is thus only a partial implementation of the specification.

codecov-io commented 4 years ago

Codecov Report

Merging #1439 into master will increase coverage by 0.172%. The diff coverage is 76.515%.

@@               Coverage Diff               @@
##              master     #1439       +/-   ##
===============================================
+ Coverage     68.343%   68.516%   +0.172%     
- Complexity      8489      8613      +124     
===============================================
  Files            583       589        +6     
  Lines          34391     34843      +452     
  Branches        5729      5808       +79     
===============================================
+ Hits           23504     23873      +369     
- Misses          8668      8715       +47     
- Partials        2219      2255       +36
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/samtools/util/FileExtensions.java 80% <100%> (+5%) 1 <0> (ø) :arrow_down:
src/main/java/htsjdk/tribble/gff/Gff3Feature.java 58.333% <58.333%> (ø) 7 <7> (?)
src/main/java/htsjdk/tribble/gff/Gff3BaseData.java 72.917% <72.917%> (ø) 15 <15> (?)
src/main/java/htsjdk/tribble/gff/Gff3Codec.java 76.596% <76.596%> (ø) 48 <48> (?)
.../main/java/htsjdk/tribble/gff/Gff3FeatureImpl.java 78.462% <78.462%> (ø) 38 <38> (?)
...c/main/java/htsjdk/tribble/gff/SequenceRegion.java 82.353% <82.353%> (ø) 7 <7> (?)
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 80.952% <0%> (-9.524%) 3% <0%> (-1%)
...in/java/htsjdk/samtools/SAMSequenceDictionary.java 73.239% <0%> (-2.113%) 45% <0%> (-1%)
src/main/java/htsjdk/samtools/util/Interval.java 66.667% <0%> (-1.667%) 26% <0%> (-3%)
...c/main/java/htsjdk/samtools/SAMFileWriterImpl.java 80.822% <0%> (-1.123%) 25% <0%> (ø)
... and 14 more
kachulis commented 4 years ago

@lbergelson back to you. While I was addressing your comments, I realized that I was depending on parent features being before child features in the input files, which is not at all guaranteed. This is fixed now (with added tests), but means the parent maps, which used to be immutable, are now mutable, with public modifying methods.

kachulis commented 4 years ago

@lbergelson a few changes since you approved, these resulted from getting some full scale use cases working:

1) clear activeParentIDs in Gff3Codec::prepareToFlushFeatures(). This is needed to avoid memory leak and massive garbage collection problems when using (for example) a full scale human annotation file.

2) clear all maps and queues in Gff3Codec::close. This is needed to avoid parsed, but unreturned features from persisting through close, and later being returned incorrectly when the same codec is used to read from a potentially completely different stream.

3) (minor) adjusted directive regex to allow for variable number of spaces

At this point, my use cases are working, and I am content with where this is. I don't have the authorization to merge this, so feel free to go ahead and merge if you don't have any issues with the additional changes.