samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

BasicFastqWriter: no longer flushing on every write #1497

Closed d-cameron closed 4 years ago

d-cameron commented 4 years ago

Description

PrintStream.checkError() is called on every write(). Since this flushes the underlying stream, performance when writing fastq files is very poor.

This change defers the call to checkError() to flush() and close() operations.

codecov-commenter commented 4 years ago

Codecov Report

Merging #1497 into master will decrease coverage by 0.020%. The diff coverage is 100.000%.

@@               Coverage Diff               @@
##              master     #1497       +/-   ##
===============================================
- Coverage     69.279%   69.260%   -0.020%     
- Complexity      8764      8796       +32     
===============================================
  Files            590       592        +2     
  Lines          34755     34863      +108     
  Branches        5800      5814       +14     
===============================================
+ Hits           24078     24146       +68     
- Misses          8386      8416       +30     
- Partials        2291      2301       +10     
Impacted Files Coverage Δ Complexity Δ
...n/java/htsjdk/samtools/fastq/BasicFastqWriter.java 76.000% <100.000%> (+26.000%) 9.000 <2.000> (+4.000)
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 80.952% <0.000%> (-9.524%) 3.000% <0.000%> (-1.000%)
src/main/java/htsjdk/io/IOPath.java 79.310% <0.000%> (ø) 14.000% <0.000%> (?%)
src/main/java/htsjdk/io/HtsjdkPath.java 51.316% <0.000%> (ø) 15.000% <0.000%> (?%)
d-cameron commented 4 years ago

@lbergelson is this result actually a blocker for merging?

lbergelson commented 4 years ago

@d-cameron Don't worry about the coverage. I need to disable that. The coverage is extremely twitchy in it's complaints.

I'm not sure this is going to help at all because PrintWriter flushes on every write anyway. As far as I can tell it's a pathological class that we should get rid of. (Bad default flushing behavior, but even more critically it's error handling is atrocious.) If you want it to stop flushing on every output (which we do want, ) we should configure it with autoflush = off when creating it. Just removing the checkError call makes us vulnerable to errors without actually fixing the problem I think.

I would be in favor of just replacing the internal PrintStream with a different output stream that doesn't behave so badly.

d-cameron commented 4 years ago

PrintWriter flushes on every write anyway ... configure it with autoflush = off

But autoFlush=false is the default. I've added a test case to show flushing of the underlying stream is fixed with this PR.

Isn't deferring IO errors to when the underlying stream is flushed/closed normal, expected behaviour?

I would be in favor of just replacing the internal PrintStream with a different output stream that doesn't behave so badly.

That would mean changing the BasicFastqWriter API since of it's two constructors take PrintStreams which would be a breaking change (I know it'd break my GRIDSS code). Did the issue of what actually composes the public htsjdk API being undefined ever get resolved?

lbergelson commented 4 years ago

@d-cameron Oh wow, you're totally right about autoflush being off by default. I've been wrong about this for a while I guess and giving incorrect advice about it. Thanks for pointing that out. I'm not sure about deferring errors being standard practice? I would expect it to defer the error until the write actually is attempted and fails, but typically that will be happening many times as the write buffer fills up, not just on closing the writer.

Yes, that would be a breaking change unless we kept the old methods around and deprecated them. I'm not sure it's practical to change it, I just would like to. In general I treat the entire accessible java api as the public API and attempt to avoid breaking changes to any of it. Sometimes that isn't possible, but usually we deprecate things for a long period of time before changing them. There have been some major exceptions though. The cram code was dramatically altered, but we gave a lot of notice about that. I'm also more willing to break newly added API's. Things that were put in in a recent release are less likely to have a lot people using and the users are actively adopting new features so I change those more readily. Things like minor API fixes in the Gff3 classes have been breaking changes on recent code. Old public APIs I try hard to never break without long term deprecation before the break. Public methods are the most sacred. I also attempt to avoid any breaking changes to anything potentially accessible to user code. (i.e. protected fields in non-final classes, etc. )

We're planning to label things better as part of the CZI grant. I want to introduce the notion of beta/unstable api's that we are willing to break because it's extremely painful to make progress on a lot of things now, and we're very slow to add new code partially because we are extremely resistant to breaking existing code.

lbergelson commented 4 years ago

@d-cameron Thank you for fixing this and also for pointing out my misconception about PrintStream!