There is a TODO in next_interlaced_row asking to "change the interface of next_interlaced_row to take an output buffer instead of making us return a reference to a buffer that we own". I very much agree with this TODO - it seems that it would be best to output directly to the final buffer (as the next_frame API does) rather than forcing the caller to copy the bytes. I assume that outputting directly to the final buffer would be good for:
Reducing the number of memcpy-like calls
Reducing the number of L1 cache misses
Reducing the memory pressure overall
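To make the extra copy concrete, here is roughly what a caller of the current reference-returning API is forced to do. The types below are simplified stand-ins for the png crate's real ones (the real next_row also returns a Result), not actual crate code:

```rust
// Stand-in types, simplified to illustrate only the forced copy.
struct Row<'data> {
    data: &'data [u8],
}

struct Reader {
    // The decoder owns this buffer and hands out references into it.
    row_buffer: Vec<u8>,
}

impl Reader {
    fn next_row(&mut self) -> Option<Row<'_>> {
        // (actual decoding elided)
        Some(Row { data: &self.row_buffer })
    }
}

fn main() {
    let mut reader = Reader { row_buffer: vec![1, 2, 3, 4] };
    let mut final_buffer = vec![0u8; 4];
    if let Some(row) = reader.next_row() {
        // The caller must do a second, memcpy-like copy here; with a
        // `buf: &mut [u8]` parameter the decoder could instead write
        // straight into final_buffer.
        final_buffer.copy_from_slice(row.data);
    }
    println!("{:?}", final_buffer);
}
```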
FWIW, the performance considerations above mostly do not affect the next_frame API (which calls into lower-level functions like next_interlaced_row_impl for non-interlaced images) and therefore mostly do not affect the png crate's benchmarks. OTOH, users of the png crate who wish to post-process the output (e.g. to transform RGB into RGBA, or alpha-multiply) may wish to do such post-processing row-by-row (while the freshly decoded row is still hot in the L1 cache). More specifically, the performance considerations apply to:
image::codecs::png::PngReader::read (which calls next_row)
The current prototype integrating the png crate into Chromium (currently built on top of the image crate, but working directly with the png crate also wouldn't help because of the current shape of the next_row API)
So (given the presence of the TODO + the performance benefits), should I just go ahead and make a breaking change to the png::Reader::next_row and png::Reader::next_interlaced_row APIs?
That sounds good to me. If we're doing a breaking change, it might make sense to drop next_row entirely given that it is just a very thin wrapper over next_interlaced_row:
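If next_row is kept, the "very thin wrapper" point can be sketched as follows. Everything here is a self-contained mock (the InterlaceInfo field name and the decode logic are assumptions, not the png crate's actual implementation), just to show that next_row reduces to one line over next_interlaced_row:

```rust
#[derive(Debug)]
struct DecodingError;

#[derive(Debug, PartialEq)]
struct InterlaceInfo {
    // Hypothetical field: number of bytes written into the caller's buffer.
    line_bytes: usize,
}

struct Reader {
    rows_left: usize,
}

impl Reader {
    // Proposed shape: the caller supplies the output buffer.
    fn next_interlaced_row(&mut self, buf: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError> {
        if self.rows_left == 0 {
            return Ok(None);
        }
        if buf.len() < 4 {
            return Err(DecodingError); // buf too small for the next row
        }
        buf[..4].copy_from_slice(&[1, 2, 3, 4]); // pretend-decoded pixels
        self.rows_left -= 1;
        Ok(Some(InterlaceInfo { line_bytes: 4 }))
    }

    // next_row is then a one-liner that discards the interlace info.
    fn next_row(&mut self, buf: &mut [u8]) -> Result<Option<usize>, DecodingError> {
        Ok(self.next_interlaced_row(buf)?.map(|info| info.line_bytes))
    }
}

fn main() {
    let mut reader = Reader { rows_left: 2 };
    let mut buf = [0u8; 4];
    while let Some(n) = reader.next_row(&mut buf).unwrap() {
        println!("decoded {} bytes: {:?}", n, &buf[..n]);
    }
}
```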
Concretely, I was thinking about the following changes:
Bump the version in Cargo.toml (this would be a breaking change)
Remove struct Row<'data> and struct InterlacedRow<'data>
Change fn next_row:
from: pub fn next_row(&mut self) -> Result<Option<Row>, DecodingError>
to: pub fn next_row(&mut self, buf: &mut [u8]) -> Result<Option<usize>, DecodingError>, documenting that:
an error is returned if buf is too small for the next row
None is returned if there is no next row
Change fn next_interlaced_row:
from: pub fn next_interlaced_row(&mut self) -> Result<Option<InterlacedRow>, DecodingError>
to: pub fn next_interlaced_row(&mut self, buf: &mut [u8]) -> Result<Option<InterlaceInfo>, DecodingError>, documenting that:
an error is returned if buf is too small for the next interlaced row
None is returned if there is no next row
WDYT? Are there some alternative API designs that we should consider first?
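For context on the row-by-row post-processing use case mentioned above, here is the kind of transform a caller might run on each freshly decoded row while it is still L1-hot (rgb_to_rgba is a hypothetical caller-side helper, not a png crate API):

```rust
// Expand an RGB row into an RGBA row, filling in opaque alpha.
fn rgb_to_rgba(rgb_row: &[u8], rgba_row: &mut [u8]) {
    for (src, dst) in rgb_row.chunks_exact(3).zip(rgba_row.chunks_exact_mut(4)) {
        dst[..3].copy_from_slice(src);
        dst[3] = 0xFF; // opaque alpha
    }
}

fn main() {
    // Pretend this row was just written into our buffer by
    // next_interlaced_row(&mut rgb_row) under the proposed API:
    let rgb_row = [10u8, 20, 30, 40, 50, 60];
    let mut rgba_row = [0u8; 8];
    // Transform immediately, while the row is still hot in the cache:
    rgb_to_rgba(&rgb_row, &mut rgba_row);
    println!("{:?}", rgba_row);
}
```

With the current reference-returning API, the decoded row lives in the decoder's own buffer, so this transform happens only after an extra copy; with a caller-provided buf, the copy disappears.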