Closed Shnatsel closed 3 years ago
In parallel mode this speeds up decoding from 320ms to 300ms, or by about 6%. There is no change when rayon is not enabled, because the IDCT worker thread is not on the critical path then.
Tested on https://commons.wikimedia.org/wiki/File:Sun_over_Lake_Hawea,_New_Zealand.jpg
I wonder how complex it would be to funnel through the precise type &[i16; 64]?
Never mind, that obviously requires bytemuck or probably const generics.
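For illustration, here is a rough sketch of what funnelling the precise type through might look like on a recent Rust toolchain, where the standard library's slice-to-array `TryFrom` conversion covers arrays of length 64. The functions `sum_block` and `sum_all_blocks` are hypothetical stand-ins, not code from jpeg-decoder.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a routine that wants exactly one 8x8 block.
fn sum_block(block: &[i16; 64]) -> i32 {
    // The length is part of the type, so no runtime length check is needed here.
    block.iter().map(|&c| i32::from(c)).sum()
}

/// Splits a coefficient buffer into 64-element blocks and passes each one
/// along as `&[i16; 64]` instead of as an unsized `&[i16]`.
fn sum_all_blocks(coefficients: &[i16]) -> i32 {
    coefficients
        .chunks_exact(64) // trailing elements that do not fill a block are skipped
        .map(|chunk| {
            // `chunks_exact(64)` only yields 64-element slices, so this cannot fail.
            let block = <&[i16; 64]>::try_from(chunk).expect("chunk has 64 elements");
            sum_block(block)
        })
        .sum()
}
```

Calling `sum_all_blocks(&vec![1i16; 128])` processes two full blocks. The conversion happens once per block at the boundary, and everything downstream can take the fixed-size type.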
A lot of precise types can and should be funnelled through in theory. That match on a usize is gross, instead of panicking here it should just be an enum with custom discriminants.
There are probably a few other types that are not properly newtype wrapped or should be enums, yeah.
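As a sketch of the enum suggestion, assuming the `usize` in question only ever takes a handful of valid values: the enum `Scale`, its variants, and `output_block_len` are hypothetical names chosen for illustration and are not taken from jpeg-decoder.

```rust
use std::convert::TryFrom;

/// Hypothetical enum with custom discriminants; names and values are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(usize)]
enum Scale {
    One = 1,
    Two = 2,
    Four = 4,
    Eight = 8,
}

impl TryFrom<usize> for Scale {
    type Error = usize;

    /// Validates the raw value once, at the boundary, returning the
    /// rejected value as the error instead of panicking.
    fn try_from(value: usize) -> Result<Self, Self::Error> {
        match value {
            1 => Ok(Scale::One),
            2 => Ok(Scale::Two),
            4 => Ok(Scale::Four),
            8 => Ok(Scale::Eight),
            other => Err(other),
        }
    }
}

/// Number of samples in a square output block for the given scale.
fn output_block_len(scale: Scale) -> usize {
    // The match is exhaustive, so no `_ => panic!()` arm is needed.
    match scale {
        Scale::One => 1,
        Scale::Two => 4,
        Scale::Four => 16,
        Scale::Eight => 64,
    }
}
```

With this shape, `Scale::try_from(3)` returns `Err(3)` where the input is parsed, and `Scale::Eight as usize` still recovers the raw value `8` wherever a number is needed.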
Convert a debug assert into an actual assert to use as an optimizer hint. Eliminates a bunch of bounds checks.
It reduces the number of assembly instructions, but I did not see any measurable performance improvement on my machine.
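The general technique can be sketched as follows. This is a minimal, hypothetical example and not the code changed in this PR: `add_rows` and its parameters are made up for illustration, and whether the compiler actually drops the checks depends on the optimization level and compiler version.

```rust
/// Hypothetical function: adds the first `len` bytes of `src` into `dst`.
fn add_rows(dst: &mut [u8], src: &[u8], len: usize) {
    // Unlike `debug_assert!`, a plain `assert!` is kept in release builds,
    // so the optimizer can rely on it: past this line both slices are
    // known to hold at least `len` elements.
    assert!(dst.len() >= len && src.len() >= len);

    for i in 0..len {
        // With the lengths established up front, the per-iteration bounds
        // checks on these index operations can be optimized away.
        dst[i] = dst[i].wrapping_add(src[i]);
    }
}
```

The single up-front check replaces one potential panic site per index operation, which is where a reduced instruction count would come from.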
Instruction counts for jpeg_decoder::worker::immediate::ImmediateWorker::append_row_immediate, where this gets inlined: before - 1420, after - 1363.
Top 10 instructions with counts, before:
Instruction counts after: