microsoft / lepton_jpeg_rust

Port of DropBox Lepton compression to Rust
Apache License 2.0

use rotation to remove branching in inner loop #63

Closed mcroomp closed 5 months ago

mcroomp commented 5 months ago

Felt inspired for some reason. Who knew that there was another 20% of optimization left in the inner loop, @danielrh?

This change removes 2 branches from the inner loop:

danielrh commented 5 months ago

Nice find! It's amazing how adding a few computations before the branch is taken can allow the processor to do things while the PC finds its new home :-)

mcroomp commented 5 months ago

> Nice find! It's amazing how adding a few computations before the branch is taken can allow the processor to do things while the PC finds its new home :-)

Thanks! Effectively there is now only one branch left in the inner loop, taken when the counters need to be normalized, and branch prediction handles it well since it happens infrequently. Avoiding the 50%-probability jump was the biggest gain.

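As a rough sketch of the general idea (an assumed counter layout and width, not the PR's actual code), a rotate can select which packed counter to bump without a data-dependent jump:

```rust
/// Two 16-bit counts packed into one u32: the count for bit == 0 in the low
/// half, the count for bit == 1 in the high half (assumed layout, for
/// illustration only; the real model state is more involved).
fn update_counts(packed: u32, bit: bool) -> u32 {
    // Rotate the counter that needs updating into the low half, bump it,
    // then rotate back. The only branch left in the real inner loop is the
    // infrequent normalization check mentioned above.
    let r = (bit as u32) * 16;
    packed.rotate_right(r).wrapping_add(1).rotate_left(r)
}
```

Because the rotate amount is computed arithmetically from `bit`, the compiler can emit this as straight-line code instead of a 50/50-predicted jump.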

Melirius commented 5 months ago

Really nice! I tried to improve this part by interleaving the counter bits, which also improves LUT cache locality, but it was too much work per update.

mcroomp commented 5 months ago

> Really nice! I tried to improve this part by interleaving the counter bits, which also improves LUT cache locality, but it was too much work per update.

From the profiling, most of the memory latency is coming from the Model tables. One idea would be to have a much smaller table that contains most of the common coefficients/numzeros/etc, or maybe even a hashtable to use memory more efficiently.

Melirius commented 5 months ago

> Really nice! I tried to improve this part by interleaving the counter bits, which also improves LUT cache locality, but it was too much work per update.
>
> From the profiling, most of the memory latency is coming from the Model tables. One idea would be to have a much smaller table that contains most of the common coefficients/numzeros/etc, or maybe even a hashtable to use memory more efficiently.

A hashtable does not help; I tried it. But some of the values in the original Lepton tables are never used, so the tables can be made more compact. For example:

- `num_non_zeros_to_bin` has an output range of [0, 9] inclusive, so the first dimension of `num_non_zeros_counts7x7` can be shrunk accordingly (note also that the maximum predictor for non-zeros is not 49 but 25, so the used range is actually [0, 8]).
- The 0th bin of `exponent_counts` is not used: nothing is decoded if no 7x7 coefficients are present.
- `residual_noise_counts` can be split into two tables, one for 7x7 and one for edge coefficients, each with lower dimensions.
- The `get_grid` function arrays can be compacted with a trick: they are accessed only at [0][0], then [1][0-1], then [2][0-3], ..., [n][0-(2^n-1)], the second index being the n-bit `decoded_so_far` value. This can be converted to a one-dimensional array with (n+1)-bit indices: start with `decoded_so_far = 1` and then do `decoded_so_far <<= 1; decoded_so_far |= cur_bit as usize;`. The final value is obtained simply by zeroing the leading bit of `decoded_so_far`.
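A minimal sketch of that flattened indexing (the function name and the bit-decoding callback are illustrative, not the crate's actual API):

```rust
/// Decode an n-bit value while indexing a flattened, one-dimensional
/// probability table: row k, column c of the original 2-D grid maps to flat
/// index (1 << k) | c, which is exactly the running `decoded_so_far` value
/// with its leading marker bit still set.
fn decode_n_bits(n: u32, mut decode_bit_at: impl FnMut(usize) -> bool) -> usize {
    let mut decoded_so_far: usize = 1; // marker bit only: row 0, column 0
    for _ in 0..n {
        // decode one bit using the table entry at the current flat index
        let cur_bit = decode_bit_at(decoded_so_far);
        decoded_so_far <<= 1;
        decoded_so_far |= cur_bit as usize;
    }
    // zero the leading marker bit to recover the decoded n-bit value
    decoded_so_far & !(1usize << n)
}
```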

mcroomp commented 5 months ago

@Melirius for the lookup of the 64K probability table, it would be nice to reduce it so that it fits in 32K in most cases. The usage is a bit uneven, so maybe a minor transform before the lookup (if it is a lightweight bit operation) would help with cache-line utilization.

The graph below shows the number of times each value in the probability table is looked up (red the most, green the least):

[heatmap: per-entry lookup counts for the probability table]

Melirius commented 5 months ago

@mcroomp Yes, I've thought about it, but as I said before, interleaving does not help: the counter update becomes cumbersome (2 mask shifts, 3 ORs, 1 AND, and 1 increment in the general case, and roughly the same number of ops for the overflow case) and slow. Most of the time the next value is taken from the same cache line (the next one in the row) or from the next line of the LUT (the next one in the column), so I also tried manually prefetching that lower line; the results were inconclusive on x64 and really bad on ARM.
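For concreteness, here is roughly what one update costs under a simple even/odd interleave of two counters (an assumed scheme for illustration, not necessarily the one tried here):

```rust
// Assumed layout: counter A in the even bit positions, counter B in the odd
// bit positions of a single u16.
const EVEN_BITS: u16 = 0x5555;
const ODD_BITS: u16 = 0xAAAA;

/// Increment counter A (even bit positions) without disturbing counter B.
fn inc_even(packed: u16) -> u16 {
    // Force the odd bits to 1 so the carry ripples straight across them,
    // add 1, keep the updated even bits, then restore B's untouched bits.
    (((packed | ODD_BITS).wrapping_add(1)) & EVEN_BITS) | (packed & ODD_BITS)
}
```

Even this simplified form needs several mask/merge operations per update, before any overflow handling, which matches the "too much work per update" conclusion.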

The pattern is understandable: after the first overflow, at least one of the counters is never less than 129, so the top-left quadrant is never referenced again. The red patch at the corner comes from rarely accessed branches that do not accumulate many hits. The tantalizing thing is that the differences between adjacent values in these 3/4 of the square are only in [-2, +2] inclusive, but I cannot figure out how to use that to compact the LUT.

Melirius commented 5 months ago

The problem here is that these functions are queried for every decoded bit of the file, so every additional operation substantially increases the total instruction count. For example, I also tried the "division-by-multiplication" approach, where a 512-element LUT of 32-bit values is enough (actually even fewer entries: only frequency sums in [2, 510] inclusive are needed); cache misses then drop from ~1.6% to ~0.5%, but the number of instructions executed increases by ~6% and performance drops by ~3%.
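For reference, a minimal sketch of the division-by-multiplication idea as described (illustrative names, not the crate's code):

```rust
/// 512-entry table of 32-bit fixed-point reciprocals: entry d holds
/// ceil(2^32 / d); entries 0 and 1 are unused placeholders.
fn build_recip_lut() -> Vec<u32> {
    (0u64..512)
        .map(|d| if d >= 2 { (((1u64 << 32) + d - 1) / d) as u32 } else { 0 })
        .collect()
}

/// floor(num / den) via a multiply and shift instead of a hardware divide.
/// Exact as long as num * (den - 1) < 2^32, which easily holds for the small
/// frequency sums involved here.
fn div_by_mul(num: u32, den: u16, lut: &[u32]) -> u32 {
    ((num as u64 * lut[den as usize] as u64) >> 32) as u32
}
```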

mcroomp commented 5 months ago

One other idea was to try to improve the unary read function (since this is one of the main costs): it only reads 1s until it exits the loop, so as soon as we see that the next bit is going to be a 1, we can in theory already start doing the work for the following bit, if we can get the CPU to interleave the instructions. My main problem has been that the compiler is being too smart and optimizing away my optimizations…
