intermediates_to_table_indices works as follows:
It calls bits_to_table_indices, which takes three u128s, each containing the value of one of the three intermediates for 128 multiplications, and returns four u128s containing a table index in each nibble.
It then reorders those nibbles into bytes as its output. (Originally, the table lookup was done here, but additional optimization moved the table lookup elsewhere.)
It appears that bits_to_table_indices compiles to <200 instructions (fully unrolled with no loops or branches), while the rearranging of nibbles compiles to >1000 instructions (again, fully unrolled with no loops or branches). Implementing a single transpose-like operation covering both steps would probably be more efficient.
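To make the data flow of the two steps concrete, here is a minimal loop-based reference sketch in Rust. It is not the actual implementation, which is described as fully unrolled and branch-free. The function names (`bits_to_table_indices_ref`, `nibbles_to_bytes_ref`, `fused_ref`), the bit order within each index (first intermediate in the lowest bit), and the nibble layout (multiplication `i` in nibble `i % 32` of word `i / 32`) are all assumptions made for illustration.

```rust
/// Step 1 (reference version): pack one 3-bit table index per nibble.
/// ASSUMPTION: bit i of a, b, c holds the three intermediates of
/// multiplication i, and its index goes to nibble i % 32 of word i / 32.
fn bits_to_table_indices_ref(a: u128, b: u128, c: u128) -> [u128; 4] {
    let mut out = [0u128; 4];
    for i in 0..128 {
        // ASSUMPTION: a is the low bit of the index, c the high bit.
        let idx = ((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2);
        out[i / 32] |= idx << (4 * (i % 32));
    }
    out
}

/// Step 2 (reference version): spread each nibble into its own byte.
fn nibbles_to_bytes_ref(words: [u128; 4]) -> [u8; 128] {
    let mut bytes = [0u8; 128];
    for i in 0..128 {
        bytes[i] = ((words[i / 32] >> (4 * (i % 32))) & 0xF) as u8;
    }
    bytes
}

/// Both steps fused: go straight from the three bit-planes to bytes,
/// skipping the intermediate nibble-packed representation.
fn fused_ref(a: u128, b: u128, c: u128) -> [u8; 128] {
    let mut bytes = [0u8; 128];
    for i in 0..128 {
        bytes[i] =
            (((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2)) as u8;
    }
    bytes
}
```

Under these assumptions, `fused_ref(a, b, c)` returns the same bytes as `nibbles_to_bytes_ref(bits_to_table_indices_ref(a, b, c))`. A reference like this could serve as a test oracle when trying a combined transpose-like routine; it says nothing about how many instructions an optimized version would need.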