The SWAR code now operates on one host-CPU register at a time, as intended.
Note that this might not actually be faster on 32-bit targets. I would have to benchmark it, but in some cases 4 memory reads / lookup-table reads might be faster than block-wide operations.
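
For readers unfamiliar with the idea, here is a minimal sketch of what "one register at a time" can look like. It is not the code from this change: the function name `all_ascii_swar` and the ASCII check are hypothetical, chosen only to illustrate the technique, and `uintptr_t` is assumed to match the native register width (4 bytes on typical 32-bit targets, 8 bytes on typical 64-bit ones).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration, not the code from this change.
   Returns 1 if every byte in buf is ASCII (high bit clear), else 0. */
static int all_ascii_swar(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* Assumption: uintptr_t is as wide as one host register. */
    /* Builds 0x80 repeated in every byte lane: 0x8080...80.  */
    const uintptr_t high_bits = (uintptr_t)-1 / 0xFF * 0x80;
    size_t i = 0;

    /* Word-wide loop: tests sizeof(uintptr_t) bytes per iteration. */
    for (; i + sizeof(uintptr_t) <= len; i += sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, buf + i, sizeof word); /* alignment-safe load */
        if (word & high_bits)
            return 0;
    }

    /* Scalar tail for the bytes that do not fill a whole word. */
    for (; i < len; i++) {
        if (buf[i] & 0x80)
            return 0;
    }
    return 1;
}
```

On a 32-bit host the word-wide loop covers only 4 bytes per iteration, which is why plain byte reads or table lookups may turn out competitive there; only a benchmark would settle it.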