Well, the intuitive answer is that this is the canonical prefix-xor operation based on the definition of what you're trying to achieve:
z ^= z << 1;
z ^= z << 2;
z ^= z << 4;
z ^= z << 8;
z ^= z << 16;
z ^= z << 32;
Using a carryless multiply is a clever trick for accomplishing the same task on architectures that support that instruction, because a carryless multiply by -1 effectively does what the code above does. Hence, I think it stands to reason that software emulation of carryless multiply, used in turn to emulate the code above, is probably doomed to be slower. But let's dig into the carryless multiply code.
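Before digging in, here's a quick sanity check of that claim: a sketch (my own function names; it assumes an older Zig where std.rand.DefaultPrng is available) comparing a naive bit-at-a-time carryless multiply by an all-ones u64, keeping only the low 64 bits of the product, against the shift ladder above.
const std = @import("std");

// Sketch only: a naive bit-at-a-time carryless multiply of `a` by an all-ones
// u64, keeping just the low 64 bits of the product. Every shifted copy of `a`
// gets XORed in, so bit i of the result is the XOR of bits 0..i of `a`,
// which is exactly the prefix XOR.
fn clmulByAllOnesSoft(a: u64) u64 {
    var acc: u64 = 0;
    var i: u6 = 0;
    while (true) : (i += 1) {
        acc ^= a << i;
        if (i == 63) break;
    }
    return acc;
}

// The direct shift ladder from above, wrapped in a function.
fn prefixXor(input: u64) u64 {
    var z = input;
    z ^= z << 1;
    z ^= z << 2;
    z ^= z << 4;
    z ^= z << 8;
    z ^= z << 16;
    z ^= z << 32;
    return z;
}

test "carryless multiply by all-ones equals prefix xor" {
    var prng = std.rand.DefaultPrng.init(0);
    var n: usize = 0;
    while (n < 1000) : (n += 1) {
        const v = prng.random().int(u64);
        try std.testing.expectEqual(prefixXor(v), clmulByAllOnesSoft(v));
    }
}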
First we have
const x = @truncate(u64, if (half == .hi or half == .hi_lo) x_ >> 64 else x_);
const y = @truncate(u64, if (half == .hi) y_ >> 64 else y_);
I'm going to assume that can be optimized out, since x is defined by stuffing a number into the upper 64 bits, and y is always -1.
Next we have
const x0 = x & 0x1111111111111110;
const x1 = x & 0x2222222222222220;
const x2 = x & 0x4444444444444440;
const x3 = x & 0x8888888888888880;
const y0 = y & 0x1111111111111111;
const y1 = y & 0x2222222222222222;
const y2 = y & 0x4444444444444444;
const y3 = y & 0x8888888888888888;
So here we have to load 8 constants into memory and perform 8 ANDs. However, since y is -1, we can do half of the ANDs at comptime. Assuming each constant can be loaded in 1 cycle, at least 4 of them per cycle, and that you can do 4 ANDs per cycle, this would take 2 cycles. Some machines might take a little longer to load 64-bit constants, so this might take 3 cycles.
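Concretely, the y == -1 specialization could look something like this (a sketch, not the patch itself):
// Sketch: with y fixed at -1 (all ones), the y-side ANDs are no-ops, so only
// the four x-side masks remain as runtime work.
const x0 = x & 0x1111111111111110;
const x1 = x & 0x2222222222222220;
const x2 = x & 0x4444444444444440;
const x3 = x & 0x8888888888888880;
// y0..y3 reduce to comptime constants:
const y0: u64 = 0x1111111111111111;
const y1: u64 = 0x2222222222222222;
const y2: u64 = 0x4444444444444444;
const y3: u64 = 0x8888888888888888;
The multiply-and-XOR groups come next: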
const z0 = (x0 * @as(u128, y0)) ^ (x1 * @as(u128, y3)) ^ (x2 * @as(u128, y2)) ^ (x3 * @as(u128, y1));
const z1 = (x0 * @as(u128, y1)) ^ (x1 * @as(u128, y0)) ^ (x2 * @as(u128, y3)) ^ (x3 * @as(u128, y2));
const z2 = (x0 * @as(u128, y2)) ^ (x1 * @as(u128, y1)) ^ (x2 * @as(u128, y0)) ^ (x3 * @as(u128, y3));
const z3 = (x0 * @as(u128, y3)) ^ (x1 * @as(u128, y2)) ^ (x2 * @as(u128, y1)) ^ (x3 * @as(u128, y0));
Now we have 4 sets of 4 multiplications that get XORed together. With infinitely wide hardware, you could do all these multiplications simultaneously, then XOR the first two products in each set while doing the same for the last two, and the cycle after that XOR the two answers from the previous step. That means you'd have 2 cycles of XORing after the multiplication cycles, which typically have a latency of 3 cycles, so you'd need a minimum of 5 cycles to compute this. On a computer where you can only start 1 multiplication per cycle, though, you'd have to wait until the 16th cycle to begin the last multiplication; another 3 cycles to complete it brings you to 19, plus 1 cycle for the final XOR makes 20 cycles. If you could start 2 multiplications per cycle, it would take 12 cycles; with 4 multiplications per cycle, 8 cycles. Plus whatever gets wasted on register juggling. Given that we are emulating carryless multiply in software, we're probably not running this code on a supercomputer.
const x0_mask = @as(u64, 0) -% (x & 1);
const x1_mask = @as(u64, 0) -% ((x >> 1) & 1);
const x2_mask = @as(u64, 0) -% ((x >> 2) & 1);
const x3_mask = @as(u64, 0) -% ((x >> 3) & 1);
These operations can also all be performed in parallel, and as written would take at least 3 cycles. However, they can be optimized by shifting the desired bit to the most significant bit and then doing a signed (arithmetic) right shift by 63, so the optimized version on a 4-wide machine would take 2 cycles. Because the only data required for this snippet is x, we could get this done during the multiplication phase in the previous snippet. 0 added cycles.
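The optimized form described there might look something like this (a sketch, using the same old-style two-argument builtins as the surrounding code):
// Sketch: shift the bit of interest up to bit 63, then arithmetic-shift it
// back down so it fills the register (all ones if the bit was set, else zero).
const x0_mask = @bitCast(u64, @bitCast(i64, x << 63) >> 63);
const x1_mask = @bitCast(u64, @bitCast(i64, x << 62) >> 63);
const x2_mask = @bitCast(u64, @bitCast(i64, x << 61) >> 63);
const x3_mask = @bitCast(u64, @bitCast(i64, x << 60) >> 63);
The fallback then computes the extra term: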
const extra = (x0_mask & y) ^ (@as(u128, x1_mask & y) << 1) ^
(@as(u128, x2_mask & y) << 2) ^ (@as(u128, x3_mask & y) << 3);
Here we have 4 ANDs that can be eliminated at comptime because y is -1. We can do the 3 shifts in parallel on a typical 4-wide machine, and then we have two cycles of XOR. That's 3 cycles; however, we have no data dependencies on the z variables, so we should be able to overlap this with the multiplications. 0 added cycles.
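With those ANDs folded away, extra reduces to something like this (again a sketch):
// Sketch: y is all ones, so each x_mask & y is just x_mask.
const extra = @as(u128, x0_mask) ^ (@as(u128, x1_mask) << 1) ^
    (@as(u128, x2_mask) << 2) ^ (@as(u128, x3_mask) << 3);
Finally, the fallback combines everything: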
return (z0 & 0x11111111111111111111111111111111) ^
(z1 & 0x22222222222222222222222222222222) ^
(z2 & 0x44444444444444444444444444444444) ^
(z3 & 0x88888888888888888888888888888888) ^ extra;
We could hypothetically do most of this work while waiting for the multiplications, reducing to (z3 & constant) ^ other_stuff_xored. Adds probably 2 cycles.
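That rearrangement would look roughly like this (a sketch of the scheduling idea, not what the compiler necessarily emits):
// Sketch: everything that doesn't depend on z3 (the last multiply group to
// finish) is folded together early, leaving one AND and one XOR at the end.
const other_stuff_xored = (z0 & 0x11111111111111111111111111111111) ^
    (z1 & 0x22222222222222222222222222222222) ^
    (z2 & 0x44444444444444444444444444444444) ^ extra;
return (z3 & 0x88888888888888888888888888888888) ^ other_stuff_xored;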
Put together, assuming 128-bit operations take the same amount of time as 64-bit ops, constants can be loaded in a cycle, register juggling doesn't add any extra cycles, and we can start 1 multiplication per cycle and do 4 bitwise ops per cycle, it would take ~24 cycles with optimal scheduling. If you could start 2 multiplications per cycle, maybe ~16. Maybe you could knock that down to ~12 with 4 multiplications starting per cycle, but that's not what my computer can do, I'll tell you that right now.
Contrast that with the direct approach, which trivially takes 12 cycles to compute (and uses a lot less electricity).
Now maybe if you were running this on truly ancient hardware without a barrel shifter, where shifts were slow and multiplies were fast, it might be a genuine competition for which method is less sluggish. Likewise, if you had god-tier hardware that for some reason didn't have a carryless multiply, maybe you could emulate it in less than 12 cycles. But why would you have god-tier hardware without carryless multiply?
In conclusion, I think the direct approach is probably better than emulating carryless multiply to emulate prefix xor.
Thank you for the detailed analysis! It's certainly a lot more than I was expecting. And forgive me for requesting a benchmark, as that code path (fallback) currently isn't implemented elsewhere. Previously, I saw a PR without much of an explanation and didn't look at it closely.
I think you make a very compelling case for the patch, which greatly simplifies the prefix_xor logic (+38 −118). So I'm happy to accept it. Perhaps a similar PR would be welcomed here, where this code originated?
Thanks so much! :pray:
This PR is not relevant to ghash because ghash is doing carryless multiplies by numbers that are not always -1.
Please explain the benefit and rationale of this patch, and provide a benchmark showing before and after. Here's one way to do a quick and dirty benchmark:
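(A minimal sketch of such a loop; prefixXor here is just the shift ladder from above, and the iteration count is arbitrary.)
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();
    var acc: u64 = 1;
    var i: u64 = 0;
    while (i < 100_000_000) : (i += 1) {
        // Feed the previous result back in so the calls stay on a dependency
        // chain and can't be optimized away or overlapped unrealistically.
        acc = prefixXor(acc +% i);
    }
    const elapsed_ns = timer.read();
    std.debug.print("acc={x} elapsed={d}ns\n", .{ acc, elapsed_ns });
}

// The direct shift ladder; swap in the carryless-multiply-based version to
// compare before and after.
fn prefixXor(input: u64) u64 {
    var z = input;
    z ^= z << 1;
    z ^= z << 2;
    z ^= z << 4;
    z ^= z << 8;
    z ^= z << 16;
    z ^= z << 32;
    return z;
}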