lehuyduc / 1brc-simd

Process 1 billion rows of text data as fast as possible

Some possible improvements #3

Open pcordes opened 9 months ago

pcordes commented 9 months ago

As discussed on Stack Overflow, I played around with the code a tiny bit to fix some warnings. https://godbolt.org/z/bdzoEP8b8 Maybe diff that against 1brc_valid14.cpp which is what I started with, to see what I changed. (IIRC I made comments on most changes.) Nothing huge, just minor notes like trying memset(..., 0, 100) fixed-size first. GCC uses rep stosq for -march=znver1, which seems crazy to me. Clang uses 3x vmovdqu ymm, which is good if aligned.
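
For reference, the kind of change meant there, as a sketch (HashEntry and the 100-byte key field are stand-ins, not names from the repo):

#include <cstring>

struct HashEntry { char key[100]; /* other fields */ };  // hypothetical stand-in

// Clearing with a compile-time-constant size lets the compiler inline the fill:
// as noted above, clang emits vmovdqu ymm stores while gcc -march=znver1 picks rep stosq.
static inline void clear_key(HashEntry* e) {
  std::memset(e->key, 0, 100);  // constant 100, not a runtime length
}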

There's other stuff that could be done, like avoiding narrow signed types to save a movsxd or cdqe sign-extension instruction in some places, but that's probably minor.
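
For illustration only (made-up snippet, not a specific spot in this code): a 32-bit signed index generally has to be widened with movsxd/cdqe before address math, while a pointer-width index does not:

#include <cstddef>

// `len_i` as int: the compiler must sign-extend it to 64-bit before adding it to the pointer.
char probe_signed(const char* data, int len_i) {
  return data[len_i];   // movsxd + load
}
// `len` as size_t: already pointer-width, so it feeds the addressing mode directly.
char probe_sized(const char* data, size_t len) {
  return data[len];     // just the load
}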

Moving my comments from SO to here so a moderator is less likely to swoop in and destroy all the comments, moving them to chat where their vote totals are lost and only a tiny fraction of readers will ever look at them. (I hate it when moderators do that.)

(I saw your response to that one already)


If you're working your way through a buffer with pointer updates based only on pos, that dependency-chain latency could be a bottleneck. Loading 2 YMM vectors and scanning them for newlines could give you a shorter critical-path latency. Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that. Unfortunately blsr is 2 uops on Zen 3 and earlier, but it's still pretty cheap, and it might save some of the computation based on pos, so it could end up being break-even.
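
A rough sketch of that idea (assuming AVX2 + BMI1; newline_map, scan_window and handle_line are made-up names standing in for whatever the real code does per line):

#include <immintrin.h>
#include <cstdint>

// Build a 64-bit map of '\n' positions covering data[0..63].
static inline uint64_t newline_map(const char* data) {
  __m256i nl = _mm256_set1_epi8('\n');
  uint32_t lo = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i*)data), nl));
  uint32_t hi = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i*)(data + 32)), nl));
  return (uint64_t)hi << 32 | lo;
}

static void scan_window(const char* data) {
  uint64_t mask = newline_map(data);
  while (mask) {
    uint32_t pos = _tzcnt_u64(mask);  // offset of the next '\n' in the window
    mask = _blsr_u64(mask);           // clear lowest set bit: the only loop-carried dependency
    (void)pos;                        // handle_line(data, pos);  -- hypothetical per-line work
  }
}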

Perhaps align the YMM loads by 16 for efficiency on Zen 1, and right-shift the resulting 64-bit mask by 0..15, using the low 4 bits of the address as a shift count. https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines .
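
A minimal sketch of that adjustment, assuming p is the current parse position and that rounding it down by up to 15 bytes stays inside already-mapped data (it reuses newline_map from the sketch above):

#include <cstdint>

// Zen 1 splits 256-bit loads into two 128-bit halves, so 16-byte alignment keeps
// each half from crossing a cache line unnecessarily.
static inline uint64_t newline_map_aligned(const char* p) {
  const char* base = (const char*)((uintptr_t)p & ~(uintptr_t)15);
  unsigned skip = (uintptr_t)p & 15;  // 0..15 bytes before p that get re-scanned
  return newline_map(base) >> skip;   // bit 0 now corresponds to p; covers 64 - skip bytes
}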

If you go until you get all the newlines from the window, that loop branch will probably mispredict often. So perhaps set a limit on the number of lines from that 64-byte window, such that most of the time the loop does that many iterations, only occasionally stopping early if the lines were very long. Like 4, 5, or 6 lines, something that's rarely exceeded in your typical data.
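
Concretely, continuing the sketch above, something like this (the cap of 4 lines per window and the unlikely macro are assumptions, not taken from the repo):

// Consume a fixed number of lines per 64-byte window so the loop branch is
// well-predicted most of the time; bail out early only when lines were unusually long.
for (int k = 0; k < 4; k++) {
  if (unlikely(mask == 0)) break;   // fewer than 4 newlines left in this window
  uint32_t pos = _tzcnt_u64(mask);
  mask = _blsr_u64(mask);
  // ... handle the line ending at `pos` ...
}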

Of course, if pointer-increment critical-path latency wasn't actually hurting out-of-order execution, this is all extra work for no benefit that costs throughput resources.


From another comment thread on SO under Simon's answer:

lehuyduc commented 9 months ago

Thanks for the many suggestions! I'll work on them.

RagnarGrootKoerkamp commented 9 months ago

I just ran into the stackoverflow question after googling how to mask SIMD bits :D

My ideas around branchless parsing are here (but you probably already read those).

But I also tried some other things earlier that I may revisit now that I'm also supporting longer city names:

RagnarGrootKoerkamp commented 9 months ago

Oh, also just came across this bit of GxHash: https://github.com/ogxd/gxhash/blob/main/src/gxhash/platform/x86.rs#L37

lehuyduc commented 9 months ago

As mentioned already, one can do two of these in parallel, make a u64 mask, and ctz on that.

Yeah, currently I only use __m128i, I'll need to switch to __m256i next.

__m256i separators = _mm256_set1_epi8(';');
uint32_t lo = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((__m256i*)data), separators));
uint32_t hi = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((__m256i*)(data + 32)), separators));
uint64_t mask = (uint64_t)hi << 32 | lo;
uint32_t pos = _tzcnt_u64(mask); // 64 when there's no ';' in data[0..63]
if (likely(pos < 64)) { ... }
else {
  // key longer than the 64-byte window: scalar fallback
  while (data[pos] != ';') pos++;
  ...
}

I think there's no way to avoid some branches if you want to handle all valid inputs? This can only be branchless if key_length < 64.

RagnarGrootKoerkamp commented 9 months ago

You could unconditionally compare 128 characters ;) If you do all the SIMD first, that reduces to two u64 ctz operations instead of one, which may be faster than an occasional branch miss. But the number of cities with length > 64 is going to be relatively low, so the branch miss is probably preferable.
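
Roughly, as a sketch (sep_mask64 stands for a hypothetical helper that builds a 64-byte ';' mask, like the ones sketched earlier in the thread):

uint64_t m0 = sep_mask64(data);          // ';' matches in data[0..63]
uint64_t m1 = sep_mask64(data + 64);     // ';' matches in data[64..127]
uint32_t p0 = _tzcnt_u64(m0);            // 64 when m0 == 0
uint32_t p1 = _tzcnt_u64(m1);
uint32_t pos = (p0 < 64) ? p0 : 64 + p1; // usually compiles to a cmov, not a hard-to-predict branch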

lehuyduc commented 9 months ago

Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data.

sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). Use _mm_cvtsi128_si64 to get the low element with movq.

I've fixed those in earlier versions, thanks! Also changed loadu into load where possible.
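
(For reference, the reduction being suggested there looks roughly like this sketch, assuming sumchars ends by horizontally reducing a __m128i that holds two 64-bit partial sums, e.g. from _mm_sad_epu8:)

#include <immintrin.h>
#include <cstdint>

static inline int64_t hsum_epi64(__m128i v) {
  // With AVX, vpunpckhqdq writes a fresh destination (no movdqa) and needs no
  // immediate byte; without AVX, _mm_shuffle_epi32(v, _MM_SHUFFLE(3,2,3,2)) is
  // the copy-and-shuffle that avoids the movdqa.
  __m128i hi  = _mm_unpackhi_epi64(v, v);
  __m128i sum = _mm_add_epi64(v, hi);
  return _mm_cvtsi128_si64(sum);        // movq: extract the low 64-bit element
}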

Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes masking?

Even in the new version, removing it doesn't make the code faster. I don't have time to test it properly yet, so I'll leave it until later. I think the compiler already removes the redundant work.

Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that.

I'm working on this next. But there's one thing I'm not sure about yet. Say I have already computed uint64_t separator_mask and uint64_t newline_mask:

__m256i bytes32_0 = _mm256_loadu_si256((__m256i*)data);
__m256i bytes32_1 = _mm256_loadu_si256((__m256i*)(data + 32));
// compute separator_mask and newline_mask from bytes32_0 / bytes32_1.
for (int t = 1; t <= 4; t++) {
  int pos = __builtin_ctzll(separator_mask);
  separator_mask &= separator_mask - 1; // clear lowest set bit (blsr)
  newline_mask &= newline_mask - 1;

  __m128i chars = _mm_loadu_si128((__m128i*)(data + pos)); // <= this part
}
...
data_idx += __builtin_ctzll(newline_mask);

In the first iteration, I can use bytes32_0 to compute things. But from the second iteration on, do I have to do normal loads again, or is there some other way? For example, what if the 2nd line spans bytes32_0 and bytes32_1?

lehuyduc commented 9 months ago

@pcordes Thanks for the suggestion! The 64-bit separator mask improves total performance by another 3-5% or so, but only on Zen 2 (and probably newer; tested on Zen 2). The effect is stronger at lower thread counts, which I guess makes sense because we're basically trying to do more work manually per thread.

Lesson: stop relying on Zen 1 to benchmark.

pcordes commented 9 months ago

@RagnarGrootKoerkamp commented:

  • We can also first preprocess a chunk of say 10-100kb of data and convert it to bitmasks stored in a separate vector. Then, the loop over data can be a bit simpler, and do unaligned u64 reads on this, on which you can count trailing zeros for any position 0 mod 8. As it turns out, all lines in the eval input have at least 8 characters, so there is at most a single '1' mask bit in each byte, making this work. (But for very short key names you'd need a workaround.)

Good idea, although you might want to cache-block for L1d cache size. Perhaps L2 size for the string data is fine, since the benefit is avoiding load-use latency as part of a short dependency chain, and not having to branch unpredictably when you run out of set bits in a short mask from one or two vectors. Maybe separate masks of '\n' and ';' positions? Or maybe a single mask of matches for both '\n' and ';', and assume they strictly alternate.

You could do something like 2 or 3 keys out of a u64, then mask >>= ctz to put the lowest set bit at the bottom, and mask |= (nextmask << (64 - ctz)). Or something like that; maybe not that simple, since we'd need to keep track of the bit position within the next u64 chunk that we haven't already consumed? Oh right, like you said: mask the shift count to a multiple of 8 so we can do an unaligned u64 load. And hopefully software-pipeline it somehow to hide that latency, if the hashing and inserting work doesn't hide it. Only doing it every few keys amortizes the cost, and pure integer (not SIMD + movemask) keeps the dep chain shorter.
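
A rough sketch of that refill, assuming maskbuf is the precomputed bitmask array (one match bit per input byte) with a few bytes of slack at the end; mask_window and pos are made-up names:

#include <cstdint>
#include <cstring>

// Resume the mask at absolute byte offset `pos` in the chunk: read an unaligned
// u64 starting at the byte that contains that bit, then drop the 0..7 already
// consumed bits. No cross-word shifting is needed because reloads are byte-granular.
static inline uint64_t mask_window(const uint8_t* maskbuf, uint64_t pos) {
  uint64_t w;
  std::memcpy(&w, maskbuf + pos / 8, 8);  // unaligned 8-byte load of mask bits
  return w >> (pos % 8);                  // bit 0 of w now corresponds to offset `pos`
}

// The inner loop can then pull a few separators per window with tzcnt/blsr,
// e.g. next = pos + __builtin_ctzll(w); w &= w - 1; and call
// mask_window(maskbuf, next + 1) again every 2-3 keys, as discussed above.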


You could multi-thread this, with one thread generating the next chunk of masks while another thread consumes the masks and inserts. So the insert thread is reading data that was written a while ago by another thread, and the string data is hot in L3 (if they're on the same CCX for Zen). Hardware prefetch should do well.