pcordes opened this issue 9 months ago
Thanks for the many suggestions! I'll work on them.
I just ran into the Stack Overflow question after googling how to mask SIMD bits :D
My ideas around branchless parsing are here (but you probably already read those).
But I also tried some other things earlier that I may revisit now that I'm also supporting longer city names:
- compare against '\n', convert to a u32 mask, and count trailing zeros
- do two of those to get a u64 mask, and ctz on that

Oh, also just came across this bit of GxHash: https://github.com/ogxd/gxhash/blob/main/src/gxhash/platform/x86.rs#L37
As mentioned already, one can do two of these in parallel, make a u64 mask, and ctz on that.
Yeah, currently I only use __m128i; I'll need to switch to __m256i next.
mm256 separators = array of ';'
uint64_t mask = compare(data[0:64], separators);
uint32_t pos = ctz(mask);
if (likely(pos < 64)) {      // found a ';' within this 64-byte window
    ...
} else {                     // rare: key longer than the window, fall back to a scalar scan
    while (data[pos] != ';') pos++;
    ...
}
I think there's no way to avoid some branches if you want to handle all valid inputs? This can only be branchless if key_length < 64
You could unconditionally compare 128 characters ;) If you do all the SIMD first, that reduces to two u64 ctz operations instead of one, which may be faster than an occasional branch miss. But the number of cities with length > 64 is going to be relatively low, so the branch miss is probably preferable.
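A minimal sketch of that 64-byte window search with AVX2 intrinsics, assuming the buffer is padded so a 64-byte load at the current position is always safe; find_semicolon is just an illustrative name, not code from this repo:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Offset of the next ';' at or after data.
// Common case: it's within the first 64 bytes, found with one ctz on the mask.
// Rare case (key longer than the window): fall back to a scalar scan.
static inline size_t find_semicolon(const char* data) {
    const __m256i sep = _mm256_set1_epi8(';');
    __m256i v0 = _mm256_loadu_si256((const __m256i*)data);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)(data + 32));
    uint64_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v0, sep));
    mask |= (uint64_t)(uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, sep)) << 32;
    if (__builtin_expect(mask != 0, 1))
        return (size_t)__builtin_ctzll(mask);
    size_t pos = 64;                      // no ';' in the window: very long key
    while (data[pos] != ';') pos++;
    return pos;
}
```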
Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data. sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). Use _mm_cvtsi128_si64 to get the low element with movq.
I've fixed those in earlier versions, thanks! Also changed loadu into load where possible.
Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes masking?
Even in the new version, removing it doesn't make the code faster. I don't have time to test it properly yet, so I'll leave it until later. I think the compiler already removes the redundant work.
Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that.
I'm working on this next. But there's one thing I'm not sure about yet. So I have already computed uint64_t separator_mask and uint64_t newline_mask:
__m256i bytes32_0 = _mm256_loadu_si256((__m256i*)data);
__m256i bytes32_1 = _mm256_loadu_si256((__m256i*)(data + 32));
// compute separator_mask and newline_mask.
for (int t = 1; t <= 4; t++) {
    int pos = __builtin_ctzll(separator_mask);  // 64-bit mask needs ctzll, not ctz
    separator_mask &= separator_mask - 1;       // remove lowest set bit
    newline_mask &= newline_mask - 1;
    __m128i chars = _mm_loadu_si128((__m128i*)(data + pos)); <= this part
}
...
data_idx += __builtin_ctzll(newline_mask);
In the first iteration, I can use bytes32_0 to compute things. But from the second iteration onward, do I have to load like normal again, or is there some other way? For example, what if the 2nd line is split between bytes32_0 and bytes32_1?
@pcordes Thanks for the suggestion! The 64-bit separator mask improves total performance by another 3-5% ish, but only on Zen 2 (and probably newer; only tested on Zen 2). The effect is bigger at lower thread counts, which I guess makes sense because we're basically doing more work manually per thread.
Lesson: stop relying on Zen 1 to benchmark.
@RagnarGrootKoerkamp commented:
- We can also first preprocess a chunk of say 10-100kb of data and convert it to bitmasks stored in a separate vector. Then, the loop over data can be a bit simpler, and do unaligned u64 reads on this, on which you can count trailing zeros for any position 0 mod 8. As it turns out, all lines in the eval input have at least 8 characters, so there is at most a single '1' mask bit in each byte, making this work. (But for very short key names you'd need a workaround.)
Good idea, although you might want to cache-block for L1d cache size. Perhaps L2 size for the string is fine, since the benefit is avoiding load-use latency as part of a short dependency chain, and not having to branch unpredictably when you run out of set bits in a short mask from one or two vectors. Maybe separate masks of '\n' and ';' positions? Or maybe a single mask of matches for both '\n' and ';', and assume they strictly alternate.
You could do something like 2 or 3 keys out of a u64, then mask >>= ctz to put the lowest set bit at the bottom, and mask |= (nextmask << (64-ctz)); Or something like that, maybe not that simple if we need to keep track of the bit position within the next u64 chunk that we haven't already consumed? Oh right, like you said, mask the shift count to a multiple of 8 so we can do an unaligned u64 load. And hopefully software-pipeline it somehow to hide that latency, if the hashing and inserting work don't hide it. Only doing it every few keys amortizes the cost, and pure integer (not SIMD + movemask) keeps the dep chain shorter.
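Pulling the two ideas together (the precomputed bitmask chunk from above and the byte-granularity reload), a rough sketch could look like this; the function name, the out-parameter, and the missing long-line fallback are all my own assumptions, not the repo's code:

```c
#include <stdint.h>
#include <string.h>

// mask[] holds one bit per data byte, set where data has '\n', precomputed for
// a chunk and padded so an 8-byte read never runs off the end. Because every
// line is at least 8 bytes, each mask byte contains at most one set bit, so
// after consuming a newline in mask byte b the next window can start at b + 1
// without skipping anything.
static size_t next_newline(const uint8_t* mask, size_t mask_byte, size_t* next_byte) {
    uint64_t w;
    memcpy(&w, mask + mask_byte, sizeof w);       // unaligned u64 read of the mask
    // Assumes a '\n' bit exists in this 64-byte window; a real implementation
    // would need a fallback for w == 0 (very long lines).
    unsigned bit = (unsigned)__builtin_ctzll(w);
    *next_byte = mask_byte + bit / 8 + 1;         // resume after the consumed byte
    return mask_byte * 8 + bit;                   // absolute '\n' position in data
}
```

Each call is one unaligned integer load plus a tzcnt on the critical path, with no SIMD compare or movemask per key.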
You could multi-thread this, with one thread generating the next chunk of masks while another thread consumes the masks and inserts. So the insert thread is reading data that was written a while ago by another thread, and the string data is hot in L3 (if they're on the same CCX for Zen). Hardware prefetch should do well.
As discussed on Stack Overflow, I played around with the code a tiny bit to fix some warnings. https://godbolt.org/z/bdzoEP8b8 Maybe diff that against 1brc_valid14.cpp which is what I started with, to see what I changed. (IIRC I made comments on most changes.) Nothing huge, just minor notes like trying memset(..., 0, 100) fixed-size first. GCC uses rep stosq for -march=znver1, which seems crazy to me. Clang uses 3x vmovdqu ymm, which is good if aligned.
There's other stuff that could be done, like avoiding narrow signed types to save a movsxd or cdqe sign-extension instruction in some places, but that's probably minor.

Moving my comments from SO to here so a moderator is less likely to swoop in and destroy all the comments, moving them to chat where their vote totals are lost and only a tiny fraction of readers will ever look at them. (I hate it when moderators do that.)
(I saw your response to that one already)
Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data. I think mmap should allow a larger mapping, at least with MAP_PRIVATE. If not, use mmap(MAP_FIXED_NOREPLACE) to map another page after the file-backed mapping to make sure a mapping exists. I guess it's possible the next page could be mapped but not readable, like a guard page for stack growth, in which case you're out of luck if the file length is within 15 bytes of the end of a page. It's always safe to read past the end of the file length as long as that's not into a new page. https://unix.stackexchange.com/questions/616848/what-is-the-behaviour-of-a-file-backed-memory-map-when-reading-from-or-writing-t
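For the MAP_FIXED_NOREPLACE fallback described above, a best-effort sketch might look like this (assumes Linux, 4 KiB pages, and a function name of my own choosing; not code from the repo):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

// Map the file read-only, then make sure a 16-byte load from the last data
// byte can't fault: if the file ends within 15 bytes of its final page, try
// to map one extra anonymous page right after it. MAP_FIXED_NOREPLACE refuses
// to clobber an existing mapping, so this stays best-effort (whatever is
// already mapped there might not be readable, as noted above).
static void* map_file_with_padding(int fd, size_t file_size) {
    char* base = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED)
        return NULL;
    const size_t page = 4096;
    size_t mapped_end = (file_size + page - 1) & ~(page - 1);
    if (mapped_end - file_size < 15)
        mmap(base + mapped_end, page, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    return base;
}
```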
sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
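For reference, the horizontal-sum shape being suggested might look roughly like this, assuming sumchars ends with 64-bit partial sums in a __m128i (e.g. after _mm_sad_epu8); a sketch, not the repo's actual reduction:

```c
#include <immintrin.h>
#include <stdint.h>

// Reduce the two 64-bit halves of a vector to a scalar.
// _mm_shuffle_epi32 (pshufd) writes a new destination register, so no movdqa
// is needed without AVX; with AVX, _mm_unpackhi_epi64 avoids the immediate.
static inline int64_t hsum_epi64(__m128i v) {
    __m128i hi  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)); // swap 64-bit halves
    __m128i sum = _mm_add_epi64(v, hi);
    return _mm_cvtsi128_si64(sum);  // movq the low element out, no pointer aliasing
}
```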
Pointing a uint64_t* at a __m128i is strict-aliasing UB. In 64-bit code for Linux, uint64_t is unsigned long, which isn't even alias-compatible with long long (which is the element type for __m128i). Use _mm_cvtsi128_si64 to get the low element with movq.
Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes masking? You could just pass it the vector variables you already have, at least for the non-SLOW_HASH version. (Which you can get rid of if you allocate or mmap a slightly larger buffer; otherwise I guess declare the vector variables outside the if block so you can still pass them and just not use them in one side of hash_insert.) You claim you tried but found it slower. IDK; if so, that might change if you eliminate the SLOW_HASH case or whatever other branching makes the compiler not sure those variables are already computed. Or look at the asm and see what looks inefficient to figure out why it would be slower to do less work. I don't see any non-inline function calls, so the variables should still be hot in regs. Maybe the compiler moved the work later, worsening critical-path latency for stuff that wants pos ready sooner?

If you're working your way through a buffer with pointer updates based only on pos, that dependency-chain latency could be a bottleneck. Loading 2 YMM vectors and scanning them for newlines could give you a shorter critical-path latency. Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that. Unfortunately blsr is 2 uops on Zen 3 and earlier, but it's still pretty cheap, and it might save some of the computation based on pos, so it could end up being break-even.

Perhaps align the YMM loads by 16 for efficiency on Zen 1, and right-shift the resulting 64-bit mask by 0..15, using the low 4 bits of the address as a shift count. https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines
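A rough sketch of that newline-bitmap loop, assuming AVX2 + BMI1 and a buffer where a 64-byte load is safe (the 16-byte-align-and-shift trick isn't shown); the function name and output array are mine, not the repo's:

```c
#include <immintrin.h>
#include <stdint.h>

// Build a 64-bit map of '\n' positions in data[0..63] and extract them.
// The loop-carried dependency is just blsr (clear lowest set bit), with
// tzcnt peeling off one position per iteration.
static inline int newline_positions_64(const char* data, uint8_t out[64]) {
    const __m256i nl = _mm256_set1_epi8('\n');
    __m256i v0 = _mm256_loadu_si256((const __m256i*)data);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)(data + 32));
    uint64_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v0, nl));
    mask |= (uint64_t)(uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, nl)) << 32;
    int n = 0;
    while (mask) {
        out[n++] = (uint8_t)_tzcnt_u64(mask);  // offset of this newline
        mask = _blsr_u64(mask);                // clear lowest set bit
    }
    return n;
}
```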
If you go until you get all the newlines from the window, that loop branch will probably mispredict often. So perhaps set a limit on the number of lines from that 64-byte window, such that most of the time the loop does that many iterations, only occasionally stopping early if the lines were very long. Like 4, 5, or 6 lines, something that's rarely exceeded in your typical data.
If pointer-increment critical path latency wasn't hurting out-of-order execution, this is extra work for no benefit that costs throughput resources.
From another comment thread on SO under Simon's answer:
You could just use calloc to get zeroed memory. It's either zeroed already by the kernel, or a wider zeroing operation that you overwrite with memcpy has negligible extra cost. (Doing the zeroing yourself with a compile-time-constant size might even be better, but for the calloc library function the length is a runtime variable. OTOH, it's always the same size so branching goes the same way.)

Your code has at least a couple bugs. IDK if they affect performance or not. GCC/clang warn about the memset size being a constant 0; you swapped the value and length args. Also, you deref a uint64_t* pointing at a __m128i. That's the opposite of dereferencing a __m128i*, and is not well-defined, like I wrote in the SO answer linked in a comment in the code! (And uint64_t aka unsigned long in a 64-bit build is not alias-compatible with long long.) It breaks in practice for int*. Use _mm_cvtsi128_si64 for vmovq to get the low half!
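Re the calloc point, a minimal sketch (Entry and alloc_table are hypothetical names, not from the code under discussion):

```c
#include <stdlib.h>

// calloc returns already-zeroed memory (fresh pages from the kernel are
// zero-filled), so no separate memset pass over the table is needed.
typedef struct { long long count, sum; int min, max; } Entry;

static Entry* alloc_table(size_t nbuckets) {
    return calloc(nbuckets, sizeof(Entry));
}
```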