lehuyduc / 1brc-simd

Process 1 billion rows of text data as fast as possible

Some possible improvements #3

Open pcordes opened 9 months ago

pcordes commented 9 months ago

As discussed on Stack Overflow, I played around with the code a tiny bit to fix some warnings. https://godbolt.org/z/bdzoEP8b8 Maybe diff that against 1brc_valid14.cpp which is what I started with, to see what I changed. (IIRC I made comments on most changes.) Nothing huge, just minor notes like trying memset(..., 0, 100) fixed-size first. GCC uses rep stosq for -march=znver1, which seems crazy to me. Clang uses 3x vmovdqu ymm, which is good if aligned.
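
For reference, the kind of change meant there, as a sketch (HashEntry and the 100-byte key field are stand-ins, not names from the repo):

#include <cstring>

struct HashEntry { char key[100]; /* other fields */ };  // hypothetical stand-in

// Clearing with a compile-time-constant size lets the compiler inline the fill:
// as noted above, clang emits vmovdqu ymm stores while gcc -march=znver1 picks rep stosq.
static inline void clear_key(HashEntry* e) {
  std::memset(e->key, 0, 100);  // constant 100, not a runtime length
}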

There's other stuff that could be done, like avoiding narrow signed types to save a movsxd or cdqe sign-extension instruction in some places, but that's probably minor.
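
For illustration only (made-up snippet, not a specific spot in this code): a 32-bit signed index generally has to be widened with movsxd/cdqe before address math, while a pointer-width index does not:

#include <cstddef>

// `len_i` as int: the compiler must sign-extend it to 64-bit before adding it to the pointer.
char probe_signed(const char* data, int len_i) {
  return data[len_i];   // movsxd + load
}
// `len` as size_t: already pointer-width, so it feeds the addressing mode directly.
char probe_sized(const char* data, size_t len) {
  return data[len];     // just the load
}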

Moving my comments from SO to here so a moderator is less likely to swoop in and destroy all the comments, moving them to chat where their vote totals are lost and only a tiny fraction of readers will ever look at them. (I hate it when moderators do that.)

(I saw your response to that one already)


If you're working your way through a buffer with pointer updates based only on pos, that dependency-chain latency could be a bottleneck. Loading 2 YMM vectors and scanning them for newlines could give you a shorter critical-path latency. Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that. Unfortunately blsr is 2 uops on Zen 3 and earlier, but it's still pretty cheap, and it might save some of the computation based on pos, so it could end up being break-even.
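
A rough sketch of that idea (assuming AVX2 + BMI1; newline_map, scan_window and handle_line are made-up names standing in for whatever the real code does per line):

#include <immintrin.h>
#include <cstdint>

// Build a 64-bit map of '\n' positions covering data[0..63].
static inline uint64_t newline_map(const char* data) {
  __m256i nl = _mm256_set1_epi8('\n');
  uint32_t lo = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i*)data), nl));
  uint32_t hi = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i*)(data + 32)), nl));
  return (uint64_t)hi << 32 | lo;
}

static void scan_window(const char* data) {
  uint64_t mask = newline_map(data);
  while (mask) {
    uint32_t pos = _tzcnt_u64(mask);  // offset of the next '\n' in the window
    mask = _blsr_u64(mask);           // clear lowest set bit: the only loop-carried dependency
    (void)pos;                        // handle_line(data, pos);  -- hypothetical per-line work
  }
}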

Perhaps align the YMM loads by 16 for efficiency on Zen 1, and right-shift the resulting 64-bit mask by 0..15, using the low 4 bits of the address as a shift count. https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines .
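
A minimal sketch of that adjustment, assuming p is the current parse position and that rounding it down by up to 15 bytes stays inside already-mapped data (it reuses newline_map from the sketch above):

#include <cstdint>

// Zen 1 splits 256-bit loads into two 128-bit halves, so 16-byte alignment keeps
// each half from crossing a cache line unnecessarily.
static inline uint64_t newline_map_aligned(const char* p) {
  const char* base = (const char*)((uintptr_t)p & ~(uintptr_t)15);
  unsigned skip = (uintptr_t)p & 15;  // 0..15 bytes before p that get re-scanned
  return newline_map(base) >> skip;   // bit 0 now corresponds to p; covers 64 - skip bytes
}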

If you go until you get all the newlines from the window, that loop branch will probably mispredict often. So perhaps set a limit on the number of lines from that 64-byte window, such that most of the time the loop does that many iterations, only occasionally stopping early if the lines were very long. Like 4, 5, or 6 lines, something that's rarely exceeded in your typical data.
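
Concretely, continuing the sketch above, something like this (the cap of 4 lines per window and the unlikely macro are assumptions, not taken from the repo):

// Consume a fixed number of lines per 64-byte window so the loop branch is
// well-predicted most of the time; bail out early only when lines were unusually long.
for (int k = 0; k < 4; k++) {
  if (unlikely(mask == 0)) break;   // fewer than 4 newlines left in this window
  uint32_t pos = _tzcnt_u64(mask);
  mask = _blsr_u64(mask);
  // ... handle the line ending at `pos` ...
}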

Of course, if pointer-increment critical-path latency wasn't actually hurting out-of-order execution, this is all extra work for no benefit that costs throughput resources.


From another comment thread on SO under Simon's answer:

lehuyduc commented 9 months ago

Thanks for the many suggestions! I'll work on them.

RagnarGrootKoerkamp commented 9 months ago

I just ran into the stackoverflow question after googling how to mask SIMD bits :D

My ideas around branchless parsing are here (but you probably already read those).

But I also tried some other things earlier that I may revisit now that I'm also supporting longer city names:

RagnarGrootKoerkamp commented 9 months ago

Oh, also just came across this bit of GxHash: https://github.com/ogxd/gxhash/blob/main/src/gxhash/platform/x86.rs#L37

lehuyduc commented 9 months ago

As mentioned already, one can do two of these in parallel, make a u64 mask, and ctz on that.

Yeah, currently I only use __m128i, I'll need to switch to __m256i next.

__m256i separators = _mm256_set1_epi8(';');
uint32_t lo = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((__m256i*)data), separators));
uint32_t hi = _mm256_movemask_epi8(_mm256_cmpeq_epi8(_mm256_loadu_si256((__m256i*)(data + 32)), separators));
uint64_t mask = (uint64_t)hi << 32 | lo;
uint32_t pos = _tzcnt_u64(mask); // 64 when there's no ';' in data[0..63]
if (likely(pos < 64)) { ... }
else {
  // key longer than the 64-byte window: scalar fallback
  while (data[pos] != ';') pos++;
  ...
}

I think there's no way to avoid some branches if you want to handle all valid inputs? This can only be branchless if key_length < 64.

RagnarGrootKoerkamp commented 9 months ago

You could unconditionally compare 128 characters ;) If you do all the SIMD first, that reduces to two u64 ctz operations instead of one, which may be faster than an occasional branch miss. But the number of cities with length > 64 is going to be relatively low, so the branch miss is probably preferable.
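
Roughly, as a sketch (sep_mask64 stands for a hypothetical helper that builds a 64-byte ';' mask, like the ones sketched earlier in the thread):

uint64_t m0 = sep_mask64(data);          // ';' matches in data[0..63]
uint64_t m1 = sep_mask64(data + 64);     // ';' matches in data[64..127]
uint32_t p0 = _tzcnt_u64(m0);            // 64 when m0 == 0
uint32_t p1 = _tzcnt_u64(m1);
uint32_t pos = (p0 < 64) ? p0 : 64 + p1; // usually compiles to a cmov, not a hard-to-predict branch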

lehuyduc commented 9 months ago

Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data.

sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). Use _mm_cvtsi128_si64 to get the low element with movq.

I've fixed those in earlier versions, thanks! Also changed loadu into load where possible.
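
(For reference, the reduction being suggested there looks roughly like this sketch, assuming sumchars ends by horizontally reducing a __m128i that holds two 64-bit partial sums, e.g. from _mm_sad_epu8:)

#include <immintrin.h>
#include <cstdint>

static inline int64_t hsum_epi64(__m128i v) {
  // With AVX, vpunpckhqdq writes a fresh destination (no movdqa) and needs no
  // immediate byte; without AVX, _mm_shuffle_epi32(v, _MM_SHUFFLE(3,2,3,2)) is
  // the copy-and-shuffle that avoids the movdqa.
  __m128i hi  = _mm_unpackhi_epi64(v, v);
  __m128i sum = _mm_add_epi64(v, hi);
  return _mm_cvtsi128_si64(sum);        // movq: extract the low 64-bit element
}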

Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes masking?

Even in the new version, removing it doesn't make the code faster. I don't have time to test it properly yet, so I'll leave it until later. I think the compiler already removes the redundant work.

Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that.

I'm working on this next. But there's one thing I'm not sure about yet. Say I have already computed uint64_t separator_mask and uint64_t newline_mask:

__m256i bytes32_0 = _mm256_loadu_si256((__m256i*)data);
__m256i bytes32_1 = _mm256_loadu_si256((__m256i*)(data + 32));
// compute separator_mask and newline_mask from bytes32_0 / bytes32_1.
for (int t = 1; t <= 4; t++) {
  int pos = __builtin_ctzll(separator_mask);
  separator_mask &= separator_mask - 1; // clear lowest set bit (blsr)
  newline_mask &= newline_mask - 1;

  __m128i chars = _mm_loadu_si128((__m128i*)(data + pos)); // <= this part
}
...
data_idx += __builtin_ctzll(newline_mask);

In the first iteration, I can use bytes32_0 to compute things. But from the second iteration on, do I have to do normal loads again, or is there some other way? For example, what if the 2nd line spans bytes32_0 and bytes32_1?

lehuyduc commented 9 months ago

@pcordes Thanks for the suggestion! The 64-bit separator mask improves total performance by another 3-5% or so, but only on Zen 2 (and probably newer; tested on Zen 2). The effect is stronger at lower thread counts, which I guess makes sense because we're basically trying to do more work manually per thread.

Lesson: stop relying on Zen 1 to benchmark.

pcordes commented 9 months ago

@RagnarGrootKoerkamp commented:

  • We can also first preprocess a chunk of say 10-100kb of data and convert it to bitmasks stored in a separate vector. Then, the loop over data can be a bit simpler, and do unaligned u64 reads on this, on which you can count trailing zeros for any position 0 mod 8. As it turns out, all lines in the eval input have at least 8 characters, so there is at most a single '1' mask bit in each byte, making this work. (But for very short key names you'd need a workaround.)

Good idea, although you might want to cache-block for L1d cache size. Perhaps L2 size for the string data is fine, since the benefit is avoiding load-use latency as part of a short dependency chain, and not having to branch unpredictably when you run out of set bits in a short mask from one or two vectors. Maybe separate masks of '\n' and ';' positions? Or maybe a single mask of matches for both '\n' and ';', and assume they strictly alternate.

You could do something like 2 or 3 keys out of a u64, then mask >>= ctz to put the lowest set bit at the bottom, and mask |= (nextmask << (64 - ctz)). Or something like that; maybe not that simple, since we'd need to keep track of the bit position within the next u64 chunk that we haven't already consumed? Oh right, like you said: mask the shift count to a multiple of 8 so we can do an unaligned u64 load. And hopefully software-pipeline it somehow to hide that latency, if the hashing and inserting work doesn't hide it. Only doing it every few keys amortizes the cost, and pure integer (not SIMD + movemask) keeps the dep chain shorter.
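
A rough sketch of that refill, assuming maskbuf is the precomputed bitmask array (one match bit per input byte) with a few bytes of slack at the end; mask_window and pos are made-up names:

#include <cstdint>
#include <cstring>

// Resume the mask at absolute byte offset `pos` in the chunk: read an unaligned
// u64 starting at the byte that contains that bit, then drop the 0..7 already
// consumed bits. No cross-word shifting is needed because reloads are byte-granular.
static inline uint64_t mask_window(const uint8_t* maskbuf, uint64_t pos) {
  uint64_t w;
  std::memcpy(&w, maskbuf + pos / 8, 8);  // unaligned 8-byte load of mask bits
  return w >> (pos % 8);                  // bit 0 of w now corresponds to offset `pos`
}

// The inner loop can then pull a few separators per window with tzcnt/blsr,
// e.g. next = pos + __builtin_ctzll(w); w &= w - 1; and call
// mask_window(maskbuf, next + 1) again every 2-3 keys, as discussed above.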


You could multi-thread this, with one thread generating the next chunk of masks while another thread consumes the masks and inserts. So the insert thread is reading data that was written a while ago by another thread, and the string data is hot in L3 (if they're on the same CCX for Zen). Hardware prefetch should do well.