noahfalk opened this issue 7 months ago
@noahfalk thanks for getting in touch, and the mention is fine. However, I cannot reproduce the rather poor performance you see for my implementation; if I run yours on the default 1B dataset I get the same perf as my own. I do see better numbers for the 10k real station names though, very nice! So I'm not sure what is going on here; compared to the README numbers, this does not match what I see on my machine after a quick run, so more testing is perhaps advisable. Perhaps ask to be included at https://github.com/buybackoff/1brc in https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/#results
As far as I can tell from a quick look, your approach seems similar to mine but with quad unrolling, and unrolling is also something I have looked at.
PS: I haven't looked at actual results.
I see you already made a request for buybackoff so I will wait and see the results for more :)
> However, I cannot reproduce the rather poor performance you see for my implementation or rather if I run yours on 1B default then I get same perf as my own
Yeah, I was a little surprised. When I ran our entries together on my Windows dev machine there was clearly a difference, but not as prominent as it wound up being on the CCX33 machine that I used for my pseudo-official results. It's always possible that I messed up something in the benchmarking, but @buybackoff's results also seem to confirm a difference. My best guess is that we've got machine-dependent factors at play and the results depend substantially on which hardware is making the measurement. If you are interested in posting any results from your own hardware, I'm happy to link to them to create a more comprehensive picture of how different hardware is affecting results.
Thanks!
On the same cores as before I have this
If I use all cores of my hybrid i5-13500, I get this:
It's time to get AWS metal spot instances... But I do not have time for that now :)
Incredible! I do think it's likely @noahfalk's solution works a lot better with fewer cores relative to mem bw; that's at least a theory, given my PC has 16c/32t and dual-channel DDR4, so the ratio of cores to mem bw is high. I'll try to rerun and post some numbers tomorrow.
Limiting max clock freq will also favor SIMD-heavy stuff, of course; that's also why I think the numbers differ from my machine, where the clock is not reduced. Unrolling and going wide with SIMD makes sense in that case. Perhaps @noahfalk you can share full machine details of your local run. In any case, I'd definitely think this beats anything else on the Hetzner box, with its high mem bw, no SMT, and only 8 cores at lower clocks. Really impressive work.
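The cores-vs-bandwidth theory above is easy to put numbers on. A back-of-envelope sketch (in Python for illustration, not from either entry; the DDR4-3200 dual-channel figure for the 5950X box is an assumption, as actual memory clocks vary):

```python
def peak_bandwidth_gb_s(channels: int, bus_bytes: int, mt_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth = channels * bus width (bytes) * transfers/s."""
    return channels * bus_bytes * mt_per_s / 1e9

# Assumed config: dual-channel DDR4-3200, 64-bit (8-byte) bus per channel.
peak = peak_bandwidth_gb_s(channels=2, bus_bytes=8, mt_per_s=3_200_000_000)
print(f"theoretical peak: {peak:.1f} GB/s")        # 51.2 GB/s
print(f"per core, 16 cores: {peak / 16:.1f} GB/s") # ~3.2 GB/s per core
print(f"per core, 8 cores:  {peak / 8:.1f} GB/s")  # ~6.4 GB/s per core
```

With many cores sharing the same channels, each core gets a small slice of bandwidth and a compute-heavy (wide SIMD, unrolled) inner loop has less room to shine; with fewer cores per channel the balance shifts.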
If it helps here is info from the CCX33 machine where I ran my numbers:
root@ubuntu-32gb-hil-1:~/git/gunnarmorling_1brc# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 4890.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Let me know if there is any other info you want me to grab from that machine that would help out.
@noahfalk sorry, I thought you said you had a "Windows dev machine" where you saw a big difference. Anyway, on the "default" 1B dataset on an AMD 5950X (16c/32t, dual-channel DDR4, Windows 10) the difference is minuscule (5%). The latter is mine.
On the 10k set there is a larger difference, but still only ~15%.
Hence, it is very machine dependent 😊
> @noahfalk sorry I thought you said you had a "Windows dev machine"
Oh yes, I do. I just didn't realize that was the machine you were talking about, my bad. I'll grab the numbers from that machine and a little machine info later tonight.
So, my Windows dev machine is a Core i7-9700K @ 3.6 GHz, 8 cores, 8 hardware threads. I've measured it at ~14 GB/s single-threaded RAM bandwidth and ~19 GB/s multi-threaded. This is not a terribly quiet machine and I think it may have thermal issues, so I do not trust it as a high-quality benchmarking environment.
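For context, one rough way to get a single-threaded bandwidth number like the ~14 GB/s above is to time large buffer copies. A Python sketch (illustrative only; this is not how the figure above was measured, and dedicated tools such as STREAM give far more reliable numbers):

```python
import time

def estimate_copy_bandwidth(size_mb: int = 512, repeats: int = 5) -> float:
    """Rough single-threaded RAM bandwidth estimate via large buffer copies.
    Counts read + write traffic, so each copy moves ~2x the buffer size."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full copy: size_mb read + size_mb written
        best = min(best, time.perf_counter() - t0)
        del dst
    # Total bytes moved per copy / best observed time, in GB/s.
    return 2 * size_mb / 1024 / best

print(f"~{estimate_copy_bandwidth():.1f} GB/s")
```

The buffer must be much larger than the last-level cache, otherwise you measure cache bandwidth instead of DRAM.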
Running our respective entries this is the performance I see at the moment:
C:\git\noahfalk_1brc\1brc\bin\Release\net8.0\win-x64\publish>hyperfine -w 2 -r 5 "1brc.exe C:\git\1brc_data\measurements.txt"
Benchmark 1: 1brc.exe C:\git\1brc_data\measurements.txt
Time (mean ± σ): 1.213 s ± 0.007 s [User: 4.812 s, System: 3.728 s]
Range (min … max): 1.207 s … 1.223 s 5 runs
C:\git\noahfalk_1brc\1brc\bin\Release\net8.0\win-x64\publish>hyperfine -w 2 -r 5 "1brc.exe C:\git\1brc_data\measurements-10K.txt"
Benchmark 1: 1brc.exe C:\git\1brc_data\measurements-10K.txt
Time (mean ± σ): 2.823 s ± 0.049 s [User: 17.906 s, System: 3.589 s]
Range (min … max): 2.771 s … 2.899 s 5 runs
C:\git\nietras_1brc\publish\Brc_AnyCPU_Release_net8.0_win-x64>hyperfine -w 2 -r 5 "Brc.exe C:\git\1brc_data\measurements.txt"
Benchmark 1: Brc.exe C:\git\1brc_data\measurements.txt
Time (mean ± σ): 2.580 s ± 0.029 s [User: 11.356 s, System: 4.444 s]
Range (min … max): 2.547 s … 2.609 s 5 runs
C:\git\nietras_1brc\publish\Brc_AnyCPU_Release_net8.0_win-x64>hyperfine -w 2 -r 5 "Brc.exe C:\git\1brc_data\measurements-10K.txt"
Benchmark 1: Brc.exe C:\git\1brc_data\measurements-10K.txt
Time (mean ± σ): 4.402 s ± 0.038 s [User: 24.646 s, System: 4.770 s]
Range (min … max): 4.355 s … 4.460 s 5 runs
C:\git\nietras_1brc>git log
commit 3230222926367c9d64e4990e61b762750cad6b2f (HEAD -> main, origin/main, origin/HEAD)
Author: nietras <nietras@users.noreply.github.com>
Date: Sun Jan 14 12:42:31 2024 +0100
Use prime for hash map capacity and magic number remainder for index (#3)
I think all of us need to share our datasets :rofl: Both ./create_measurements.sh and ./create_measurements3.sh are non-deterministic, so everyone has their own datasets and we each optimize our hash tables for the ones we have.
Personally, I picked my hash table constants by running ./create_measurements3.sh 5 times, then finding a prime number that has a low average collision count across all 5 datasets. If I test more, I can definitely find inputs that cause my code to have 10-20% more hash collisions than average.
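The prime-selection approach described above can be sketched roughly as follows (a Python illustration, not the actual entry's C# code; the candidate primes and the synthetic station names are made up, and note that Python's `str` hash is randomized per process, so results vary between runs):

```python
import random
import string

def collision_stats(keys, capacity):
    """Count how many keys land in an already-occupied bucket
    for a given table capacity (hash % capacity indexing)."""
    occupied = set()
    collisions = 0
    for k in keys:
        b = hash(k) % capacity
        if b in occupied:
            collisions += 1
        else:
            occupied.add(b)
    return collisions

# Simulate several non-deterministic datasets of ~10k station names,
# then pick the candidate prime with the lowest average collision count.
candidates = [16381, 32749, 65521]  # primes near powers of two (illustrative)
datasets = []
for seed in range(5):
    rng = random.Random(seed)
    datasets.append({"".join(rng.choices(string.ascii_lowercase, k=8))
                     for _ in range(10_000)})

best = min(candidates,
           key=lambda p: sum(collision_stats(d, p) for d in datasets) / len(datasets))
print("best capacity:", best)
```

As the comment above notes, a capacity tuned this way is only good on average: adversarial or simply unlucky datasets can still push the collision rate 10-20% above it.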
At the end of the contest I'll also help run more people's solutions, so all of us can have more data points to compare.
Hey @nietras, I just wanted to let you know that I benchmarked my 1brc attempt against your implementation and mentioned you in my README. You've got the fastest established C# implementation I was aware of, so it seemed like an important baseline to have. If you have any concerns or questions about what I wrote there (or if you'd like me to remove any of it), just let me know. Thanks! https://github.com/noahfalk/1brc/tree/main