nietras / 1brc.cs

1️⃣🐝🏎️ The One Billion Row Challenge -- C# Edition -- nietras
Apache License 2.0

Benchmarking relative to your entry #4

Open noahfalk opened 7 months ago

noahfalk commented 7 months ago

Hey @nietras, I just wanted to let you know that I benchmarked my 1brc attempt against your implementation and mentioned you in my README. You've got the fastest established C# implementation I was aware of so it seemed like an important baseline to have. If you have any concerns or questions about what I wrote there (or if you'd like me to remove any of it), just let me know. Thanks! https://github.com/noahfalk/1brc/tree/main

nietras commented 7 months ago

@noahfalk thanks for getting in touch, and the mention is fine. However, I cannot reproduce the rather poor performance you report for my implementation: if I run yours on the default 1B file, I get the same performance as my own. I do see better numbers for the 10k real station names though, very nice! So I'm not sure what is going on; the README numbers do not match what I see on my machine after a quick run, so more testing is perhaps advisable. Perhaps ask to be included at https://github.com/buybackoff/1brc in https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-fastest-on-linux-my-optimization-journey/#results

As far as I can tell from a quick look, your approach seems similar to mine but with quad unrolling, and unrolling is also something I have looked at.

PS: I haven't looked at actual results.

nietras commented 7 months ago

I see you already made a request for buybackoff so I will wait and see the results for more :)

noahfalk commented 7 months ago

> However, I cannot reproduce the rather poor performance you see for my implementation or rather if I run yours on 1B default then I get same perf as my own

Yeah, I was a little surprised. When I ran our entries together on my Windows dev machine there was clearly a difference, but not as prominent as it wound up being on the CCX33 machine that I used for my pseudo-official results. It's always possible that I messed up something in the benchmarking, but @buybackoff's results also seem to confirm a difference. My best guess is that we've got machine-dependent factors at play and the results depend substantially on which hardware is making the measurement. If you are interested in posting any results from your own hardware, I'm happy to link to them to create a more comprehensive picture of how different hardware affects results.

Thanks!

buybackoff commented 7 months ago

On the same cores as before I have this: [image: benchmark results]

If I use all cores of my hybrid i5-13500, I get this:

[image: benchmark results, all cores]

It's time to get AWS metal spot instances... But I do not have time for that now :)

nietras commented 7 months ago

Incredible! I do think it's likely @noahfalk's solution works a lot better when there are fewer cores relative to memory bandwidth; that's at least a theory, given my PC has 16c/32t and dual-channel DDR4, so its ratio of cores to memory bandwidth is high. I'll try to rerun and post some numbers tomorrow.

nietras commented 7 months ago

Limiting max clock frequency will also favor SIMD-heavy code, of course; that is also why I think the numbers differ from my machine, where the clock is not reduced. Unrolling and going wide with SIMD makes sense in that case. Perhaps @noahfalk you can share full machine details for your local run. In any case, I'd definitely expect this to beat anything else on the Hetzner box with its high memory bandwidth, no SMT, and only 8 cores at lower clocks. Really impressive work.

noahfalk commented 7 months ago

If it helps here is info from the CCX33 machine where I ran my numbers:

root@ubuntu-32gb-hil-1:~/git/gunnarmorling_1brc# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC-Milan Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4890.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Let me know if there is any other info you want me to grab from that machine that would help out.

nietras commented 7 months ago

@noahfalk sorry, I thought you said you had a "Windows dev machine" where you saw a big difference. Anyway, on the "default" 1B file on an AMD 5950X (16c/32t), dual-channel DDR4, Windows 10, the difference is minuscule (~5%). The latter is mine.

[image: benchmark result, default 1B (noahfalk)]

[image: benchmark result, default 1B (nietras)]

On the 10k set there is a larger difference, but still only ~15%.

[image: benchmark result, 10k set (noahfalk)]

[image: benchmark result, 10k set (nietras)]

Hence, it is very machine dependent 😊

noahfalk commented 7 months ago

> @noahfalk sorry I thought you said you had a "Windows dev machine"

Oh yes I do, I just didn't realize that was the machine you were talking about, my bad. I'll grab the numbers from that machine and a little machine info later tonight.

noahfalk commented 7 months ago

So my Windows dev machine is a Core i7-9700K @ 3.6 GHz: 8 cores, 8 hardware threads. I've measured it at ~14 GB/s single-threaded RAM bandwidth and ~19 GB/s multi-threaded. It is not a terribly quiet machine, and I think it may have thermal issues, so I do not trust it as a high-quality benchmarking environment.
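(Editor's aside: noahfalk quotes rough RAM bandwidth figures above. For readers who want a ballpark number for their own machine, the sketch below estimates single-threaded sequential-read bandwidth by summing a large array and timing it. The array size and method are illustrative assumptions, not how the figures above were obtained; results are sensitive to caches, clocks, and background load.)

```csharp
using System;
using System.Diagnostics;

public class BandwidthSketch
{
    // Sum a large array and report bytes read per second.
    // This is only a rough sequential-read estimate.
    public static double MeasureGBPerSec(long[] data)
    {
        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++) sum += data[i];
        sw.Stop();
        // Consume the checksum so the loop cannot be optimized away.
        if (sum == long.MinValue) Console.WriteLine(sum);
        return data.Length * sizeof(long) / 1e9 / sw.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        const int n = 16 * 1024 * 1024;      // 128 MiB of longs
        var data = new long[n];
        for (int i = 0; i < n; i++) data[i] = i; // touch all pages first
        Console.WriteLine($"~{MeasureGBPerSec(data):F1} GB/s single-threaded");
    }
}
```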

Running our respective entries this is the performance I see at the moment:

C:\git\noahfalk_1brc\1brc\bin\Release\net8.0\win-x64\publish>hyperfine -w 2 -r 5 "1brc.exe C:\git\1brc_data\measurements.txt"
Benchmark 1: 1brc.exe C:\git\1brc_data\measurements.txt
  Time (mean ± σ):      1.213 s ±  0.007 s    [User: 4.812 s, System: 3.728 s]
  Range (min … max):    1.207 s …  1.223 s    5 runs
C:\git\noahfalk_1brc\1brc\bin\Release\net8.0\win-x64\publish>hyperfine -w 2 -r 5 "1brc.exe C:\git\1brc_data\measurements-10K.txt"
Benchmark 1: 1brc.exe C:\git\1brc_data\measurements-10K.txt
  Time (mean ± σ):      2.823 s ±  0.049 s    [User: 17.906 s, System: 3.589 s]
  Range (min … max):    2.771 s …  2.899 s    5 runs

C:\git\nietras_1brc\publish\Brc_AnyCPU_Release_net8.0_win-x64>hyperfine -w 2 -r 5 "Brc.exe C:\git\1brc_data\measurements.txt"
Benchmark 1: Brc.exe C:\git\1brc_data\measurements.txt
  Time (mean ± σ):      2.580 s ±  0.029 s    [User: 11.356 s, System: 4.444 s]
  Range (min … max):    2.547 s …  2.609 s    5 runs
C:\git\nietras_1brc\publish\Brc_AnyCPU_Release_net8.0_win-x64>hyperfine -w 2 -r 5 "Brc.exe C:\git\1brc_data\measurements-10K.txt"
Benchmark 1: Brc.exe C:\git\1brc_data\measurements-10K.txt
  Time (mean ± σ):      4.402 s ±  0.038 s    [User: 24.646 s, System: 4.770 s]
  Range (min … max):    4.355 s …  4.460 s    5 runs

C:\git\nietras_1brc>git log
commit 3230222926367c9d64e4990e61b762750cad6b2f (HEAD -> main, origin/main, origin/HEAD)
Author: nietras <nietras@users.noreply.github.com>
Date:   Sun Jan 14 12:42:31 2024 +0100

    Use prime for hash map capacity and magic number remainder for index (#3)
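(Editor's aside: the commit message above mentions a prime hash-map capacity with a "magic number" remainder for the index. One common way to do that, sketched below, is Lemire's fastmod, which replaces the division in `hash % prime` with a 128-bit multiply using a precomputed constant. The prime `10007` and this particular trick are illustrative assumptions; the actual code in the repo may differ.)

```csharp
using System;

public class PrimeIndex
{
    // Hypothetical prime capacity; the value used in the actual repo may differ.
    public const uint Prime = 10007;

    // Precomputed "magic number": ceil(2^64 / Prime).
    static readonly ulong M = ulong.MaxValue / Prime + 1;

    // Lemire's fastmod: remainder of hash / Prime without a div instruction.
    public static uint Remainder(uint hash)
    {
        ulong lowbits = M * hash;                      // wraps mod 2^64
        ulong hi = Math.BigMul(lowbits, Prime, out _); // high 64 bits of the 128-bit product
        return (uint)hi;
    }

    public static void Main()
    {
        // Sanity-check against the plain % operator on random hashes.
        var rng = new Random(1);
        for (int i = 0; i < 1_000_000; i++)
        {
            uint h = (uint)rng.Next() ^ ((uint)rng.Next() << 16);
            if (Remainder(h) != h % Prime) throw new Exception($"mismatch for {h}");
        }
        Console.WriteLine("fastmod matches % for 1M random hashes");
    }
}
```

The win is that division is one of the slowest integer instructions, while the multiply-high sequence pipelines well inside a hot hashing loop.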

lehuyduc commented 7 months ago

I think all of us need to share our datasets :rofl: Both ./create_measurements.sh and ./create_measurements3.sh are non-deterministic, so everyone has their own datasets, and we each optimize our hash tables for the ones we have.

Personally, I picked my hash table constants by running ./create_measurements3.sh 5 times, then finding a prime number with a low average collision count across all 5 datasets. If I tested more, I could definitely find inputs that cause my code to have 10-20% more hash collisions than average.
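(Editor's aside: the constant-picking procedure lehuyduc describes can be sketched as a small scan: hash a dataset's station names into buckets for each candidate prime capacity and count how many keys land in an already-occupied bucket. The candidate primes, the synthetic key names, and the use of .NET's `string.GetHashCode` are illustrative assumptions; the actual entries use their own hash functions and real measurement files.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class CollisionScan
{
    // Count keys whose bucket is already occupied for a given table capacity.
    public static int Collisions(IEnumerable<string> keys, int capacity)
    {
        var seen = new bool[capacity];
        int collisions = 0;
        foreach (var key in keys)
        {
            int bucket = (int)((uint)key.GetHashCode() % (uint)capacity);
            if (seen[bucket]) collisions++;
            else seen[bucket] = true;
        }
        return collisions;
    }

    public static void Main()
    {
        // Hypothetical station names standing in for one generated dataset;
        // a real scan would read names from each measurements file and
        // average the counts over several datasets before picking a prime.
        var keys = Enumerable.Range(0, 10_000).Select(i => "station" + i).ToList();
        foreach (int prime in new[] { 16381, 20011, 24571, 32749 })
            Console.WriteLine($"capacity {prime}: {Collisions(keys, prime)} collisions");
    }
}
```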

At the end of the contest I'll also help run more people's solutions, so all of us can have more data points to compare.