lehuyduc / 1brc-simd

Process 1 billion rows of text data as fast as possible

Added to stable comparison #1

Open buybackoff opened 10 months ago

buybackoff commented 10 months ago

So far the top result overall. Great job!

https://github.com/buybackoff/1brc?tab=readme-ov-file#native


lehuyduc commented 10 months ago

Thanks! Can you run it again with N_THREADS set to the number of hardware threads (cores * 2) on the test machine? Running 128 threads on a 12-thread PC is slower than running just 12.

lehuyduc commented 10 months ago

I've updated the repo so that run_cpp.sh automatically detects the number of threads to use and compiles with that number as a constexpr. Could you pull and run it again? Thanks!
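
A minimal sketch of what a compile-time thread count can look like, for readers following along. It assumes the build script passes the detected count with a compiler flag such as -DN_THREADS=12. The fallback value, `kCompiledThreads`, `effective_threads` and `threads_to_use` are hypothetical names for illustration, not code from the repo.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Assumption: the build script injects the detected count, e.g. -DN_THREADS=12.
// The fallback below exists only so this sketch compiles on its own.
#ifndef N_THREADS
#define N_THREADS 8
#endif

// The count is a compile-time constant, so it can size arrays and unroll loops.
constexpr unsigned kCompiledThreads = N_THREADS;
static_assert(kCompiledThreads > 0, "N_THREADS must be positive");

// Hypothetical helper: never run more threads than the hardware reports.
// std::thread::hardware_concurrency() may return 0 when the value is unknown,
// in which case the compiled-in count is used unchanged.
unsigned effective_threads(unsigned compiled, unsigned hardware) {
    assert(compiled > 0);
    if (hardware == 0) return compiled;
    return std::min(compiled, hardware);
}

// Thread count to use on the machine this binary is running on.
unsigned threads_to_use() {
    return effective_threads(kCompiledThreads, std::thread::hardware_concurrency());
}
```

With a clamp like this, a binary compiled for 128 threads would fall back to 12 on a 12-thread PC instead of oversubscribing it.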

AlexanderYastrebov commented 10 months ago

On my machine I get

$ chmod +x run_cpp.sh
$ ./run_cpp.sh 
Using 8 threads
init mmap file cost = 0.041405ms
Parallel process file cost = 18528.8ms
Aggregate stats cost = 1.22172ms
Output stats cost = 1.90048ms
Runtime inside main = 18532ms
Time to munmap = 263.052ms

real    0m18,809s
user    0m30,831s
sys     0m7,561s
867e3fb8c93a52eddb94bd99b8b87c1f355bd225a58249eba97ce4eb87eb58cc  result.txt

and for my Go version (https://github.com/AlexanderYastrebov/1brc/tree/go-implementation/src/main/go):

$ (cd src/main/go/ && go build -o /tmp/1brc-go && time /tmp/1brc-go ../../../measurements-109.txt) | sha256sum 

real    0m20,917s
user    0m45,232s
sys     0m7,821s
867e3fb8c93a52eddb94bd99b8b87c1f355bd225a58249eba97ce4eb87eb58cc  -

@buybackoff would it be possible to add a record for the Go implementation to your chart?

I proposed in https://github.com/gunnarmorling/1brc/discussions/253 to support Docker builds but this is out of scope for now as per https://github.com/gunnarmorling/1brc/pull/182#discussion_r1445841593

lehuyduc commented 10 months ago

That looks unusually slow o.O I get 13.5-14.5 with 1 thread on 2 separate test PCs.

I think PC specs should be included any time a benchmark number is posted, along with any related inputs.

AlexanderYastrebov commented 10 months ago

@lehuyduc Sure, that's why I would like to know how it stands compared to your version, C# and the best Java. PS: I posted my specs in https://github.com/gunnarmorling/1brc/discussions/67

buybackoff commented 10 months ago

> @buybackoff would it be possible to add a record for the Go implementation to your chart?

@AlexanderYastrebov Will do eventually, not today/tomorrow.

@lehuyduc I'm reading your 1brc_final_valid.cpp and this is cool stuff. I would say [Samuel L. Jackson from Pulp Fiction]-level cool. And it's quite simple nevertheless, but some of the tricks/hacks are beautiful. Very educational!

lehuyduc commented 10 months ago

> @lehuyduc I'm reading your 1brc_final_valid.cpp and this is cool stuff. I would say [Samuel L. Jackson from Pulp Fiction]-level cool. And it's quite simple nevertheless, but some of the tricks/hacks are beautiful. Very educational!

@buybackoff I'm not done yet, I still have some tiny tricks left :D Also, the latest version is ~28% faster than the last one you tested.

buybackoff commented 9 months ago

@AlexanderYastrebov sorry, but it's so inconvenient. Not only is it in Go, so I need to download tools, it's also inside this huge Java fork. If you think it's a real contender for the top spots, then please create a separate repo with build instructions. Also, the file path should be the first argument to the executable.

AlexanderYastrebov commented 9 months ago

@buybackoff Hi, no worries. I hope I can get it into the main repo (https://github.com/gunnarmorling/1brc/pull/298) to compare against Java.

tivrfoa commented 9 months ago

> @AlexanderYastrebov sorry, but it's so inconvenient. Not only is it in Go, so I need to download tools, it's also inside this huge Java fork. If you think it's a real contender for the top spots, then please create a separate repo with build instructions. Also, the file path should be the first argument to the executable.

You just need to download this file: https://github.com/gunnarmorling/1brc/blob/2e2699216bf21f0bf3639595724aeac540a9b555/src/main/go/AlexanderYastrebov/calc.go

go build calc.go (ridiculously fast! Go compile time is amazing =))

./calc measurements.txt

I think this Go version would perform pretty well in this ranking: https://hotforknowledge.com/2024/01/13/7-1brc-in-dotnet-even-faster-than-java-cpp/

It would be really nice if @buybackoff could run your solution.

buybackoff commented 9 months ago

@lehuyduc Hi, it looks like yesterday's version is much faster on 10K, but much slower on the default data. Is that expected?

run_cpp.sh expects main.cpp, but I used 1brc_final_valid.cpp as before.

lehuyduc commented 9 months ago

Hmm, I think this might be due to a CPU difference. I tested the latest code directly on an AMD 7502P instead of the AMD 2950X I used before. Both the default and 10K results are faster on the 7502P.

What results do you get when running the latest version?

buybackoff commented 9 months ago

The default goes from 1.649 to 1.792. I tried compiling with both 6 and 12 threads.

10K goes from 3.081 to 2.863

Running only on P-cores of i5-13500.

lehuyduc commented 9 months ago

Hmm, I really don't know what causes this. It might even be due to an Intel vs AMD difference. I tested on the 2950X again, and it didn't cause any slowdown (but no speedup either, because the CPU is quite old).

I'll try looking into it later. But the lower 10K time means we can still keep it :D

lehuyduc commented 9 months ago

Oh, I found the reason. It's a typo:

  size_t end_idx0 = idx1 - 1;
  size_t end_idx1 = to_byte;
  size_t end_idx0_pre = end_idx0 - 2 * MAX_KEY_LENGTH;
  size_t end_idx1_pre = end_idx0 - 2 * MAX_KEY_LENGTH; // typo: should be end_idx1

  handle_line_packed(data + idx0, data + end_idx0_pre, hmaps[tid], idx0);
  handle_line_packed(data + idx1, data + end_idx1_pre, hmaps[tid], idx1);  

I'll test a few more ideas and push again later this weekend. Thanks for helping me notice the problem!
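
For readers following along, here is a hedged sketch of the corrected boundary computation from the snippet above. The `SplitBounds` struct, the `split_bounds` function, and the value of `MAX_KEY_LENGTH` are assumptions made for illustration; only the four index formulas come from the snippet.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value for illustration; the real constant is defined in the repo.
constexpr std::size_t MAX_KEY_LENGTH = 100;

// Hypothetical container for the four boundaries shown in the snippet.
struct SplitBounds {
    std::size_t end_idx0;      // last byte of the first half
    std::size_t end_idx1;      // end of the second half
    std::size_t end_idx0_pre;  // where the first half's main loop stops
    std::size_t end_idx1_pre;  // where the second half's main loop stops
};

// idx1 is where the second half starts, to_byte is where the chunk ends.
SplitBounds split_bounds(std::size_t idx1, std::size_t to_byte) {
    // Guard against unsigned underflow in the subtractions below.
    assert(idx1 > 2 * MAX_KEY_LENGTH);
    assert(to_byte >= idx1);

    SplitBounds b{};
    b.end_idx0 = idx1 - 1;
    b.end_idx1 = to_byte;
    b.end_idx0_pre = b.end_idx0 - 2 * MAX_KEY_LENGTH;
    // The fix: derive this from end_idx1, not end_idx0.
    b.end_idx1_pre = b.end_idx1 - 2 * MAX_KEY_LENGTH;
    return b;
}
```

With the typo, `end_idx1_pre` would equal `end_idx0_pre`, which lies before `idx1`. One plausible reading is that the second call then had an empty range for its main loop, which would fit the slowdown reported on the default data, though the thread does not spell this out.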