buybackoff opened this issue 10 months ago
Thanks! Can you run it again with N_THREADS set to the number of threads (cores * 2) on the test machine? Running 128 threads on a 12-thread PC is slower than running just 12.
I've updated the repo so that run_cpp.sh automatically finds the number of threads to use and compiles with that number as a constexpr. Could you pull and run it again? Thanks!
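A minimal sketch of the idea, assuming the script passes something like -DN_THREADS=$(nproc) to the compiler (the fallback value below is only illustrative, not necessarily what run_cpp.sh actually does):

```cpp
#include <cstdio>
#include <thread>

// Fallback only for illustration; the build script would normally pass
// the value on the compiler command line, e.g. -DN_THREADS=$(nproc).
#ifndef N_THREADS
#define N_THREADS 8
#endif

constexpr int n_threads = N_THREADS;  // thread count known at compile time

int main() {
    // Compare the baked-in constant with what the OS reports at runtime.
    std::printf("compiled for %d threads, hardware reports %u\n",
                n_threads, std::thread::hardware_concurrency());
    return 0;
}
```

Baking the count in as a constexpr (rather than querying std::thread::hardware_concurrency() at runtime) lets the compiler size per-thread arrays and specialize loops for the target machine.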
On my machine I get
$ chmod +x run_cpp.sh
$ ./run_cpp.sh
Using 8 threads
init mmap file cost = 0.041405ms
Parallel process file cost = 18528.8ms
Aggregate stats cost = 1.22172ms
Output stats cost = 1.90048ms
Runtime inside main = 18532ms
Time to munmap = 263.052ms
real 0m18,809s
user 0m30,831s
sys 0m7,561s
867e3fb8c93a52eddb94bd99b8b87c1f355bd225a58249eba97ce4eb87eb58cc result.txt
and for my Go version https://github.com/AlexanderYastrebov/1brc/tree/go-implementation/src/main/go
$ (cd src/main/go/ && go build -o /tmp/1brc-go && time /tmp/1brc-go ../../../measurements-109.txt) | sha256sum
real 0m20,917s
user 0m45,232s
sys 0m7,821s
867e3fb8c93a52eddb94bd99b8b87c1f355bd225a58249eba97ce4eb87eb58cc -
@buybackoff would it be possible to add a record for the Go implementation to your chart?
I proposed supporting Docker builds in https://github.com/gunnarmorling/1brc/discussions/253, but that is out of scope for now per https://github.com/gunnarmorling/1brc/pull/182#discussion_r1445841593
That looks unusually slow o.O I get 13.5-14.5 with 1 thread on two separate test PCs.
I think PC specs should be included any time a benchmark number is posted, along with any related inputs.
@lehuyduc Sure, that's why I would like to know how it stands compared to your version, C#, and the best Java. PS: I posted my specs in https://github.com/gunnarmorling/1brc/discussions/67
@AlexanderYastrebov Will do eventually, not today/tomorrow.
@lehuyduc I'm reading your 1brc_final_valid.cpp and this is cool stuff. I would say [Samuel L. Jackson from Pulp Fiction]-level cool. It's quite simple nevertheless, but some of the tricks/hacks are beautiful. Very educational!
@buybackoff I'm not done yet, there are still some tiny tricks left :D Also, the latest version is ~28% faster than the last one you tested.
@AlexanderYastrebov Sorry, but it's so inconvenient. Not only is it in Go, so I need to download tooling, it's also inside this huge Java fork. If you think it's a real contender for the top spots, then please create a separate repo with build instructions. Also, the file path should be the first argument to the executable.
@buybackoff Hi, no worries. I hope I can get it into the main repo https://github.com/gunnarmorling/1brc/pull/298 to compare against Java.
You just need to download this file: https://github.com/gunnarmorling/1brc/blob/2e2699216bf21f0bf3639595724aeac540a9b555/src/main/go/AlexanderYastrebov/calc.go
go build calc.go
(ridiculously fast! Go compile time is amazing =))
./calc measurements.txt
I think this Go version would perform pretty well in this ranking:
https://hotforknowledge.com/2024/01/13/7-1brc-in-dotnet-even-faster-than-java-cpp/
It would be really nice if @buybackoff could run your solution.
@lehuyduc Hi, it looks like your version from yesterday is much faster on 10K but much slower on the default data. Is that expected?
run_cpp.sh expects main.cpp, but I used 1brc_final_valid.cpp as before.
Hmm, I think this might be due to a CPU difference. I tested the latest code directly on an AMD 7502P instead of the AMD 2950X like before. Both the default and 10K results are faster on the 7502P.
What results do you get when running the latest version?
The default goes from 1.649 to 1.792; I tried compiling for both 6 and 12 threads.
10K goes from 3.081 to 2.863.
Running only on the P-cores of an i5-13500.
Hmm, I really don't know what causes this. It might even be due to an Intel vs AMD difference. I tested on the 2950X again and there was no slowdown (but no speedup either, because the CPU is quite old).
I'll try looking into it later. But the lower 10K time means we can still keep it :D
Oh, I found the reason. It's a typo:
size_t end_idx0 = idx1 - 1;
size_t end_idx1 = to_byte;
size_t end_idx0_pre = end_idx0 - 2 * MAX_KEY_LENGTH;
size_t end_idx1_pre = end_idx0 - 2 * MAX_KEY_LENGTH; // typo: should use end_idx1

handle_line_packed(data + idx0, data + end_idx0_pre, hmaps[tid], idx0);
handle_line_packed(data + idx1, data + end_idx1_pre, hmaps[tid], idx1);
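With the typo fixed, that line would read:

```cpp
size_t end_idx1_pre = end_idx1 - 2 * MAX_KEY_LENGTH;  // derive from end_idx1, not end_idx0
```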
I'll test a few more ideas and push again later this weekend. Thanks for helping me notice the problem!
So far the top result overall. Great job!
https://github.com/buybackoff/1brc?tab=readme-ov-file#native