Open YellowOnion opened 6 months ago
The benchmark is way too complicated to come to any conclusion: parsing, channels, bounded IO. It needs analysis as a whole since any part of it can be at fault here. Instead I recommend incrementally simplifying it until you can pinpoint the issue. Ideally it should be reproducible with something like a Map Int Int
getting written to from multiple concurrent threads. There's no need for parsing, file reads or bound threads.
I migrated some code using unordered-containers Strict HashMap. https://github.com/YellowOnion/1brc/blob/14842cd63b6f85892f4b9b315be205fa7c2b1eaf/app/Main.hs#L116
I use one map per-thread and then merge them at the end, this should blow-out my L3 (64MB) and create substantially slower code than using one shared Map.
This implementation takes 12s and uses about 450% CPU load.
Migrating my code to stm-containers, it takes 16s and consumes 950% CPU, and spends 50% of it's time inside
onBranchElement
.The core loop looks like this:
Here's a profile:
1brc.prof.txt