nikita-volkov / stm-hamt

STM-specialised Hash Array Mapped Trie
https://hackage.haskell.org/package/stm-hamt
MIT License
9 stars 6 forks source link

onBranchElement dominating profiles. #7

Open YellowOnion opened 6 months ago

YellowOnion commented 6 months ago

I migrated some code using unordered-containers Strict HashMap. https://github.com/YellowOnion/1brc/blob/14842cd63b6f85892f4b9b315be205fa7c2b1eaf/app/Main.hs#L116

I use one map per-thread and then merge them at the end, this should blow-out my L3 (64MB) and create substantially slower code than using one shared Map.

This implementation takes 12s and uses about 450% CPU load.

Migrating my code to stm-containers, it takes 16s and consumes 950% CPU, and spends 50% of it's time inside onBranchElement.

The core loop looks like this:


stmInsertEntry :: BS.ByteString -> Entry -> HMap -> STM ()
stmInsertEntry !k !v = Map.focus (Focus.insertOrMerge (<>) v) k

calcThread :: (OutChan (Maybe BS.ByteString), Int)
           -> HMap
           -> IO ()
calcThread (oc, id) m = runInBoundThread go
  where go = do
          mbs <- readChan oc
          case mbs of
            Just bs -> do
              let (k, v) = parseEntry bs
                in atomically $ stmInsertEntry k v m
              go
            Nothing -> return ()

Here's a profile:

1brc.prof.txt

nikita-volkov commented 6 months ago

The benchmark is way too complicated to come to any conclusion: parsing, channels, bounded IO. It needs analysis as a whole since any part of it can be at fault here. Instead I recommend incrementally simplifying it until you can pinpoint the issue. Ideally it should be reproducible with something like a Map Int Int getting written to from multiple concurrent threads. There's no need for parsing, file reads or bound threads.