haskell-unordered-containers / unordered-containers

Efficient hashing-based container types
BSD 3-Clause "New" or "Revised" License
221 stars 99 forks source link

Reduce insert code size #75

Open tibbe opened 10 years ago

tibbe commented 10 years ago

Currently the insert function generated ~600 lines of Cmm with GHC HEAD. The breakdown for each constructor is roughly:

plus 33 (5.5%) shared lines. There are also 67 non-code lines.

Code related to collisions could probably be pulled out-of-line without a performance hit. We might also be able to avoid inlining larger function, such as Array.insert, without a performance penalty.

treeowl commented 5 years ago

In an informal experiment, I saw quite a performance improvement by just refraining from inlining two. Specifically, I renamed the current function twoBase and then wrote monadic and pure versions calling it, each of which were NOINLINE. I bet you're right about those larger functions as well— pretty much any thaw-modify-unsafeFreeze would be a good candidate, especially if it doesn't have to call a user function or use a class instance in the process. And I know you're wary of INLINABLE, but I have the feeling we could probably afford to use it a bit more.

treeowl commented 5 years ago

As I've mentioned before, I also want to experiment with getting rid of the Full constructor.

treeowl commented 5 years ago

Hrmph.... Simply removing Full gives horrible performance. Oh well.

sjakobi commented 2 years ago

In an informal experiment, I saw quite a performance improvement by just refraining from inlining two. Specifically, I renamed the current function twoBase and then wrote monadic and pure versions calling it, each of which were NOINLINE.

This still seems worth implementing. Although I don't currently understand why both a monadic and a pure version are required.


https://github.com/haskell-unordered-containers/unordered-containers/blob/8c20f7ad6f58dd1477bbd0a3f8bf122f6d20ad4c/Data/HashMap/Internal.hs#L969-L987

The code size of two itself could be substantially reduced by tweaking the computation of idx2. Currently it involves a branch where the two alternatives differ only by the array index of the write:

          case <#
                 (word2Int# (and# (uncheckedShiftRL# ww1 ww) 31##))
                 (word2Int# (and# (uncheckedShiftRL# ww2 ww) 31##))
          of {
            __DEFAULT ->
              case writeSmallArray# @s @(HashMap k v) ipv1 0# w2 ipv of s'
              { __DEFAULT ->
              case unsafeFreezeSmallArray# @s @(HashMap k v) ipv1 s' of
              { (# ipv2, ipv3 #) ->
              (# ipv2, BitmapIndexed @k @v (or# x y) ipv3 #)
              }
              };
            1# ->
              case writeSmallArray# @s @(HashMap k v) ipv1 1# w2 ipv of s'
              { __DEFAULT ->
              case unsafeFreezeSmallArray# @s @(HashMap k v) ipv1 s' of
              { (# ipv2, ipv3 #) ->
              (# ipv2, BitmapIndexed @k @v (or# x y) ipv3 #)
              }
              }

This branch could probably be entirely avoided if we'd re-use the comparison result as the index, maybe with something like

idx2 = fromEnum (index h1 s < index h2 s)

(The comparison itself could probably be optimized a tiny bit too: We don't actually need the index – just clearing the higher bits should be sufficient.)