Open tibbe opened 10 years ago
In an informal experiment, I saw quite a performance improvement by just refraining from inlining `two`. Specifically, I renamed the current function to `twoBase` and then wrote monadic and pure versions calling it, each of which was `NOINLINE`. I bet you're right about those larger functions as well: pretty much any thaw-modify-unsafeFreeze would be a good candidate, especially if it doesn't have to call a user function or use a class instance in the process. And I know you're wary of `INLINABLE`, but I have the feeling we could probably afford to use it a bit more.
As I've mentioned before, I also want to experiment with getting rid of the `Full` constructor.
Hrmph.... Simply removing `Full` gives horrible performance. Oh well.
> In an informal experiment, I saw quite a performance improvement by just refraining from inlining `two`. Specifically, I renamed the current function to `twoBase` and then wrote monadic and pure versions calling it, each of which was `NOINLINE`.
This still seems worth implementing, although I don't currently understand why both a monadic and a pure version are required.
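The arrangement described in the quote could look roughly like the following. This is only a sketch under stated assumptions: it uses `STArray` from the `array` package instead of the library's own array type, and the names `newPair`, `twoM` and `twoPure` are hypothetical (only `two`, `twoBase` and `NOINLINE` come from the thread). One possible reading of the two wrappers is that some callers already run in `ST` while others want a plain value, but the thread does not confirm this.

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array (Array, elems)
import Data.Array.ST (STArray, newArray, writeArray)
import Data.Array.Unsafe (unsafeFreeze)

-- Hypothetical stand-in for the library's array helper: allocate a
-- two-element mutable array with both slots set to the first element.
newPair :: a -> ST s (STArray s Int a)
newPair x = newArray (0, 1) x

-- The shared body (the renamed original). It carries no pragma here, so
-- GHC is free to inline it into the two wrappers below.
twoBase :: a -> a -> ST s (Array Int a)
twoBase x y = do
  marr <- newPair x
  writeArray marr 1 y
  unsafeFreeze marr   -- safe only because marr is never touched again

-- Monadic wrapper (hypothetical name): call sites get a call, not a copy.
twoM :: a -> a -> ST s (Array Int a)
twoM = twoBase
{-# NOINLINE twoM #-}

-- Pure wrapper (hypothetical name), also kept out of line.
twoPure :: a -> a -> Array Int a
twoPure x y = runST (twoBase x y)
{-# NOINLINE twoPure #-}

main :: IO ()
main = do
  print (elems (twoPure 'x' 'y'))
  print (elems (runST (twoM (1 :: Int) 2)))
```

Whether this helps in practice would need to be measured with the package's benchmarks; the sketch only shows the shape of the change.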
The code size of `two` itself could be substantially reduced by tweaking the computation of `idx2`. Currently it involves a branch where the two alternatives differ only by the array index of the write:
case <#
(word2Int# (and# (uncheckedShiftRL# ww1 ww) 31##))
(word2Int# (and# (uncheckedShiftRL# ww2 ww) 31##))
of {
__DEFAULT ->
case writeSmallArray# @s @(HashMap k v) ipv1 0# w2 ipv of s'
{ __DEFAULT ->
case unsafeFreezeSmallArray# @s @(HashMap k v) ipv1 s' of
{ (# ipv2, ipv3 #) ->
(# ipv2, BitmapIndexed @k @v (or# x y) ipv3 #)
}
};
1# ->
case writeSmallArray# @s @(HashMap k v) ipv1 1# w2 ipv of s'
{ __DEFAULT ->
case unsafeFreezeSmallArray# @s @(HashMap k v) ipv1 s' of
{ (# ipv2, ipv3 #) ->
(# ipv2, BitmapIndexed @k @v (or# x y) ipv3 #)
}
}
This branch could probably be avoided entirely if we re-used the comparison result as the index, maybe with something like:
idx2 = fromEnum (index h1 s < index h2 s)
(The comparison itself could probably be optimized a tiny bit too. We don't actually need the index: just clearing the higher bits should be sufficient.)
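To make the suggestion concrete, here is a small self-contained sketch. The mask of 31 (5 bits per level) is taken from the `and# ... 31##` in the Core above; the type synonyms, `bitsPerLevel` and `idx2Masked` are assumptions for illustration. `idx2Masked` is one possible reading of the parenthetical remark, and it only agrees with `idx2` when both hashes share all bits below the current shift, which is assumed here rather than stated in the thread. Whether GHC compiles `fromEnum` on a `Bool` without a branch would have to be checked in the generated code.

```haskell
import Data.Bits (shiftL, shiftR, (.&.))

type Hash = Word
type Shift = Int

-- Assumed from the Core dump: 5 bits per level, hence the mask 31.
bitsPerLevel :: Int
bitsPerLevel = 5

-- Position of a hash within the node at the given shift.
index :: Hash -> Shift -> Int
index h s = fromIntegral ((h `shiftR` s) .&. 31)

-- The comparison result doubles as the slot for the second element:
-- False -> 0, True -> 1.
idx2 :: Hash -> Hash -> Shift -> Int
idx2 h1 h2 s = fromEnum (index h1 s < index h2 s)

-- Hypothetical variant: skip the shift and clear only the bits above the
-- current level. Valid only if h1 and h2 agree on every bit below s.
idx2Masked :: Hash -> Hash -> Shift -> Int
idx2Masked h1 h2 s = fromEnum (keepLow h1 < keepLow h2)
  where
    keepLow h = h .&. ((1 `shiftL` (s + bitsPerLevel)) - 1)

main :: IO ()
main = do
  -- 0x23 and 0x43 share their lowest five bits and differ at shift 5.
  print (idx2 0x23 0x43 5, idx2 0x43 0x23 5)
  print (idx2Masked 0x23 0x43 5, idx2Masked 0x43 0x23 5)
```

With the slot computed this way, a single write at `idx2` would replace the two nearly identical alternatives.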
Currently the `insert` function generates ~600 lines of Cmm with GHC HEAD. The breakdown for each constructor is roughly:
plus 33 (5.5%) shared lines. There are also 67 non-code lines.
Code related to collisions could probably be pulled out-of-line without a performance hit. We might also be able to avoid inlining larger functions, such as `Array.insert`, without a performance penalty.
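As an illustration of pulling the collision path out-of-line, here is a toy sketch. It is not the library's `insert`: `Bucket`, `insertBucket` and `updateCollision` are hypothetical names, and the structure models a single hash bucket only. The point is just the split between a small, inlinable fast path and a rarely taken helper marked `NOINLINE`.

```haskell
-- Toy model of one hash bucket (hypothetical, not the library's type).
data Bucket k v
  = Empty
  | Leaf k v
  | Collision [(k, v)]   -- rare case: several keys share the same hash
  deriving (Eq, Show)

-- Fast path: small enough that inlining or specialising it stays cheap.
insertBucket :: Eq k => k -> v -> Bucket k v -> Bucket k v
insertBucket k v Empty = Leaf k v
insertBucket k v (Leaf k' v')
  | k == k'   = Leaf k v
  | otherwise = Collision [(k, v), (k', v')]
insertBucket k v (Collision kvs) = Collision (updateCollision k v kvs)
{-# INLINABLE insertBucket #-}

-- Slow path: kept out of line so its code is not copied into callers.
updateCollision :: Eq k => k -> v -> [(k, v)] -> [(k, v)]
updateCollision k v [] = [(k, v)]
updateCollision k v ((k', v') : rest)
  | k == k'   = (k, v) : rest
  | otherwise = (k', v') : updateCollision k v rest
{-# NOINLINE updateCollision #-}

main :: IO ()
main = do
  let b1 = insertBucket (1 :: Int) 'a' Empty
      b2 = insertBucket 2 'b' b1
      b3 = insertBucket 1 'c' b2
  mapM_ print [b1, b2, b3]
```

Whether the real collision code can be moved this way without a slowdown is exactly the open question in the comment above, so any such change would need benchmarking.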