haskellfoundation / hs-opt-handbook.github.io

The Haskell Optimization Handbook
https://haskell.foundation/hs-opt-handbook.github.io/
Creative Commons Attribution 4.0 International
169 stars 12 forks source link

The microest optimization wrt pointer tagging #84

Open alt-romes opened 1 year ago

alt-romes commented 1 year ago

I had a thought recently about something that affects runtime performance, but possibly in such a way so minimal that it might not even be worth adding to the book.

Nonetheless...

GHC tags pointers to evaluated heap-allocated data types with their tag. So, for example, pointers to heap-allocated Maybes will be tagged with +1 if the allocated-value is Nothing and with +2 if it is Just ..., (and with +0 if it is a thunk, in which case we evaluate the thunk which will return a tagged pointer (with either +1 or +2)).

{-# LANGUAGE MagicHash #-}
import Prelude(print)
import GHC.Exts

data Maybe a = Nothing
             |  Just a

{-# NOINLINE f #-}
f x = case x of
        Nothing -> 1#
        Just _    -> 2#

main = print (I# (f (Just 4)))

Then, to pattern match on x in the body of f, we don't need to dereference the pointer to the heap and check the constructor info, for we can simply (when x is evaluated) look at its tag. The above example compiles to the following cmm code with ghc -dno-typeable-binds -fforce-recomp -ddump-opt-cmm -ddump-to-file X.hs:

f_rib_entry() { //  [R2]
      ...
      cOo: // global
          I64[Sp - 8] = block_cOf_info;
          R1 = _sO4::P64;
          Sp = Sp - 8;
          // Check if tag is 0
          if (R1 & 7 != 0) goto cOf; else goto cOg;
      cOg: // tag==0 -> evaluate thunk
          call (I64[R1])(R1) returns to cOf, args: 8, res: 8, upd: 8;
      cOf: // tag/=0 -> pattern match through constructor tag
          _sO5::P64 = R1;
          _cOl::P64 = _sO5::P64 & 7;
          // Check if tag is 1 or 2
          if (_cOl::P64 != 1) goto cOk; else goto cOj;
      cOk: // Tag is 2, con is Just, return unboxed 2
          R1 = 2;
          Sp = Sp + 8;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
      cOj: // Tag is 1, con is Nothing, return unboxed 1
          R1 = 1;
          Sp = Sp + 8;
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
}

The question is what happens when we have more than 7 constructors (the tag can only be in the last 3 bits of the pointer, and 0 means unevaluated data type). In the case of a datatype with more than 7 constructors, a tag of 7 means that the constructor index is 7 or higher, and therefore we have to dereference the pointer and look at the constructor info. The other tags from 0-6 retain their meaning.

Therefore, the most common constructors of a datatype with >7 constructors are better off being in the first 6 constructors, and the least common should come later. So if you have a datatype of which 80% of the times is constructed with some Con1, it is more performant to have it be one of the first 6.

This could possibly matter in really, REALLY, tight loops 😝, though I would love to see such a case. Or a gigantic code base in which all the useful constructors are defined last.

doyougnu commented 1 year ago

I actually experimented with this when writing the weigh chapter but I couldn't find any significant difference with an extremely small program. But I do think this effect is observable, we just need to find the right program in the wild, perhaps in pandoc or xmonad.

hsyl20 commented 1 year ago

You could experiment with GHC Core's Expr type: what happens if you move the Var constructor to the last position? :)

alt-romes commented 1 year ago

I'd love to see those benchmarks heh

doyougnu commented 1 year ago

sounds like a good case study