haskell / attoparsec

A fast Haskell library for parsing ByteStrings
http://hackage.haskell.org/package/attoparsec
514 stars · 93 forks

Inlining `memberWord8` and `memberWord8.search` functions #197

Closed KovalevDima closed 2 years ago

KovalevDima commented 2 years ago

I speak English poorly, so I'm sorry if there are serious grammatical errors.

I was working on an application in which parsing was the most expensive part, and during profiling I found memory and CPU leaks in calls to the `memberWord8` and `memberWord8.search` functions.

You can see the huge number of entries for these two functions, each with a high individual memory-allocation and CPU-time percentage.

...
    total time  =        2.00 secs   (8015 ticks @ 1000 us, 4 processors)
    total alloc = 9,213,435,328 bytes  (excludes profiling overheads)
...
                                                                                                                                                                                                individual      inherited
COST CENTRE                         MODULE                                SRC                                                                                              no.       entries  %time %alloc   %time %alloc
...
     mkParser                       App.Env.Parser                        src/App/Env/Parser.hs:(21,1)-(51,73)                                                             70305           0    6.3    0.0    86.2   81.3
      memberWord8                   Data.Attoparsec.ByteString.FastSet    Data/Attoparsec/ByteString/FastSet.hs:(76,1)-(87,32)                                             70326    62988619   12.4    0.0    30.0   26.5
       memberWord8.search           Data.Attoparsec.ByteString.FastSet    Data/Attoparsec/ByteString/FastSet.hs:(80,11)-(87,32)                                            70334   123937238   17.4   26.5    17.6   26.5
        memberWord8.search.mid      Data.Attoparsec.ByteString.FastSet    Data/Attoparsec/ByteString/FastSet.hs:83:23-46                                                   70335    62868619    0.3    0.0     0.3    0.0
       memberWord8.(...)            Data.Attoparsec.ByteString.FastSet    Data/Attoparsec/ByteString/FastSet.hs:77:9-43                                                    70356      120000    0.0    0.0     0.0    0.0

...

After inlining both of them, I got a 25% performance increase, as you can see from the total time and total alloc figures. I reproduced the same conditions for both application runs.
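The change in question is just adding `INLINE` pragmas to these functions. As a rough, self-contained illustration (this is not the library code — the real `memberWord8` in `Data.Attoparsec.ByteString.FastSet` searches a sorted table inside the set), a binary-search membership test with the pragma applied might look like:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hedged sketch: a simplified stand-in for FastSet's memberWord8,
-- a binary search over a sorted ByteString of bytes. The INLINE
-- pragma is the change proposed in this issue.
import Data.Word (Word8)
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

memberWord8 :: Word8 -> ByteString -> Bool
memberWord8 w bs = search 0 (B.length bs - 1)
  where
    search !lo !hi
      | hi < lo   = False
      | otherwise =
          let !mid = (lo + hi) `div` 2
          in case compare w (B.index bs mid) of
               LT -> search lo (mid - 1)
               GT -> search (mid + 1) hi
               EQ -> True
{-# INLINE memberWord8 #-}

main :: IO ()
main = do
  let set = B.pack [1, 3, 5, 9]   -- table must be sorted
  print (memberWord8 5 set)       -- True
  print (memberWord8 4 set)       -- False
```

With the pragma, GHC can inline the search at each call site and specialise it to the concrete set, which is where the allocation savings in the profile above would come from.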

...
    total time  =        1.53 secs   (6112 ticks @ 1000 us, 4 processors)
    total alloc = 6,767,789,280 bytes  (excludes profiling overheads)
...
                                                                                                                                                                                                individual      inherited
COST CENTRE                         MODULE                                SRC                                                                                              no.       entries  %time %alloc   %time %alloc
...
     mkParser                       App.Env.Parser                        src/App/Env/Parser.hs:(21,1)-(51,73)                                                             70315           0    9.7    0.0    82.5   74.5
...

I have no experience working with cabal, and there is no documentation on how to build the benchmarks for this project, so I hope you can check whether your benchmarks show any benefit from the INLINE versions of these functions.

I'm open to any discussion and would be glad to answer any questions.

KovalevDima commented 2 years ago

I ran the benchmarks with fix #199 and didn't see anything interesting. I think that's predictable, since my case was pretty selective.

My app parses many long lines of logs, calling the `inClass` and `notInClass` functions a lot. They come from the `Data.Attoparsec.ByteString` module, which re-exports them from `Data.Attoparsec.ByteString.Internal`, and they use `memberWord8` under the hood.
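To show how those calls reach `memberWord8`, here is a hedged, self-contained sketch of how a class string like the ones passed to `inClass`/`notInClass` expands into a byte set that membership queries run against. It is simplified (e.g. it does not handle a leading literal `-` the way the real `inClass` does) and uses `Data.Set` instead of the library's `FastSet`:

```haskell
-- Hedged sketch, not attoparsec's implementation.
import Data.Word (Word8)
import Data.Char (ord)
import qualified Data.Set as S

-- Expand a class string with ranges, e.g. "a-z0-9_"
-- (the same surface syntax inClass accepts).
charClass :: String -> S.Set Word8
charClass = S.fromList . go
  where
    go (a : '-' : b : rest) = map fromIntegral [ord a .. ord b] ++ go rest
    go (c : rest)           = fromIntegral (ord c) : go rest
    go []                   = []

-- Partially applying to the class string builds the set once,
-- so per-byte calls are just membership tests.
inClass' :: String -> Word8 -> Bool
inClass' s = (`S.member` charClass s)

main :: IO ()
main = print (map (inClass' "a-z_") [fromIntegral (ord c) | c <- "ab_9"])
```

Every byte of every log line goes through a membership test like this, which is why a per-call overhead in `memberWord8` multiplies into the entry counts seen in the profile.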

Benchmark results (left: before inlining; right: after inlining):

All
  many
    attoparsec
      B:      4.67 μs ± 352 ns   4.71 μs ± 334 ns
      BL:     29.1 μs ± 2.7 μs   29.1 μs ± 2.1 μs
      T:      12.9 μs ± 817 ns   13.0 μs ± 823 ns
      TL:     39.8 μs ± 2.9 μs   39.6 μs ± 2.9 μs
    parsec
      S:      100  μs ± 6.3 μs   106  μs ± 6.6 μs
      B:      110  μs ± 7.0 μs   112  μs ± 5.6 μs
      BL:     117  μs ± 6.3 μs   122  μs ± 5.6 μs
      T:      112  μs ±  11 μs   122  μs ± 2.8 μs
      TL:     125  μs ± 3.0 μs   134  μs ± 9.9 μs
  comparison
    many-vs-takeWhile
      many:      4.82 μs ± 333 ns    4.83 μs ± 372 ns
      takeWhile: 1.22 μs ± 103 ns    1.36 μs ±  84 ns
    letter-vs-isAlpha
      letter:    6.39 μs ± 343 ns    6.32 μs ± 350 ns
      isAlpha:   4.84 μs ± 340 ns    5.11 μs ± 367 ns
  takeWhile
    isAlpha:            20.5 μs ± 1.5 μs    20.9 μs ± 1.4 μs
    isAlpha_ascii:      12.2 μs ± 688 ns    12.2 μs ± 672 ns
    isAlpha_iso8859_15: 14.2 μs ± 717 ns    14.5 μs ± 1.4 μs
    T isAlpha:          8.38 μs ± 708 ns    7.86 μs ± 712 ns
    TL isAlpha:         30.9 μs ± 2.8 μs    28.6 μs ± 2.8 μs
  takeWhile1
    isAlpha:       22.1 μs ± 1.4 μs   21.9 μs ± 1.6 μs
    isAlpha_ascii: 13.1 μs ± 809 ns   12.9 μs ± 759 ns
    T isAlpha:     7.25 μs ± 673 ns   8.10 μs ± 728 ns
    TL isAlpha:    30.2 μs ± 2.9 μs   28.3 μs ± 2.4 μs
  word32LE:      21.6 ns ± 1.3 ns   21.4 ns ± 1.3 ns
  scan
    short:   44.7 ns ± 3.1 ns    41.2 ns ± 3.3 ns
    long:    864  ns ±  45 ns    861  ns ±  52 ns
  aeson
    example:    41.9 μs ± 2.7 μs   42.2 μs ± 3.2 μs
    geometry:   2.95 ms ± 218 μs   2.85 ms ± 270 μs
    integers:   208  μs ±  16 μs   211  μs ±  16 μs
    jp10:       258  μs ±  11 μs   252  μs ±  14 μs
    jp100:      1.94 ms ± 166 μs   1.84 ms ±  63 μs
    jp50:       1.02 ms ±  98 μs   964  μs ±  69 μs
    numbers:    401  μs ±  29 μs   391  μs ±  28 μs
    twitter1:   14.8 μs ± 809 ns   14.3 μs ± 1.3 μs
    twitter10:  96.7 μs ± 6.3 μs   91.9 μs ± 6.3 μs
    twitter100: 1.40 ms ±  77 μs   1.35 ms ± 115 μs
    twitter20:  193  μs ±  11 μs   186  μs ±  13 μs
    twitter50:  706  μs ±  24 μs   646  μs ±  53 μs
  genome
    bytestring
      s:  76.7 μs ± 5.5 μs   72.8 μs ± 5.6 μs
      l:  201  μs ±  14 μs   200  μs ±  15 μs
      CI
        s: 110  μs ± 6.5 μs   106  μs ± 8.8 μs
        l: 234  μs ±  22 μs   234  μs ±  11 μs
    text
      s: 73.1 μs ± 5.9 μs   72.3 μs ± 6.6 μs
      l: 201  μs ±  13 μs   197  μs ±  13 μs
      CI
        s: 188  μs ±  12 μs   185  μs ±  16 μs
        l: 313  μs ±  21 μs   314  μs ±  17 μs
  headers
    B
      request:  1.99 μs ± 180 ns   1.96 μs ± 182 ns
      warp:     801  ns ±  80 ns   807  ns ±  56 ns
      response: 3.00 μs ± 245 ns   3.03 μs ± 192 ns
    BL
      request:  9.17 μs ± 385 ns   9.17 μs ± 778 ns
      response: 12.6 μs ± 930 ns   12.7 μs ± 968 ns
    T
      request:  2.51 μs ± 201 ns   2.49 μs ± 237 ns
      response: 3.59 μs ± 252 ns   3.64 μs ± 199 ns
    TL
      request:  16.6 μs ± 849 ns   14.5 μs ± 728 ns
      response: 22.8 μs ± 2.2 μs   20.2 μs ± 1.6 μs
  links:  400  ns ±  24 ns   416  ns ±  29 ns
  numbers
    Text
      no power
        double:     937  ns ±  85 ns   885  ns ±  49 ns
        number:     981  ns ±  47 ns   938  ns ±  69 ns
        rational:   602  ns ±  47 ns   570  ns ±  43 ns
        scientific: 1.49 μs ±  53 ns   1.48 μs ± 114 ns
      positive power
        double:     962  ns ±  79 ns   976  ns ±  68 ns
        number:     985  ns ±  95 ns   1.02 μs ±  89 ns
        rational:   709  ns ±  45 ns   673  ns ±  49 ns
        scientific: 984  ns ±  82 ns   978  ns ±  62 ns
      negative power
        double:     1.08 μs ± 102 ns   1.09 μs ±  65 ns
        number:     1.09 μs ±  93 ns   1.14 μs ± 110 ns
        rational:   744  ns ±  61 ns   716  ns ±  45 ns
        scientific: 2.37 μs ± 198 ns   2.33 μs ± 211 ns
    ByteString
      no power
        double:     846  ns ±  60 ns   846  ns ±  46 ns
        number:     852  ns ±  70 ns   874  ns ±  80 ns
        rational:   534  ns ±  47 ns   536  ns ±  52 ns
        scientific: 1.44 μs ± 133 ns   1.43 μs ± 107 ns
      positive power
        double:     874  ns ±  55 ns   885  ns ±  44 ns
        number:     912  ns ±  59 ns   918  ns ±  87 ns
        rational:   650  ns ±  31 ns   634  ns ±  45 ns
        scientific: 982  ns ±  66 ns   942  ns ±  51 ns
      negative power
        double:     984  ns ±  60 ns   1.02 μs ±  86 ns
        number:     1.04 μs ±  57 ns   983  ns ±  93 ns
        rational:   665  ns ±  29 ns   694  ns ±  54 ns
        scientific: 2.29 μs ± 118 ns   2.33 μs ± 160 ns
  sets
    Fast:     6.67 ns ± 412 ps   6.41 ns ± 362 ps
    Hash:     21.6 ns ± 2.0 ns   13.0 ns ± 688 ps
    Int:      7.84 ns ± 692 ps   8.37 ns ± 642 ps
    TextFast: 6.19 ns ± 236 ps   6.96 ns ± 668 ps
  warp
    decimal
      warp: 25.5 ns ± 2.0 ns   26.1 ns ± 1.9 ns
      atto: 39.0 ns ± 2.9 ns   37.5 ns ± 3.2 ns
KovalevDima commented 2 years ago

I'm also leaving here the heap-visualization results from my application runs.

Inlined `memberWord8` and `memberWord8.search` ![image](https://user-images.githubusercontent.com/60290615/147984408-4f712468-bc71-4a22-9a63-c5acf71b77e5.png)
Not inlined `memberWord8` and `memberWord8.search` ![image](https://user-images.githubusercontent.com/60290615/147984367-4bb8c876-6a2b-4dff-8ea9-c5ba0af8b950.png)
Bodigrim commented 2 years ago

I wonder if one could get even better results by implementing `data FastSet = FastSet !Int64 !Int64 !Int64 !Int64` or `data FastSet = FastSet ByteArray#` as a plain bitmap. (Feel free to message me on Telegram if you wish to discuss further.)
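A hedged sketch of that suggestion, using four strict `Word64` fields as a 256-bit bitmap: membership becomes a constant-time shift-and-test instead of a binary search. This is illustrative only, not a drop-in replacement for the library's `FastSet`:

```haskell
-- Hedged sketch of a plain-bitmap set over Word8 (4 × 64 = 256 bits).
import Data.Bits (shiftL, shiftR, (.&.), (.|.), testBit)
import Data.Word (Word8, Word64)

data FastSet = FastSet !Word64 !Word64 !Word64 !Word64

emptySet :: FastSet
emptySet = FastSet 0 0 0 0

-- Select the word by the byte's top two bits, the bit by its low six.
insert :: Word8 -> FastSet -> FastSet
insert w (FastSet a b c d) =
  let m = 1 `shiftL` fromIntegral (w .&. 63)
  in case w `shiftR` 6 of
       0 -> FastSet (a .|. m) b c d
       1 -> FastSet a (b .|. m) c d
       2 -> FastSet a b (c .|. m) d
       _ -> FastSet a b c (d .|. m)

member :: Word8 -> FastSet -> Bool
member w (FastSet a b c d) =
  let i = fromIntegral (w .&. 63)
  in case w `shiftR` 6 of
       0 -> testBit a i
       1 -> testBit b i
       2 -> testBit c i
       _ -> testBit d i

main :: IO ()
main = do
  let s = foldr insert emptySet [10, 70, 200]
  print (map (`member` s) [10, 70, 200, 11])  -- [True,True,True,False]
```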

KovalevDima commented 2 years ago

Profiling code generation could be the reason for the difference in resource consumption. I looked at the performance of a plain (non-profiled) binary and didn't see any difference between the current version and the version with these two functions inlined. That's disappointing for me, since I had been assuming I could trust the information from profiling without any additional corroborating evidence.

Maybe it's still worth committing this change so that other people get better profiling information. Or maybe I'm missing some details of how profiling code generation works and it doesn't make sense to do it. I've been using `stack install --profiling` and `app-name +RTS -N4 -A64m -AL256m -h -p -L250 -RTS` without any additional profiling flags or `{-# SCC #-}` pragmas in the code. I will try to measure performance in a different way and will write about the results later.

KovalevDima commented 2 years ago

My benchmark parses 500k strings, and there is no difference between the two versions.

Without INLINE pragma:

Benchmark bench: RUNNING...
All
  attoparsec: OK (146.48s)
    20.40 s ± 697 ms

All 1 tests passed (146.48s)

With INLINE pragma:

Benchmark bench: RUNNING...   
All                           
  attoparsec: OK (147.42s)    
    20.60 s ± 1.64 s          

All 1 tests passed (147.42s)  
Benchmark bench: FINISH       
bgamari commented 2 years ago

> Profiling code generation could be the reason for the difference in resource consumption

Indeed, I strongly suspect that what you were seeing was an artifact of compiling with profiling enabled. Enabling profiling in a module greatly constrains the optimisations that GHC can employ. I would compare the Core before and after your patch to determine whether there really is anything useful done by this patch.

KovalevDima commented 2 years ago

I'm pretty sure the problem was the inability to inline the function due to the automatic insertion of cost centres. By default, the compiler assigns cost centres to all non-inlined functions. When you set the `-fno-prof-auto` flag and pass it to all libraries via

apply-ghc-options: everything

in stack, the difference in resource consumption disappears. The same thing happens when you add an INLINE pragma to the function. That's why my benchmarks show no difference between the inlined and non-inlined versions.
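As a sketch of that setup (assuming stack's standard `ghc-options` syntax; check your stack version's docs), the configuration might look like:

```yaml
# stack.yaml — hedged sketch of the configuration described above
apply-ghc-options: everything      # apply command-line --ghc-options to all deps too
ghc-options:
  "$everything": -fno-prof-auto    # no automatic cost centres anywhere
```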

KovalevDima commented 2 years ago

This PR can be closed if you don't have any reason to change the automatic insertion of cost centres. Given this situation, I think adding SCC pragmas manually is the only way to profile your application.
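For readers landing here later: a minimal sketch of that manual approach, marking only the code you actually want attributed with an `SCC` pragma so the rest of the program can still be optimised freely (the cost-centre name below is made up for illustration):

```haskell
-- Hedged sketch: manual cost-centre annotation instead of -fprof-auto.
-- Only the annotated expression gets a cost centre; everything else
-- keeps its normal (inlinable) code generation.
expensive :: [Int] -> Int
expensive xs = {-# SCC "expensive_sum" #-} sum (map (* 2) xs)

main :: IO ()
main = print (expensive [1 .. 10])  -- 110
```

Run the profiled build with `+RTS -p` as above, and only `expensive_sum` shows up as a user cost centre in the `.prof` report.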