Closed KovalevDima closed 2 years ago
I ran benchmarks with fix #199 and didn't see anything interesting. I think that's predictable, since my case was pretty selective: my app parses many long lines of logs, calling the inClass and notInClass functions from the Data.Attoparsec.ByteString module a lot. That module re-exports them from Data.Attoparsec.ByteString.Internal, which uses memberWord8 under the hood.
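For context, here is a minimal illustration of the kind of parser involved. The log format is made up, but the inClass/notInClass calls go through memberWord8 exactly as described above:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString (Parser, inClass, notInClass, parseOnly, takeWhile1)
import qualified Data.ByteString.Char8 as B

-- Hypothetical log-line parser: a level tag, spaces, then the message.
-- Every takeWhile1 call tests bytes via inClass/notInClass, which
-- consult memberWord8 internally.
logLine :: Parser (B.ByteString, B.ByteString)
logLine = do
  level <- takeWhile1 (inClass "A-Z")
  _     <- takeWhile1 (inClass " ")
  msg   <- takeWhile1 (notInClass "\r\n")
  pure (level, msg)

main :: IO ()
main = print (parseOnly logLine "ERROR disk full")
-- prints: Right ("ERROR","disk full")
```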
Left = before inlining, Right = after inlining.

All
  many
    attoparsec
      B:  4.67 μs ± 352 ns    4.71 μs ± 334 ns
      BL: 29.1 μs ± 2.7 μs    29.1 μs ± 2.1 μs
      T:  12.9 μs ± 817 ns    13.0 μs ± 823 ns
      TL: 39.8 μs ± 2.9 μs    39.6 μs ± 2.9 μs
    parsec
      S:  100 μs ± 6.3 μs     106 μs ± 6.6 μs
      B:  110 μs ± 7.0 μs     112 μs ± 5.6 μs
      BL: 117 μs ± 6.3 μs     122 μs ± 5.6 μs
      T:  112 μs ± 11 μs      122 μs ± 2.8 μs
      TL: 125 μs ± 3.0 μs     134 μs ± 9.9 μs
  comparison
    many-vs-takeWhile
      many:      4.82 μs ± 333 ns   4.83 μs ± 372 ns
      takeWhile: 1.22 μs ± 103 ns   1.36 μs ± 84 ns
    letter-vs-isAlpha
      letter:  6.39 μs ± 343 ns   6.32 μs ± 350 ns
      isAlpha: 4.84 μs ± 340 ns   5.11 μs ± 367 ns
  takeWhile
    isAlpha:            20.5 μs ± 1.5 μs   20.9 μs ± 1.4 μs
    isAlpha_ascii:      12.2 μs ± 688 ns   12.2 μs ± 672 ns
    isAlpha_iso8859_15: 14.2 μs ± 717 ns   14.5 μs ± 1.4 μs
    T isAlpha:          8.38 μs ± 708 ns   7.86 μs ± 712 ns
    TL isAlpha:         30.9 μs ± 2.8 μs   28.6 μs ± 2.8 μs
  takeWhile1
    isAlpha:       22.1 μs ± 1.4 μs   21.9 μs ± 1.6 μs
    isAlpha_ascii: 13.1 μs ± 809 ns   12.9 μs ± 759 ns
    T isAlpha:     7.25 μs ± 673 ns   8.10 μs ± 728 ns
    TL isAlpha:    30.2 μs ± 2.9 μs   28.3 μs ± 2.4 μs
  word32LE: 21.6 ns ± 1.3 ns   21.4 ns ± 1.3 ns
  scan
    short: 44.7 ns ± 3.1 ns   41.2 ns ± 3.3 ns
    long:  864 ns ± 45 ns     861 ns ± 52 ns
  aeson
    example:    41.9 μs ± 2.7 μs   42.2 μs ± 3.2 μs
    geometry:   2.95 ms ± 218 μs   2.85 ms ± 270 μs
    integers:   208 μs ± 16 μs     211 μs ± 16 μs
    jp10:       258 μs ± 11 μs     252 μs ± 14 μs
    jp100:      1.94 ms ± 166 μs   1.84 ms ± 63 μs
    jp50:       1.02 ms ± 98 μs    964 μs ± 69 μs
    numbers:    401 μs ± 29 μs     391 μs ± 28 μs
    twitter1:   14.8 μs ± 809 ns   14.3 μs ± 1.3 μs
    twitter10:  96.7 μs ± 6.3 μs   91.9 μs ± 6.3 μs
    twitter100: 1.40 ms ± 77 μs    1.35 ms ± 115 μs
    twitter20:  193 μs ± 11 μs     186 μs ± 13 μs
    twitter50:  706 μs ± 24 μs     646 μs ± 53 μs
  genome
    bytestring
      s: 76.7 μs ± 5.5 μs   72.8 μs ± 5.6 μs
      l: 201 μs ± 14 μs     200 μs ± 15 μs
      CI
        s: 110 μs ± 6.5 μs   106 μs ± 8.8 μs
        l: 234 μs ± 22 μs    234 μs ± 11 μs
    text
      s: 73.1 μs ± 5.9 μs   72.3 μs ± 6.6 μs
      l: 201 μs ± 13 μs     197 μs ± 13 μs
      CI
        s: 188 μs ± 12 μs   185 μs ± 16 μs
        l: 313 μs ± 21 μs   314 μs ± 17 μs
  headers
    B
      request:  1.99 μs ± 180 ns   1.96 μs ± 182 ns
      warp:     801 ns ± 80 ns     807 ns ± 56 ns
      response: 3.00 μs ± 245 ns   3.03 μs ± 192 ns
    BL
      request:  9.17 μs ± 385 ns   9.17 μs ± 778 ns
      response: 12.6 μs ± 930 ns   12.7 μs ± 968 ns
    T
      request:  2.51 μs ± 201 ns   2.49 μs ± 237 ns
      response: 3.59 μs ± 252 ns   3.64 μs ± 199 ns
    TL
      request:  16.6 μs ± 849 ns   14.5 μs ± 728 ns
      response: 22.8 μs ± 2.2 μs   20.2 μs ± 1.6 μs
  links: 400 ns ± 24 ns   416 ns ± 29 ns
  numbers
    Text
      no power
        double:     937 ns ± 85 ns    885 ns ± 49 ns
        number:     981 ns ± 47 ns    938 ns ± 69 ns
        rational:   602 ns ± 47 ns    570 ns ± 43 ns
        scientific: 1.49 μs ± 53 ns   1.48 μs ± 114 ns
      positive power
        double:     962 ns ± 79 ns    976 ns ± 68 ns
        number:     985 ns ± 95 ns    1.02 μs ± 89 ns
        rational:   709 ns ± 45 ns    673 ns ± 49 ns
        scientific: 984 ns ± 82 ns    978 ns ± 62 ns
      negative power
        double:     1.08 μs ± 102 ns  1.09 μs ± 65 ns
        number:     1.09 μs ± 93 ns   1.14 μs ± 110 ns
        rational:   744 ns ± 61 ns    716 ns ± 45 ns
        scientific: 2.37 μs ± 198 ns  2.33 μs ± 211 ns
    ByteString
      no power
        double:     846 ns ± 60 ns    846 ns ± 46 ns
        number:     852 ns ± 70 ns    874 ns ± 80 ns
        rational:   534 ns ± 47 ns    536 ns ± 52 ns
        scientific: 1.44 μs ± 133 ns  1.43 μs ± 107 ns
      positive power
        double:     874 ns ± 55 ns    885 ns ± 44 ns
        number:     912 ns ± 59 ns    918 ns ± 87 ns
        rational:   650 ns ± 31 ns    634 ns ± 45 ns
        scientific: 982 ns ± 66 ns    942 ns ± 51 ns
      negative power
        double:     984 ns ± 60 ns    1.02 μs ± 86 ns
        number:     1.04 μs ± 57 ns   983 ns ± 93 ns
        rational:   665 ns ± 29 ns    694 ns ± 54 ns
        scientific: 2.29 μs ± 118 ns  2.33 μs ± 160 ns
  sets
    Fast:     6.67 ns ± 412 ps   6.41 ns ± 362 ps
    Hash:     21.6 ns ± 2.0 ns   13.0 ns ± 688 ps
    Int:      7.84 ns ± 692 ps   8.37 ns ± 642 ps
    TextFast: 6.19 ns ± 236 ps   6.96 ns ± 668 ps
  warp
    decimal
      warp: 25.5 ns ± 2.0 ns   26.1 ns ± 1.9 ns
      atto: 39.0 ns ± 2.9 ns   37.5 ns ± 3.2 ns
I'm also attaching heap-visualization results from runs of my application.
I wonder if one could get even better results by implementing data FastSet = FastSet !Int64 !Int64 !Int64 !Int64 or data FastSet = FastSet ByteArray# as a plain bitmap. (Feel free to message me on Telegram if you wish to discuss further.)
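The bitmap idea can be sketched as follows. This is a hypothetical illustration, not attoparsec's actual implementation; it uses Word64 rather than Int64 to keep the bit twiddling simple, and a plain data type rather than ByteArray#:

```haskell
import Data.Bits (setBit, testBit)
import Data.Word (Word8, Word64)

-- Hypothetical bitmap set: 4 x 64 bits = one bit per possible Word8 value.
data FastSet = FastSet !Word64 !Word64 !Word64 !Word64

emptySet :: FastSet
emptySet = FastSet 0 0 0 0

insertWord8 :: Word8 -> FastSet -> FastSet
insertWord8 w (FastSet a b c d) =
  case w `div` 64 of
    0 -> FastSet (setBit a i) b c d
    1 -> FastSet a (setBit b i) c d
    2 -> FastSet a b (setBit c i) d
    _ -> FastSet a b c (setBit d i)
  where i = fromIntegral (w `mod` 64)

-- Membership is one branch plus one bit test, with no pointer chasing.
memberWord8 :: Word8 -> FastSet -> Bool
memberWord8 w (FastSet a b c d) =
  case w `div` 64 of
    0 -> testBit a i
    1 -> testBit b i
    2 -> testBit c i
    _ -> testBit d i
  where i = fromIntegral (w `mod` 64)
```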
The resource-consumption difference could be an artifact of profiling code generation. I looked at the performance of the plain (non-profiled) binary and didn't see any difference between the current version and the version with these two functions inlined. That's disappointing for me, since I had assumed I could trust the profiling output without knowing any additional complicated facts. Maybe it's still worth committing this change so that other people get better profiling information; or maybe I'm missing some details of how profiling code generation works and it makes no sense to do so. I've been using stack install --profiling and app-name +RTS -N4 -A64m -AL256m -h -p -L250 -RTS, without any additional profiling flags or {-# SCC #-} pragmas in the code. I will try to measure performance in a different way and write about the results later.
My benchmark parses 500k strings, and there is no difference between the two versions.
Without the INLINE pragma:
Benchmark bench: RUNNING...
All
  attoparsec: OK (146.48s)
    20.40 s ± 697 ms
All 1 tests passed (146.48s)
With the INLINE pragma:
Benchmark bench: RUNNING...
All
  attoparsec: OK (147.42s)
    20.60 s ± 1.64 s
All 1 tests passed (147.42s)
Benchmark bench: FINISH
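For reference, a minimal tasty-bench harness of the shape that produces output like the above. The workload below is a stand-in for my real parser, purely to show the structure:

```haskell
import Test.Tasty.Bench (bench, bgroup, defaultMain, nf)

-- Stand-in workload: count the uppercase letters in each line.
classify :: String -> Int
classify = length . filter (`elem` ['A' .. 'Z'])

main :: IO ()
main = defaultMain
  [ bgroup "All"
      [ bench "attoparsec" $
          nf (map classify) (replicate 500000 "ERROR disk full")
      ]
  ]
```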
> The resource-consumption difference could be an artifact of profiling code generation.
Indeed, I strongly suspect that what you were seeing was an artifact of compiling with profiling enabled. Enabling profiling in a module greatly constrains the optimisations that GHC can employ. I would compare the Core before and after your patch to determine whether there really is anything useful done by this patch.
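For reference, one standard way to inspect the Core GHC generates (the module path here is illustrative):

```shell
# Dump simplified Core for one module; -dsuppress-all trims the noise,
# -fforce-recomp makes GHC recompile even if the object file is fresh.
ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all \
    Data/Attoparsec/ByteString/FastSet.hs > fastset-core.txt

# Or, in a cabal project, write per-module Core dumps to files:
cabal build --ghc-options="-ddump-simpl -dsuppress-all -ddump-to-file"
```

Diffing the two dumps before and after the patch shows whether the INLINE pragma actually changes the optimised code.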
I'm pretty sure the problem was the inability to inline the function due to the automatic insertion of cost centres: by default, the compiler attaches a cost centre to every function that is not inlined. When you set the -fno-prof-auto flag and pass it to all libraries via apply-ghc-options: everything in stack, the resource-consumption difference disappears. The same thing happens when you add an INLINE pragma to the function. That's why my benchmarks show no difference between the inlined and non-inlined versions. This PR can be closed if you have no reason to change the automatic insertion of cost centres. Because of this situation, I think adding SCC pragmas manually is the only way to profile your application.
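Manually placed cost centres look like this (a minimal, self-contained example; the module and names are illustrative, not from my application):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- A manually placed SCC survives -fno-prof-auto, so you still get a
-- profiling entry here without forcing cost centres onto every
-- non-inlined function in the program.
sumSquares :: [Int] -> Int
sumSquares xs = {-# SCC "sumSquares" #-} go 0 xs
  where
    go !acc []     = acc
    go !acc (y:ys) = go (acc + y * y) ys

main :: IO ()
main = print ({-# SCC "main_work" #-} sumSquares [1 .. 1000])
```

Compiled with profiling plus -fno-prof-auto, only these named cost centres appear in the .prof report.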
My English is not very good, so I'm sorry if there are serious grammatical errors.
I was working on an application where parsing was the most expensive part, and during profiling I found memory and CPU leaks in calls to the memberWord8 and memberWord8.search functions. You can see a huge number of entries for these two functions, with high individual memory-allocation and CPU-time percentages. After inlining both of them, I got a 25% performance increase, as you can see (total time and total alloc). I reproduced the same conditions for both application runs. I have no experience working with cabal, and there is no documentation on how to build benchmarks for this project, so I hope you can check whether your benchmarks show any benefit with INLINE versions of these functions.
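For anyone else in the same position, the generic cabal workflow for benchmark suites is roughly this (the suite name varies per project; check the .cabal file for benchmark stanzas):

```shell
# Build everything including benchmark suites.
cabal build --enable-benchmarks all

# Run all benchmark suites declared in the .cabal file.
cabal bench
```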
I'm open to any discussion and would be glad to answer any questions.