silencespeakstruth commented 1 year ago

To whom it may concern...

Short story

An attempt to migrate our production to the LTS-20.26 (GHC 9.2.8) accidentally discovered a significant performance downfall.

Without the #127:

time                 9.187 ms   (9.138 ms .. 9.250 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.177 ms   (9.138 ms .. 9.229 ms)
std dev              127.5 μs   (81.71 μs .. 204.6 μs)

...and with it:

time                 1.052 s    (1.018 s .. 1.089 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.052 s    (1.044 s .. 1.061 s)
std dev              9.869 ms   (4.780 ms .. 13.74 ms)

Based on these (and many others, shared down below) observations, I tend to conclude that #127 performs about ~100 times slower.

Long story and more fruitful pieces of evidence

Once unwanted slowness was spotted, a drill-down analysis was carried out. Below I am going to describe every step taken and its outcome.

Step 1: `perf` profiling

perf record -F 33 -g -e cycles:u -- $EXE +RTS -g2 -RTS was used to profile the executable ($EXE) built with the -O 2 flag to ensure maximum optimization. Both profiles then were compared against each other. The following was observed for the slowed-down version of the code:

+   98,53%    98,53%  benchmark  benchmark               [.] ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc

Cleary, ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc stands for instance Ord implementation. 98,53% stands for "98,54% of all the userspace cycles were spent executing the ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc". That directed the investigation towards RTS-based flag instrumentation in order to be able to get a deeper insight into where these comparisons come from.

Step 2: `RTS` flags

$EXE +RTS -P -RTS produced a Haskell-native runtime profile, which then was visualized with a flame graph using the ghc-prof-flamegraph command line utility.

Comparing the flame graphs clearly highlighted that the parseLines :: HamletSettings -> String -> Result (Maybe NewlineStyle, HamletSettings, [(Int, Line)]) started to eat up significantly more CPU time [^1].

That pointed here (to the Parsec itself) and its changelog, where #127 popped up as a potential root cause. Reverting the #128 only (which fixed the #127) indeed rollbacks performance back to the expected measure. Trying to figure out what's happening it was decided to dump GHC intermediate representation.

Step 3: `GHC` intermediate analysis

The problematic piece of code was built with the -ddump-simpl -dsuppress-all -dno-suppress-type-signatures and those were compared. It turned out that the misperforming version generates more code of the following pattern:

(case s2_aeCm of { State ds7_aeCD ds8_aeCE ds9_aeCF ->
                  case ds8_aeCE of ww10_s1l0j
                  { SourcePos ww11_s1l0k ww12_s1l0l ww13_s1l0m ->
                  case $fOrd[]_$s$ccompare1 ww4_s1l09 ww11_s1l0k of {
                    LT -> ...;
                    EQ -> ...;
                    GT -> ...;
                  }
                  }
                  })

and the $fOrd[]_$s$ccompare1 method called about 8 times more often. Unfortunately, I lack knowledge of Haskell internals and thus am unable to analyze the intermediate representation further.

The conclusion

To my understanding, these 3 steps advocate #127 to be the root cause. Yet I fail to explain how could that happen. As of now, a hacky hotfix allows us to migrate to the GHC 9.2.8 avoiding the regression, however, I would like to open a discussion on how one could possibly fix it. The open questions are:

How much confidence do these three steps deliver? Would that be enough to consider #127 a root cause or some more steps to be taken?
What would be a better way to study the output of the -ddump-simpl -dsuppress-all -dno-suppress-type-signatures flags and can it help to make the issues even more concrete?

P.S.

At the moment the code involved in instrumentation lies under an NDA. Thus, I do not share it intentionally due to legal restrictions. I hope the observations above will be sufficient for this discussion to progress.

[^1]: The shakespeare-2.1.0 is running in the production.

phadej commented 1 year ago

instance Ord what
what is benchmark?
Have you tried to minimize it?
Have you tried other GHC versions?

ping @wz1000. Now I regret that I merged https://github.com/haskell/parsec/pull/128 without requiring a benchmark

TBH, I could undo #127 fix, as its motivation is replicating pure parser, i.e. somewhat contrived. This is more substantial. But either way, there should be a benchmark to verify that the change works (e.g. across different GHC versions etc. if it's an issue on all of them to begin with)

silencespeakstruth commented 1 year ago

@phadej, thanks for a quick reply. Following your questions:

From what perf is able to capture, there is no way to give a definite answer of which instance Ord that is. Based on how GHC Core dump looks, mergeError is the primary suspect. I believe GHC uses a deterministic algorithm for naming things, so its origin could be traced down by the parts of which ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc consists; no certainty here - I may be wrong.

There were different benchmarks. In the beginning, the slowdown was noticed for the unit and integration tests we have. There are many of these, so I narrowed the problem down to a few. For quite some time I investigated the problem based on them. Later on another - much more minimized and precise - benchmark was created, but it uses the shakespeare-2.1.0 package to call a parsec (because this is how the problem revealed itself originally). Either way yields exactly the same observations, so I omitted all the unrelated details and kept my report simple and concise. Unfortunately, both do use internal code under a NDA and I can't not share it due to legal restrictions.

In essence, every benchmark is just a call to the parseLines with some proprietary shakespeare-specific template, which is nothing but a text to be parsed and transformed later on. A typical template would be like five hundred lines of text with lots of @{url}, #{variable}, ^{fragment}, all of which are again shakespeare-specific. parseDoc is the entry point - but again, I intentionally avoided all these details so as not to distract reader's attention from the important bits.

Both GHC 9.0.2 and 9.2.8 were involved in benchmarking. The issue seems to disappear once you revert the #127 and its fix (#128). The other possible solution is to forcibly keep an old parsec package around, for example parsec-3.1.14.0.

wz1000 commented 1 year ago

A small(ish) reproducer would be good to have.

ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc is instance Ord String

λ> import GHC.Utils.Encoding 
λ> zDecodeString "ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc"
"ghc-prim_GHC.Classes_$fOrd[]_$s$c"

I haven't tested but this patch might help:

diff --git a/src/Text/Parsec/Error.hs b/src/Text/Parsec/Error.hs
index 9cf582a..ce0f0b5 100644
--- a/src/Text/Parsec/Error.hs
+++ b/src/Text/Parsec/Error.hs
@@ -142,8 +142,8 @@ setErrorMessage msg (ParseError pos msgs)
 mergeError :: ParseError -> ParseError -> ParseError
 mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
     -- prefer meaningful errors
-    | null msgs2 && not (null msgs1) = e1
-    | null msgs1 && not (null msgs2) = e2
+    | null msgs1 = e2
+    | null msgs2 = e1
     | otherwise
     = case pos1 `compare` pos2 of
         -- select the longest match

phadej commented 1 year ago

I haven't tested but this patch might help:

We need a small reproducer, I want make another mistake of accepting a performance related patch without a benchmark. Sorry.

wz1000 commented 1 year ago

I don't really rely on parsec critically for any performance related code, I put up this diff as well as the original patch to be helpful to others. I did not really need it to be merged.

For the record, the leak in #127 was affecting non-trivial ParsecT code, where a big loop lifting a monadic action into ParsecT leaked memory: loopManyTimes $ lift someMonadicAction :: ParsecT _ _ M _

parsonsmatt commented 1 year ago

Interesting! pos1 `compare` pos2 is comparing SourcePos, which is defined as:

type SourceName = String
type Line       = Int
type Column     = Int

data SourcePos  = SourcePos SourceName !Line !Column
    deriving ( Eq, Ord, Data, Typeable)

That stock-derived Ord is going to first inspect the SourceName ~ String ~ [Char]. That could trigger the Ord [a] dictionary.

phadej commented 1 year ago

Yes, I was going to try to use different comparator in mergeError, but I repeat: without a reproducer no change goes in.

ulidtko commented 1 year ago

Hi all. I'm collaborating with @silencespeakstruth on the same issue.

We're still working out a shareable reproducer; sorry for starting the discussion somewhat "prematurely".

@wz1000 I tested the patch — doesn't seem to make any difference (both msgs1 and msgs2 get forced to WHNF anyhow).

@parsonsmatt good point, I tried this:

instance Ord SourcePos where
  compare (SourcePos snA lnA colA) (SourcePos snB lnB colB)
    = lnA `compare` lnB <> colA `compare` colB <> snA `compare` snB

— it does give an improvement in allocations and a few % speedup. Looks like a sensible improvement in general.

However, we're measuring 100⨯ slowdown of parsec-3.15 relative to parsec-3.14 — in a use-case involving big-ish inputs to yesodweb/shakespeare (as an overall context; didn't see this clearly stated above).

Obviously, a couple % on the background of 100⨯ feels far from addressing the root cause — which we haven't yet pinned down, neither differentiated from a deficiency in shakespeare's parser.

Question 1. Any good reason I'm missing for the "empty ok" continuation in parserBind to concern itself with errors?

This passes all tests of parsec and shakespeare, and cuts down the calls to mergeError:

         -- empty-ok case for m
-        meok x s err
-          | errorIsUnknown err = unParser (k x) s cok cerr eok eerr
-          | otherwise =
-            let
-                -- in these cases, (k x) can return as empty
-                pcok = cok
-                peok x s err' = eok x s (mergeError err err')
-                pcerr = cerr
-                peerr err' = eerr (mergeError err err')
-            in  unParser (k x) s pcok pcerr peok peerr
+        meok x s _err = unParser (k x) s cok cerr eok eerr
+
         -- consumed-error case for m
         mcerr = cerr

Can a parser's "empty ok" ever be called with ParseError distinct from unknownError ?

Question 2. The {-# INLINE parserBind #-} pragma doesn't look good to me: https://github.com/haskell/parsec/blob/c5add8bd1da56ee11fd327409b3b46aa8015a974/src/Text/Parsec/Prim.hs#L316-L318 The function isn't "small". Perhaps INLINABLE ?

Question 3. This implementation for many raises doubts:

https://github.com/haskell/parsec/blob/c5add8bd1da56ee11fd327409b3b46aa8015a974/src/Text/Parsec/Prim.hs#L713-L716

I understand it may well predate Alternative and Applicative... why not Control.Applicative.many instead?

wz1000 commented 1 year ago

The {-# INLINE parserBind #-} pragma doesn't look good to me: The function isn't "small". Perhaps INLINABLE ?

Bind not inlining usually pessimises the demand analyser quite a bit for monadic code, so I would be wary of such a change.

Does removing the inline pragma help with your reproducer?

ulidtko commented 1 year ago

... not really :sweat_smile: With results cherry-picking, there's maybe a 20ms speedup without the {-# INLINE #-} — but that really needs more rigorous measurement with criterion (that takes recompiles). Nevermind @wz1000, it wouldn't be smart to split hairs over this.

The biggest-impact tweak I tested is the meok diff (Question 1). It brings our benchmark from 2640 ms to 380 ms. The baseline of this benchmark, with parsec-3.1.14 or #127 reverted — 25 ms.

phadej commented 1 year ago

This implementation for many raises doubts:

It's "correct". What's wrong is that Alternative ParsecT doesn't define some and many, I thought it did. I'll fix that.

The many implementation itself is fine, and being able to define specific many and somes is quite important. (reverse is suspicious, so maybe it could accumulate using a DList, but I don't think that matters in practice)

ulidtko commented 1 year ago

@phadej sure, understood. FWIW, I'm measuring no difference in runtime between this many and the blanket Alternative's one.

ulidtko commented 1 year ago

We've got an approve to share the reproducer.

Steps:

git clone --branch bench/parsec-3.15 https://github.com/zoominsoftware/shakespeare
cd shakespeare
cabal bench

On my machine, that yields:

Parsed successfully in 2556 cpu-milliseconds

As a baseline, this same benchmark without #127:

Parsed successfully in 35 cpu-milliseconds

I'm using cabal.project to take parsec from a local copy to do that, applying git revert b00129b5c18be6b2ffba3519761d90bc7ad0764c on top of parsec's master in that copy. There's a stack.yaml prepared for the same in that branch.

Local GHC version: 9.2.7

phadej commented 1 year ago

That is not a small reproducer. Please try to minimise.

In fact I was mistaken, (I'm sincerelly sorry @wz1000, I forgot I added the test-suite), there is a test for #127 in parsec, i.e. parsec-issue127 test-suite. I'd like to have something similar for this issue.

I.e. today if I run cabal run parsec-issue127 -- +RTS -s, I get something like

Right ()
     152,052,256 bytes allocated in the heap
          19,624 bytes copied during GC
          44,576 bytes maximum residency (2 sample(s))
          29,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

But if b00129b5c18be6b2ffba3519761d90bc7ad0764c is reverted, then

Right ()
     208,052,256 bytes allocated in the heap
     343,729,016 bytes copied during GC
      52,524,512 bytes maximum residency (8 sample(s))
       1,432,096 bytes maximum slop
              50 MB total memory in use (0 MB lost due to fragmentation)

i.e. a memory leak. So b00129b5c18be6b2ffba3519761d90bc7ad0764c is not going to be reverted just because there is another performance issue.

I don't want to decide which is more important. I'd like to fix both. But to do that, I need a small reproducer for this issue.

aadaa-fgtaa commented 1 year ago

Here is a simple reproducer:

import Data.Semigroup
import System.CPUTime
import Text.ParserCombinators.Parsec

main :: IO ()
main = do
  time0 <- getCPUTime
  check $ stimes 10000 "a "
  time1 <- getCPUTime
  print $ (time1 - time0) `div` 1000000000

parser :: Parser [String]
parser = many (char 'a' <|> char 'b') `sepBy` char ' '

check :: String -> IO ()
check s = putStrLn $ either show (const "") $ parse parser {- important: pass input as SourceName -} s s

Takes 20 ms with parsec 3.1.14.0 and 1180ms with parsec 3.1.16.0

ulidtko commented 1 year ago

Just to be clear, I'm not arguing to revert b00129b. The fix by @wz1000 is entirely sensible; the only reason it's mentioned at all — is this empirical case, where reverting it somehow appears to fix a 100x slowdown.

... Minimized. IMO this can get controversial: shakespeare misuses the parse function — see https://github.com/yesodweb/shakespeare/pull/277 ... and was doing that for the last 13 years :sob:

As also shown by the previous commenter. :eyes:

This should make sense: the SourceName becomes part of SourcePos becomes part of ParseError. With the entire input in SourceName, the increased SourcePos comparisons brought in by #127 create this runtime blowup.

silencespeakstruth commented 1 year ago

Somehow I believe that hashing the SourceName with the help of a proven crypto hash alg like SHA512 could eliminate the problem. Then hash would be computed once and thus minimize comparisons while keeping the #127 which is by no doubt also important. @phadej could that be a valid course of action?

My standpoint is that whatever Shakespeare has been doing wrong all these ~10 years, it worked just fine and thus Shakespeare has to be ignored in this discussion; it's a separate unrelated issue and to be fixed accordingly. @phadej, can you please confirm that the reproducer made by @aadaa-fgtaa satisfies the expectations?

yitz-zoomin commented 1 year ago

I somehow doubt we need a cryptologically hard hash for SourceName. At first glance, it would seem this is just a human-readable string to help debug an error, and uniqueness is not a hard requirement. If so, just take some reasonably sized prefix. I would do that also, even though the real solution for the problem at hand is to make sure we don't churn through the SourceName at every step of the parsec monad.

ulidtko commented 1 year ago

For posterity, shakespeare-2.1.0.1 solves the regression for us. yesodweb/shakespeare#277

Though of course, more users of parsec may be affected.

I'm still failing to understand 2 questions.

One, all that mergeErrors apparently cares about is length of the "messages" lists... "select the longest match". Why should it bother with comparing SourcePos's at all? Their Eq instance makes sense — but Ord instance looks contrived/synthetic and really unnecessary.

Why not, perhaps, this?

@ src/Text/Parsec/Error.hs:147 @ mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
     -- prefer meaningful errors
     | null msgs2 && not (null msgs1) = e1
     | null msgs1 && not (null msgs2) = e2
+    | pos1 == pos2 = ParseError pos1 (msgs1 ++ msgs2)
     | otherwise
-    = case pos1 `compare` pos2 of
+    = case length msgs1 `compare` length msgs2 of
         -- select the longest match
         EQ -> ParseError pos1 (msgs1 ++ msgs2)
         GT -> e1

Second, (I asked this above but didn't get any reply), the empty ok continuation (3rd argument to ParsecT) — is only ever called when the parser accepts empty input. Right?

This'd mean that it is never actually called with any errors. Why bother with merging them then, why not this?

         -- empty-ok case for m
-        meok x s err
-          | errorIsUnknown err = unParser (k x) s cok cerr eok eerr
-          | otherwise =
-            let
-                -- in these cases, (k x) can return as empty
-                pcok = cok
-                peok x s err' = eok x s (mergeError err err')
-                pcerr = cerr
-                peerr err' = eerr (mergeError err err')
-            in  unParser (k x) s pcok pcerr peok peerr
+        meok x s _err = unParser (k x) s cok cerr eok eerr
+
         -- consumed-error case for m
         mcerr = cerr

This passes all tests. And improves performance for the pathological "40kiB filename" users.

phadej commented 1 year ago

For posterity, shakespeare-2.1.0.1 solves the regression for us

You can tell shakespear maintainers that they can revert the fix if they don't like it, as parsec will be fixed.

Please give people time.

about is length of the "messages" lists

Incorrect. About longest parse.

ulidtko commented 1 year ago

@phadej ok. Then the instance Ord SourcePos is incorrect. On current master, mergeErrors selects those errors whose position compares "the greatest" — which means, with lexically biggest SourceName, falling back to largest line (col) number.

phadej commented 1 year ago

Then the instance Ord SourcePos is incorrect.

That's wrong analysis too. Nothing wrong with Ord SourcePos. There's logical error in the setup. See #175

haskell / parsec

The recent fix of the `(>>=)` memory leak seems to cause an enormous performance degradation. #171

Short story

Long story and more fruitful pieces of evidence

Step 1: `perf` profiling

Step 2: `RTS` flags

Step 3: `GHC` intermediate analysis

The conclusion

P.S.

haskell / parsec

The recent fix of the `(>>=)` memory leak seems to cause an enormous performance degradation. #171

Short story

Long story and more fruitful pieces of evidence

Step 1: perf profiling

Step 2: RTS flags

Step 3: GHC intermediate analysis

The conclusion

P.S.

Step 1: `perf` profiling

Step 2: `RTS` flags

Step 3: `GHC` intermediate analysis