Closed silencespeakstruth closed 1 year ago
instance Ord what
ping @wz1000. Now I regret that I merged https://github.com/haskell/parsec/pull/128 without requiring a benchmark
TBH, I could undo #127 fix, as its motivation is replicating pure parser, i.e. somewhat contrived. This is more substantial. But either way, there should be a benchmark to verify that the change works (e.g. across different GHC versions etc. if it's an issue on all of them to begin with)
@phadej, thanks for a quick reply. Following your questions:
From what perf
is able to capture, there is no way to give a definite answer of which instance Ord
that is. Based on how GHC Core dump looks, mergeError
is the primary suspect. I believe GHC uses a deterministic algorithm for naming things, so its origin could be traced down by the parts of which ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
consists; no certainty here - I may be wrong.
There were different benchmarks. In the beginning, the slowdown was noticed for the unit and integration tests we have. There are many of these, so I narrowed the problem down to a few. For quite some time I investigated the problem based on them. Later on another - much more minimized and precise - benchmark was created, but it uses the shakespeare-2.1.0
package to call a parsec
(because this is how the problem revealed itself originally). Either way yields exactly the same observations, so I omitted all the unrelated details and kept my report simple and concise. Unfortunately, both do use internal code under a NDA and I can't not share it due to legal restrictions.
In essence, every benchmark is just a call to the parseLines
with some proprietary shakespeare
-specific template, which is nothing but a text to be parsed and transformed later on. A typical template would be like five hundred lines of text with lots of @{url}
, #{variable}
, ^{fragment}
, all of which are again shakespeare
-specific. parseDoc
is the entry point - but again, I intentionally avoided all these details so as not to distract reader's attention from the important bits.
Both GHC 9.0.2 and 9.2.8 were involved in benchmarking. The issue seems to disappear once you revert the #127 and its fix (#128). The other possible solution is to forcibly keep an old parsec
package around, for example parsec-3.1.14.0.
A small(ish) reproducer would be good to have.
ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
is instance Ord String
λ> import GHC.Utils.Encoding
λ> zDecodeString "ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc"
"ghc-prim_GHC.Classes_$fOrd[]_$s$c"
I haven't tested but this patch might help:
diff --git a/src/Text/Parsec/Error.hs b/src/Text/Parsec/Error.hs
index 9cf582a..ce0f0b5 100644
--- a/src/Text/Parsec/Error.hs
+++ b/src/Text/Parsec/Error.hs
@@ -142,8 +142,8 @@ setErrorMessage msg (ParseError pos msgs)
mergeError :: ParseError -> ParseError -> ParseError
mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
-- prefer meaningful errors
- | null msgs2 && not (null msgs1) = e1
- | null msgs1 && not (null msgs2) = e2
+ | null msgs1 = e2
+ | null msgs2 = e1
| otherwise
= case pos1 `compare` pos2 of
-- select the longest match
I haven't tested but this patch might help:
We need a small reproducer, I want make another mistake of accepting a performance related patch without a benchmark. Sorry.
I don't really rely on parsec critically for any performance related code, I put up this diff as well as the original patch to be helpful to others. I did not really need it to be merged.
For the record, the leak in #127 was affecting non-trivial ParsecT
code, where a big loop lifting a monadic action into ParsecT
leaked memory: loopManyTimes $ lift someMonadicAction :: ParsecT _ _ M _
Interesting! pos1 `compare` pos2
is comparing SourcePos
, which is defined as:
type SourceName = String
type Line = Int
type Column = Int
data SourcePos = SourcePos SourceName !Line !Column
deriving ( Eq, Ord, Data, Typeable)
That stock-derived Ord
is going to first inspect the SourceName ~ String ~ [Char]
. That could trigger the Ord [a]
dictionary.
Yes, I was going to try to use different comparator in mergeError
, but I repeat: without a reproducer no change goes in.
Hi all. I'm collaborating with @silencespeakstruth on the same issue.
We're still working out a shareable reproducer; sorry for starting the discussion somewhat "prematurely".
@wz1000 I tested the patch — doesn't seem to make any difference (both msgs1
and msgs2
get forced to WHNF anyhow).
@parsonsmatt good point, I tried this:
instance Ord SourcePos where
compare (SourcePos snA lnA colA) (SourcePos snB lnB colB)
= lnA `compare` lnB <> colA `compare` colB <> snA `compare` snB
— it does give an improvement in allocations and a few % speedup. Looks like a sensible improvement in general.
However, we're measuring 100⨯ slowdown of parsec-3.15 relative to parsec-3.14 — in a use-case involving big-ish inputs to yesodweb/shakespeare (as an overall context; didn't see this clearly stated above).
Obviously, a couple % on the background of 100⨯ feels far from addressing the root cause — which we haven't yet pinned down, neither differentiated from a deficiency in shakespeare's parser.
Question 1. Any good reason I'm missing for the "empty ok" continuation in parserBind
to concern itself with errors?
This passes all tests of parsec and shakespeare, and cuts down the calls to mergeError
:
-- empty-ok case for m
- meok x s err
- | errorIsUnknown err = unParser (k x) s cok cerr eok eerr
- | otherwise =
- let
- -- in these cases, (k x) can return as empty
- pcok = cok
- peok x s err' = eok x s (mergeError err err')
- pcerr = cerr
- peerr err' = eerr (mergeError err err')
- in unParser (k x) s pcok pcerr peok peerr
+ meok x s _err = unParser (k x) s cok cerr eok eerr
+
-- consumed-error case for m
mcerr = cerr
Can a parser's "empty ok" ever be called with ParseError
distinct from unknownError
?
Question 2. The {-# INLINE parserBind #-}
pragma doesn't look good to me:
https://github.com/haskell/parsec/blob/c5add8bd1da56ee11fd327409b3b46aa8015a974/src/Text/Parsec/Prim.hs#L316-L318
The function isn't "small". Perhaps INLINABLE
?
Question 3. This implementation for many
raises doubts:
I understand it may well predate Alternative
and Applicative
... why not Control.Applicative.many
instead?
The {-# INLINE parserBind #-} pragma doesn't look good to me: The function isn't "small". Perhaps INLINABLE ?
Bind not inlining usually pessimises the demand analyser quite a bit for monadic code, so I would be wary of such a change.
Does removing the inline pragma help with your reproducer?
... not really :sweat_smile: With results cherry-picking, there's maybe a 20ms speedup without the {-# INLINE #-}
— but that really needs more rigorous measurement with criterion (that takes recompiles). Nevermind @wz1000, it wouldn't be smart to split hairs over this.
The biggest-impact tweak I tested is the meok
diff (Question 1). It brings our benchmark from 2640 ms to 380 ms. The baseline of this benchmark, with parsec-3.1.14 or #127 reverted — 25 ms.
This implementation for many raises doubts:
It's "correct". What's wrong is that Alternative ParsecT
doesn't define some
and many
, I thought it did. I'll fix that.
The many
implementation itself is fine, and being able to define specific many
and some
s is quite important. (reverse
is suspicious, so maybe it could accumulate using a DList
, but I don't think that matters in practice)
@phadej sure, understood. FWIW, I'm measuring no difference in runtime between this many
and the blanket Alternative's one.
We've got an approve to share the reproducer.
Steps:
git clone --branch bench/parsec-3.15 https://github.com/zoominsoftware/shakespeare
cd shakespeare
cabal bench
On my machine, that yields:
Parsed successfully in 2556 cpu-milliseconds
As a baseline, this same benchmark without #127:
Parsed successfully in 35 cpu-milliseconds
I'm using cabal.project
to take parsec from a local copy to do that, applying git revert b00129b5c18be6b2ffba3519761d90bc7ad0764c
on top of parsec's master in that copy. There's a stack.yaml
prepared for the same in that branch.
Local GHC version: 9.2.7
That is not a small reproducer. Please try to minimise.
In fact I was mistaken, (I'm sincerelly sorry @wz1000, I forgot I added the test-suite), there is a test for #127 in parsec
, i.e. parsec-issue127
test-suite. I'd like to have something similar for this issue.
I.e. today if I run cabal run parsec-issue127 -- +RTS -s
, I get something like
Right ()
152,052,256 bytes allocated in the heap
19,624 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
But if b00129b5c18be6b2ffba3519761d90bc7ad0764c is reverted, then
Right ()
208,052,256 bytes allocated in the heap
343,729,016 bytes copied during GC
52,524,512 bytes maximum residency (8 sample(s))
1,432,096 bytes maximum slop
50 MB total memory in use (0 MB lost due to fragmentation)
i.e. a memory leak. So b00129b5c18be6b2ffba3519761d90bc7ad0764c
is not going to be reverted just because there is another performance issue.
I don't want to decide which is more important. I'd like to fix both. But to do that, I need a small reproducer for this issue.
Here is a simple reproducer:
import Data.Semigroup
import System.CPUTime
import Text.ParserCombinators.Parsec
main :: IO ()
main = do
time0 <- getCPUTime
check $ stimes 10000 "a "
time1 <- getCPUTime
print $ (time1 - time0) `div` 1000000000
parser :: Parser [String]
parser = many (char 'a' <|> char 'b') `sepBy` char ' '
check :: String -> IO ()
check s = putStrLn $ either show (const "") $ parse parser {- important: pass input as SourceName -} s s
Takes 20 ms with parsec 3.1.14.0 and 1180ms with parsec 3.1.16.0
Just to be clear, I'm not arguing to revert b00129b. The fix by @wz1000 is entirely sensible; the only reason it's mentioned at all — is this empirical case, where reverting it somehow appears to fix a 100x slowdown.
... Minimized. IMO this can get controversial: shakespeare
misuses the parse
function — see https://github.com/yesodweb/shakespeare/pull/277 ... and was doing that for the last 13 years :sob:
As also shown by the previous commenter. :eyes:
This should make sense: the SourceName
becomes part of SourcePos
becomes part of ParseError
. With the entire input in SourceName
, the increased SourcePos comparisons brought in by #127 create this runtime blowup.
Somehow I believe that hashing the SourceName
with the help of a proven crypto hash alg like SHA512
could eliminate the problem. Then hash would be computed once and thus minimize comparisons while keeping the #127 which is by no doubt also important. @phadej could that be a valid course of action?
My standpoint is that whatever Shakespeare has been doing wrong all these ~10 years, it worked just fine and thus Shakespeare has to be ignored in this discussion; it's a separate unrelated issue and to be fixed accordingly. @phadej, can you please confirm that the reproducer made by @aadaa-fgtaa satisfies the expectations?
I somehow doubt we need a cryptologically hard hash for SourceName
. At first glance, it would seem this is just a human-readable string to help debug an error, and uniqueness is not a hard requirement. If so, just take some reasonably sized prefix. I would do that also, even though the real solution for the problem at hand is to make sure we don't churn through the SourceName
at every step of the parsec monad.
For posterity, shakespeare-2.1.0.1 solves the regression for us. yesodweb/shakespeare#277
Though of course, more users of parsec may be affected.
I'm still failing to understand 2 questions.
One, all that mergeErrors
apparently cares about is length of the "messages" lists... "select the longest match". Why should it bother with comparing SourcePos
's at all? Their Eq instance makes sense — but Ord instance looks contrived/synthetic and really unnecessary.
Why not, perhaps, this?
@ src/Text/Parsec/Error.hs:147 @ mergeError e1@(ParseError pos1 msgs1) e2@(ParseError pos2 msgs2)
-- prefer meaningful errors
| null msgs2 && not (null msgs1) = e1
| null msgs1 && not (null msgs2) = e2
+ | pos1 == pos2 = ParseError pos1 (msgs1 ++ msgs2)
| otherwise
- = case pos1 `compare` pos2 of
+ = case length msgs1 `compare` length msgs2 of
-- select the longest match
EQ -> ParseError pos1 (msgs1 ++ msgs2)
GT -> e1
Second, (I asked this above but didn't get any reply), the empty ok continuation (3rd argument to ParsecT
) — is only ever called when the parser accepts empty input. Right?
This'd mean that it is never actually called with any errors. Why bother with merging them then, why not this?
-- empty-ok case for m
- meok x s err
- | errorIsUnknown err = unParser (k x) s cok cerr eok eerr
- | otherwise =
- let
- -- in these cases, (k x) can return as empty
- pcok = cok
- peok x s err' = eok x s (mergeError err err')
- pcerr = cerr
- peerr err' = eerr (mergeError err err')
- in unParser (k x) s pcok pcerr peok peerr
+ meok x s _err = unParser (k x) s cok cerr eok eerr
+
-- consumed-error case for m
mcerr = cerr
This passes all tests. And improves performance for the pathological "40kiB filename" users.
For posterity, shakespeare-2.1.0.1 solves the regression for us
You can tell shakespear maintainers that they can revert the fix if they don't like it, as parsec
will be fixed.
Please give people time.
about is length of the "messages" lists
Incorrect. About longest parse.
@phadej ok. Then the instance Ord SourcePos
is incorrect. On current master, mergeErrors
selects those errors whose position compares "the greatest" — which means, with lexically biggest SourceName
, falling back to largest line (col) number.
Then the instance Ord SourcePos is incorrect.
That's wrong analysis too. Nothing wrong with Ord SourcePos
. There's logical error in the setup. See #175
To whom it may concern...
Short story
An attempt to migrate our production to the LTS-20.26 (GHC 9.2.8) accidentally discovered a significant performance downfall.
Without the #127:
...and with it:
Based on these (and many others, shared down below) observations, I tend to conclude that #127 performs about ~100 times slower.
Long story and more fruitful pieces of evidence
Once unwanted slowness was spotted, a drill-down analysis was carried out. Below I am going to describe every step taken and its outcome.
Step 1:
perf
profilingperf record -F 33 -g -e cycles:u -- $EXE +RTS -g2 -RTS
was used to profile the executable ($EXE
) built with the-O 2
flag to ensure maximum optimization. Both profiles then were compared against each other. The following was observed for the slowed-down version of the code:Cleary,
ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
stands forinstance Ord
implementation.98,53%
stands for "98,54% of all the userspace cycles were spent executing theghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
". That directed the investigation towardsRTS
-based flag instrumentation in order to be able to get a deeper insight into where these comparisons come from.Step 2:
RTS
flags$EXE +RTS -P -RTS
produced a Haskell-native runtime profile, which then was visualized with a flame graph using theghc-prof-flamegraph
command line utility.Comparing the flame graphs clearly highlighted that the
parseLines :: HamletSettings -> String -> Result (Maybe NewlineStyle, HamletSettings, [(Int, Line)])
started to eat up significantly more CPU time [^1].That pointed here (to the
Parsec
itself) and its changelog, where #127 popped up as a potential root cause. Reverting the #128 only (which fixed the #127) indeed rollbacks performance back to the expected measure. Trying to figure out what's happening it was decided to dump GHC intermediate representation.Step 3:
GHC
intermediate analysisThe problematic piece of code was built with the
-ddump-simpl -dsuppress-all -dno-suppress-type-signatures
and those were compared. It turned out that the misperforming version generates more code of the following pattern:and the
$fOrd[]_$s$ccompare1
method called about 8 times more often. Unfortunately, I lack knowledge of Haskell internals and thus am unable to analyze the intermediate representation further.The conclusion
To my understanding, these 3 steps advocate #127 to be the root cause. Yet I fail to explain how could that happen. As of now, a hacky hotfix allows us to migrate to the
GHC 9.2.8
avoiding the regression, however, I would like to open a discussion on how one could possibly fix it. The open questions are:-ddump-simpl -dsuppress-all -dno-suppress-type-signatures
flags and can it help to make the issues even more concrete?P.S.
At the moment the code involved in instrumentation lies under an NDA. Thus, I do not share it intentionally due to legal restrictions. I hope the observations above will be sufficient for this discussion to progress.
[^1]: The
shakespeare-2.1.0
is running in the production.