jgm opened this issue 4 years ago
Unfortunately, unless something can be done, this issue is probably going to force me to switch back to using yaml in pandoc, which I'm unhappy about -- but people have some large YAML files to process. I tried doing some profiling with explicit SCC annotations. This seems to indicate that most of the time is spent in Data.YAML.Token's c_l_block_seq_entry, which I suppose is what you'd expect for this input, but I wasn't yet able to pin it down further and nothing obvious has jumped out.
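For reference, the kind of explicit annotation described here looks like the following. This is a generic sketch, not HsYAML's actual code; the cost-centre names are just labels borrowed from this thread. Compile with -prof and run with +RTS -p.

module Main (main) where

-- Generic sketch of explicit SCC annotations: each {-# SCC "name" #-}
-- pragma introduces a named cost centre that shows up in the +RTS -p
-- report, attributing time and allocation to that expression.
parseEntry :: Int -> Int
parseEntry n = {-# SCC "c_l_block_seq_entry" #-} n * 2 + 1

main :: IO ()
main = print ({-# SCC "tokenize" #-} sum (map parseEntry [1 .. 1000000 :: Int]))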
Performance is 10 times worse than the yaml package: https://gitlab.haskell.org/haskell/ghcup-hs/-/issues/270
My tests seem to indicate that it's Data.DList.toList:
There's no problem with dlist, as far as I can see, so this profile doesn't tell us where the problem really lies. I tried replacing dlist with Data.Sequence from containers (which is a dependency of this package anyway), and this didn't affect performance significantly. After that, profiling says
97.3 97.3 96.7 Data.YAML.Token tokenize (0)
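For concreteness, the replacement described above amounts to something like the following sketch (Token here is a stand-in for HsYAML's Data.YAML.Token.Token):

module TokenSeq where

import           Data.Foldable (toList)
import           Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Stand-in for the real token type.
data Token = Token deriving Show

-- Seq gives the same O(1) snoc that DList provides...
snocToken :: Seq Token -> Token -> Seq Token
snocToken = (|>)

-- ...and the final conversion to a list, replacing Data.DList.toList.
drainTokens :: Seq Token -> [Token]
drainTokens = toList

-- e.g. drainTokens (Seq.empty `snocToken` Token `snocToken` Token)
--        == [Token, Token]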
It would be nice to make progress on this issue. Maybe the -fprof-late-ccs option announced for GHC 9.4 could help get more insight into this.
I've had a brief look at the Core of Data.YAML.Token. What I noticed so far is that there's a lot of reboxing of Reply values. For example:
-- RHS size: {terms: 12, types: 23, coercions: 6, joins: 0/0}
tokenize135 :: State -> Reply ()
[GblId,
Arity=1,
Str=<L>,
Cpr=1,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=False)}]
tokenize135
= \ (w :: State) ->
case $w$c*>
@() @() (tokenize153 `cast` <Co:3>) (tokenize136 `cast` <Co:3>) w
of
{ (# ww1, ww2, ww3, ww4 #) ->
Reply @() ww1 ww2 ww3 ww4
}
-- RHS size: {terms: 12, types: 23, coercions: 6, joins: 0/0}
tokenize134 :: State -> Reply ()
[GblId,
Arity=1,
Str=<L>,
Cpr=1,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=0,unsat_ok=True,boring_ok=False)}]
tokenize134
= \ (w :: State) ->
case $w$c*>
@() @() (tokenize154 `cast` <Co:3>) (tokenize135 `cast` <Co:3>) w
of
{ (# ww1, ww2, ww3, ww4 #) ->
Reply @() ww1 ww2 ww3 ww4
}
I also thought it was weird that *> isn't inlined. I'm not sure whether that's entirely prevented by its recursive nature, or whether an INLINE pragma or just -O2 could fix it.
Rec {
-- RHS size: {terms: 16, types: 29, coercions: 0, joins: 0/0}
$fApplicativeParser2 [InlPrag=[2]]
:: forall {a} {b}. Parser a -> Parser b -> State -> Reply b
[GblId,
Arity=3,
Str=<1C1(P(1L,L,L,L))><L><L>,
Cpr=1,
Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=ALWAYS_IF(arity=3,unsat_ok=True,boring_ok=False)}]
$fApplicativeParser2
= \ (@a) (@b) (w :: Parser a) (w1 :: Parser b) (w2 :: State) ->
case $w$c*> @a @b w w1 w2 of { (# ww1, ww2, ww3, ww4 #) ->
Reply @b ww1 ww2 ww3 ww4
}
-- RHS size: {terms: 34, types: 60, coercions: 5, joins: 0/0}
$w$c*> [InlPrag=[2], Occ=LoopBreaker]
:: forall {a} {b}.
Parser a
-> Parser b
-> State
-> (# Result b, DList Token, Maybe Decision, State #)
[GblId, Arity=3, Str=<1C1(P(1L,L,L,L))><L><L>, Unf=OtherCon []]
$w$c*>
= \ (@a) (@b) (w :: Parser a) (w1 :: Parser b) (w2 :: State) ->
case (w `cast` <Co:2>) w2 of { Reply ds206 ds207 ds208 ds209 ->
case ds206 of {
Failed message2 -> (# Failed @b message2, ds207, ds208, ds209 #);
Result ds210 -> (# More @b w1, ds207, ds208, ds209 #);
More parser6 ->
(# More @b (($fApplicativeParser2 @a @b parser6 w1) `cast` <Co:3>),
ds207, ds208, ds209 #)
}
}
end Rec }
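To make the structure concrete, here's a simplified model of the combinator (a sketch under stated assumptions, not HsYAML's actual definitions):

{-# LANGUAGE DeriveFunctor #-}
module ParserSketch where

-- Simplified model: the real Reply carries four fields (Result,
-- DList Token, Maybe Decision, State); here Int stands in for State.
-- The shape of (*>) matches the Core above: it recurses through the
-- More constructor, so GHC marks it a loop breaker (see Occ=LoopBreaker)
-- and will not inline it. An INLINE pragma on it is then effectively a
-- no-op, and -O2 alone does not change that.
data Result a = Failed String | Result a | More (Parser a)
  deriving Functor

data Reply a = Reply (Result a) Int
  deriving Functor

newtype Parser a = Parser { runParser :: Int -> Reply a }

instance Functor Parser where
  fmap f (Parser p) = Parser (fmap f . p)

instance Applicative Parser where
  pure x = Parser (Reply (Result x))
  {-# INLINE pure #-}
  Parser p <*> q = Parser $ \s ->
    case p s of
      Reply (Failed e) s' -> Reply (Failed e) s'
      Reply (Result f) s' -> Reply (More (fmap f q)) s'
      Reply (More p')  s' -> Reply (More (p' <*> q)) s'
  p *> q = Parser $ \s ->
    case runParser p s of
      Reply (Failed e) s' -> Reply (Failed e) s'
      Reply (Result _) s' -> Reply (More q) s'
      Reply (More p')  s' -> Reply (More (p' *> q)) s'
  {-# INLINE (*>) #-}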
Maybe it would also be helpful to define Reply as an unlifted type.
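A sketch of what that could look like with the UnliftedDatatypes extension (GHC >= 9.2); the field types here are stand-ins, not HsYAML's actual ones:

{-# LANGUAGE UnliftedDatatypes #-}
{-# LANGUAGE StandaloneKindSignatures #-}
-- An unlifted Reply can never be a thunk, so case analysis on it
-- compiles to plain field access with no evaluation check.
module UnliftedReply where

import Data.Kind (Type)
import GHC.Exts (UnliftedType)

data Result a = Failed String | Result a

type Reply :: Type -> UnliftedType
data Reply a = Reply (Result a) Int   -- Int stands in for State

replyResult :: Reply a -> Result a
replyResult (Reply r _) = r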
I also noticed that this package uses parsec instead of megaparsec, which is supposed to be better optimized.
https://wg21.link/index.yaml could be used for benchmarking.
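A hedged benchmark sketch against that file, assuming criterion as the harness; decode1 is Data.YAML.decode1 from HsYAML 0.2 (it expects a single-document stream), and length . show is a cheap stand-in for an NFData instance to force the whole parsed tree:

module Main (main) where

import           Criterion.Main
import qualified Data.ByteString.Lazy as BL
import           Data.YAML (Node, Pos, decode1)

-- Force the full result: show traverses every node.
forceParse :: BL.ByteString -> Int
forceParse bs =
  case decode1 bs :: Either (Pos, String) (Node Pos) of
    Left (_, err) -> length err
    Right node    -> length (show node)

main :: IO ()
main = do
  yaml <- BL.readFile "index.yaml"  -- saved from https://wg21.link/index.yaml
  defaultMain [bench "HsYAML/decode1" (whnf forceParse yaml)]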
It would be nice to make progress on this issue. Maybe the -fprof-late-ccs option announced for GHC 9.4 could help get more insight into this.
I've given this a spin, building with cabal build -w ghc-9.4 --enable-profiling --profiling-detail=none and the following patch:
--- a/HsYAML.cabal
+++ b/HsYAML.cabal
@@ -108,7 +108,7 @@ library
if !impl(ghc >= 7.10)
build-depends: nats >= 1.1.2 && < 1.2
- ghc-options: -Wall
+ ghc-options: -Wall -fprof-late
executable yaml-test
hs-source-dirs: src-test
@@ -133,7 +133,7 @@ executable yaml-test
else
buildable: False
- ghc-options: -rtsopts
+ ghc-options: -rtsopts -fprof-late
test-suite tests
default-language: Haskell2010
I then profiled the following command:
cat wg21.yaml | yaml-test yaml2event0 +RTS -p
…where wg21.yaml is the file from https://wg21.link/index.yaml.
Results:
COST CENTRE          MODULE           SRC                                 %time %alloc
$w$c*> Data.YAML.Token <no location info> 47.1 17.3
$c*> Data.YAML.Token <no location info> 18.6 7.4
$wdecideParser Data.YAML.Token src/Data/YAML/Token.hs:599:7-18 5.2 4.1
$wnextIf Data.YAML.Token <no location info> 4.1 18.4
$sprefixErrorWith Data.YAML.Token <no location info> 2.7 7.1
$wchoiceParser Data.YAML.Token <no location info> 1.9 4.0
$wrejectParser Data.YAML.Token src/Data/YAML/Token.hs:675:5-16 1.4 0.0
choiceParser Data.YAML.Token <no location info> 1.3 1.2
$wfinishToken Data.YAML.Token <no location info> 1.2 4.2
$wwithParser Data.YAML.Token <no location info> 1.2 3.5
sol Data.YAML.Token <no location info> 1.0 1.8
value Data.YAML.Token <no location info> 0.9 1.8
c_forbidden Data.YAML.Token <no location info> 0.8 1.8
$srecovery Data.YAML.Token <no location info> 0.7 1.8
$w$stoken Data.YAML.Token <no location info> 0.6 3.8
$wemptyToken Data.YAML.Token <no location info> 0.5 3.0
append Data.DList <no location info> 0.4 1.6
$w$stoken Data.YAML.Token <no location info> 0.2 1.2
I'm not quite sure what $c*> is – I can't find it in the generated Core or STG. Maybe it's an artifact of the -fprof-late mode.
STG for the other top cost centers:
$w$c*>
$wdecideParser
$wnextIf
$sprefixErrorWith
I think some allocations could be avoided by turning Reply and Result into unlifted, unboxed data structures. State is probably allocated with similar frequency, but since it has so many fields, making it unlifted and unboxed might result in too much register pressure. It might be helpful to compress some of these fields into a bit field.
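A sketch of the bit-field idea (these particular fields and widths are assumptions for illustration, not HsYAML's actual State layout):

module PackedState where

import Data.Bits (complement, shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- Pack three small counters into one Word64:
-- bits 0-27 line, bits 28-55 column, bits 56-63 nesting depth.
newtype Packed = Packed Word64

mkPacked :: Word64 -> Word64 -> Word64 -> Packed
mkPacked line col depth =
  Packed (   (line  .&. 0x0FFFFFFF)
         .|. ((col   .&. 0x0FFFFFFF) `shiftL` 28)
         .|. ((depth .&. 0xFF)       `shiftL` 56))

pLine, pCol, pDepth :: Packed -> Word64
pLine  (Packed w) =  w              .&. 0x0FFFFFFF
pCol   (Packed w) = (w `shiftR` 28) .&. 0x0FFFFFFF
pDepth (Packed w) =  w `shiftR` 56

-- Updating one field clears its slot and ORs in the new value.
bumpLine :: Packed -> Packed
bumpLine p@(Packed w) =
  Packed ((w .&. complement 0x0FFFFFFF) .|. ((pLine p + 1) .&. 0x0FFFFFFF))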
I wonder how far we can get by tweaking the existing code though. It's clear that the priority has been to fully comply with the YAML spec. Getting good performance out of the same code might be rather tricky.
One pandoc user has run into an issue with a large (100k line) bibliography in YAML format (for details see jgm/pandoc#6084). Prior to pandoc 2.8 (when we used the yaml package), this was handled fairly quickly, but now that we use HsYAML it takes 18 seconds to read the bibliography. I confirmed that the slowdown is due to HsYAML by loading the file in a GHCi session and trying the parse directly.

What are the performance expectations for HsYAML? Have you made efforts to optimize here? aeson claimed decoding speeds of 46M/sec on a slower machine than mine; this file is 3M. I wouldn't expect YAML parsing to be as fast as JSON parsing, but it would be nice to get into the 4M/sec range (10x slower than aeson).
EDIT: 82G allocated with 1G max residency seems an awful lot to parse a 3M file!
Profiling reports these as the biggest cost centers:
Heap profiling shows that the DLists account for a lot of the allocation.
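For anyone wanting to reproduce these numbers, a minimal sketch follows; biblio.yaml is a stand-in name for the bibliography, and decode, Node, and Pos are from HsYAML 0.2's Data.YAML. Compile with -rtsopts and run with +RTS -s to see total allocation and max residency:

module Main (main) where

import qualified Data.ByteString.Lazy as BL
import           Data.YAML (Node, Pos, decode)

main :: IO ()
main = do
  bs <- BL.readFile "biblio.yaml"
  -- Decode the stream into generic nodes and report how many
  -- documents were parsed (or where the parse failed).
  case decode bs :: Either (Pos, String) [Node Pos] of
    Left (pos, err) -> putStrLn ("parse error at " ++ show pos ++ ": " ++ err)
    Right docs      -> print (length docs)

The heap-profile view comes from running a profiled build with +RTS -hc (by cost centre) or -hy (by type) and rendering the resulting .hp file with hp2ps.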