Oh, and if we can figure out how to make the branches disjoint in $whitespace, it might improve things some more, since $spaces->space gets hit much more than $line-break. This might not be possible though, I don't know.
Thank you for looking at this!
Have you played around with the <or> ordering any?
Not with performance in mind. Only with correctness in mind -- and noticing that a grammar fix sometimes made the performance slower. I should have thought to try working that in the other direction, as you're suggesting. Good idea!
The tests give fairly good coverage -- especially if you also run suite-test.rkt. Which you can't, unfortunately, because I didn't set it up that way due to licensing. I should figure out a way to set that up.
I think it's just that the bracketed style of the grammar leads to too much backtracking, which my parser doesn't handle well.
To a naive user (me) the backtracking is invisible. It might be interesting to have a debugging switch to print info about this, so users could see/feel the cost. (The trick might be how to summarize/present the info. I'm guessing a raw trace would be way too verbose. Maybe you need a dB scale? :))
Anyway -- thanks again for following up! I will try the changes you suggested, and look for others in that vein.
Oops, right, I meant to say that all tests pass except suite-test.rkt. Forgot that I disabled it.
To a naive user (me) the backtracking is invisible.
Yeah, it is sort of unintuitive because normally you expect branching to be a fast check, but in this case each <or> choice is possibly a chain of parsers, and it may not fail until you get to the last in the chain, in which case you have to go back and start at the beginning with the next <or> choice.
One easy thing to do is to make sure the <or> choices don't have redundancy in common prefixes. Other string-matching heuristics could probably be applied as well.
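For example, here is a minimal sketch (not taken from the markdown grammar) of factoring a common prefix out of two <or> choices, assuming Parsack's <or>, try, string, char, parser-seq, and parse-result as documented:

#lang racket
(require parsack)

;; Redundant common prefix: when "abc" fails at its last character, try
;; backtracks and the second choice re-parses "ab" from the beginning.
(define $redundant
  (<or> (try (string "abc"))
        (string "abd")))

;; Factored: the shared "ab" prefix is parsed once, and only the final
;; character is branched on, so no backtracking is needed.
(define $factored
  (parser-seq (string "ab")
              (<or> (char #\c) (char #\d))))

(parse-result $redundant "abd") ; re-parses "ab" after the first choice fails
(parse-result $factored "abd")  ; parses "ab" once

This is the same idea as making the $whitespace branches disjoint: do the cheap, shared work once, and branch only on what actually distinguishes the choices.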
It might be interesting to have a debugging switch to print info about this.
I'll try to check in my "profiler" later today. It essentially just counts how often each <or> choice is triggered in a program run. And ideally, assuming a representative run, the most common should go first (and assuming no other dependency constraints).
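That profiler isn't shown in this thread, but here is a rough sketch of the counting idea. It only assumes Parsack's >>=, return, <or>, many, $letter, $digit, char, and parse-result as documented; the $token parser and the labels are made up for illustration, and it counts successes rather than attempts:

#lang racket
(require parsack)

;; Tally how often each labeled alternative succeeds during a parse.
(define choice-counts (make-hash))

(define (counted label p)
  (>>= p (lambda (result)
           (hash-update! choice-counts label add1 0)
           (return result))))

;; A made-up <or> whose branches we want to rank by observed frequency.
(define $token
  (<or> (counted 'letter $letter)
        (counted 'digit  $digit)
        (counted 'space  (char #\space))))

(parse-result (many $token) "ab 12 cd")
choice-counts ; whichever label dominates is the branch to try first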
I ran into some severe issues using markdown on files containing lists. Specifically, I have a file with a list of length 1600, and... it takes a long, long time to process. I decided to run some experiments, using this file:
#lang racket
(require markdown/parse)

(define markdown-header
  #<<|
# My beautiful Markdown File

I have some things to say. Here they are:
|
  )

;; One markdown bullet: a link whose text and target use the same random string.
(define (random-list-entry)
  (define r (random-string))
  (~a "- [" r "](http://example.com/" r ")\n"))

;; Five random lowercase ASCII letters.
(define (random-string)
  (apply
   string-append
   (for/list ([i 5])
     (bytes->string/utf-8 (bytes (+ (bytes-ref #"a" 0) (random 26)))))))

;; Write a temp file with the header plus `size` list entries, then parse it.
(define (run-trial size)
  (define f (make-temporary-file))
  (call-with-output-file f
    (lambda (port)
      (display markdown-header port)
      (for ([i size])
        (display (random-list-entry) port)))
    #:exists 'truncate)
  (parse-markdown f))

(for/list ([size (in-range 0 5000 500)])
  (define-values (_1 cpu-msec _2 _3) (time-apply run-trial (list size)))
  (printf "(~a ~a)\n" size cpu-msec))
Then, I ran it for sizes from 0 up to 4500 in increments of 500, and plotted them. Here's my data:
#lang racket
(require plot)

(define data2
  '((0 8)
    (500 928)
    (1000 3375)
    (1500 6623)
    (2000 11084)
    (2500 17008)
    (3000 24204)
    (3500 33739)
    (4000 43472)
    (4500 56053)))

(plot (list (lines data2)
            (function (lambda (x)
                        (* 0.0027 (* x x))))))
... and here's the result (the 0.0027 is just me trying to fit a curve):
Basically, it's n^2. As a sanity check on the fit, 0.0027 * 4500^2 is about 54,700 ms, close to the measured 56,053 ms, and doubling the size from 2000 to 4000 entries roughly quadruples the time (11,084 ms to 43,472 ms).
I'm assuming that this is backtracking, probably on the "is the list over yet" rule.
Is there any obvious way to restructure the parsing rules to speed this up? I would think this would be a standard problem for combinator-based parsers.
Parsing markdown lists is a bit tricky, due among other things to the need to not treat --- or - - - (horizontal rules) as unordered list items.

The existing grammar for lists originated from my attempted direct translation of Pandoc's Haskell Parsec code, a couple years ago. At a quick glance, that Pandoc code has evolved since then -- perhaps for performance tuning like this. I'll try to find a block of time in the near future to study that, as well as think about it fresh myself.
Meanwhile, if you were feeling antsy and ambitious and wanted to attempt a PR? :) The relevant code is here. (Again, the latest Pandoc code for this is here.)
Although my brain is having some cache misses wrt Parsack, I think all these notFollowedBys are expensive and I'd seek to minimize or eliminate them somehow.
(define $list-line
  (try (pdo (notFollowedBy $list-start)
            (notFollowedBy $blank-line)
            (notFollowedBy (pdo-seq $indent
                                    (many $space-char)
                                    $list-start))
            (xs <- (manyTill $anyChar $newline))
            (return (string-append (list->string xs) "\n")))))
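To make the cost concrete: notFollowedBy runs its parser as a look-ahead and succeeds only if that parser fails, consuming no input, so every line of a list is scanned by three look-aheads before manyTill reads it for real. Here is a minimal sketch of the pattern using Parsack's parser-compose directly (the $non-bullet-line parser is made up for illustration, not part of the markdown grammar):

#lang racket
(require parsack)

;; Succeeds only on lines that do not start with "-": the notFollowedBy
;; look-ahead inspects the upcoming input, then manyTill scans it again.
(define $non-bullet-line
  (parser-compose (notFollowedBy (char #\-))
                  (xs <- (manyTill $anyChar $newline))
                  (return (list->string xs))))

(parse-result $non-bullet-line "plain text\n")  ; => "plain text"
;; (parse-result $non-bullet-line "- bullet\n") ; would raise a parse error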
@jbclements I finally got some time to reload my brain and work on this, over the last couple days. TL;DR: I can't optimize until I improve my mental model; I don't have a good intuition for where it spends time. Also, the Racket profiler results are "opaque" (to me, so far). Although I'll keep working on it, realistically it's not likely to improve soon, at least not without me getting some help. One sort of help would be to crib from Pandoc, but I'd prefer not to -- and anyway I don't even know how much that would help (i.e., how much of this is the grammar vs. how much is Parsec vs. Parsack).
@stchang You mentioned: "I think it's just that the bracketed style of the grammar leads to too much backtracking, which my parser doesn't handle well." How difficult would it be to improve backtracking in Parsack? Do you think this is true of Parsack but not Parsec, or is it something inherent to all Parsec-like combinator libraries?
It's probably a combination of culprits. Parsing Markdown naturally backtracks a lot, but my parser implementation is very rudimentary as well. It is based on the original Parsec paper and could probably benefit from looking at the current implementation. I'll try to take a look later today to see if I can improve this case.
As I'm looking at this, one interesting (non-performance-related) thing is that it's not even parsing as lists (i.e., <li>s).
According to babelmark, this is not uncommon I guess: http://johnmacfarlane.net/babelmark2/?text=%23+My+beautiful+Markdown+File%0A%0AI+have+some+things+to+say.+Here+they+are%3A%0A-+%5Brzvvy%5D(http%3A%2F%2Fexample.com%2Frzvvy)%0A-+%5Btcjtz%5D(http%3A%2F%2Fexample.com%2Ftcjtz)%0A-+%5Bdrzgy%5D(http%3A%2F%2Fexample.com%2Fdrzgy)%0A-+%5Bkheng%5D(http%3A%2F%2Fexample.com%2Fkheng)%0A-+%5Bpbriy%5D(http%3A%2F%2Fexample.com%2Fpbriy)%0A-+%5Bxhllj%5D(http%3A%2F%2Fexample.com%2Fxhllj)%0A-+%5Bsrokl%5D(http%3A%2F%2Fexample.com%2Fsrokl)%0A
So I didn't get to the root of the list parsing issues, but I just pushed a change https://github.com/stchang/parsack/commit/beddbd2ba7de85745c1023a0177d16efd6fe16a1 that improves parsack's performance by 3-4x, according to the markdown performance tests:
Before:
$ raco test perf-test.rkt
raco test: (submod "perf-test.rkt" test)
Strict mode:
Using /home/stchang/racket/markdown-stchang/markdown/test/test.md
appended 5 times: 17398 chars and 974 lines. Doing 5 timings...
Timings: 888, 928, 954, 976, 1000 (sorted)
Average: 949.2
Non-strict mode (with extensions):
Using /home/stchang/racket/markdown-stchang/markdown/test/test.md
appended 5 times: 17398 chars and 974 lines. Doing 5 timings...
Timings: 1027, 1080, 1185, 1195, 1205 (sorted)
Average: 1138.4
6 tests passed
After:
$ raco test perf-test.rkt
raco test: (submod "perf-test.rkt" test)
Strict mode:
Using /home/stchang/racket/markdown-stchang/markdown/test/test.md
appended 5 times: 17398 chars and 974 lines. Doing 5 timings...
Timings: 301, 302, 309, 311, 328 (sorted)
Average: 310.2
Non-strict mode (with extensions):
Using /home/stchang/racket/markdown-stchang/markdown/test/test.md
appended 5 times: 17398 chars and 974 lines. Doing 5 timings...
Timings: 314, 314, 317, 318, 319 (sorted)
Average: 316.4
6 tests passed
@jbclements's code improves by 2-3x.
On Jun 19, 2015, at 4:41 PM, Stephen Chang notifications@github.com wrote:
So I didn't get to the root of list parsing, but I just pushed a change beddbd2ba7de85745c1023a0177d16efd6fe16a1 that improves parsack's performance by 3-4x, according to the markdown performance tests:
Excellent. Much appreciated.
I didn't mention this before, but I tried using parsack for breaking files into words, and it was really unusable. Looking forward to improved performance in frog and elsewhere!
Best,
John
@stchang wrote:
As I'm looking at this, one interesting (non-performance-related) thing is that it's not even parsing as lists (i.e., <li>s).
Good point. A blank line before the first list item will make this unambiguous. That is, in @jbclements's example, change (display markdown-header port) to (displayln markdown-header port). Or:
> (parse-markdown "Para\n- still para")
'((p () "Para - still para"))
> (parse-markdown "Para\n\n- list item")
'((p () "Para") (ul () (li () "list item")))
However, back to performance: that doesn't make it parse faster; on the contrary.
Here are the numbers from @jbclements's list test, after the most recent parsack update:
Before:
(0 36)
(500 940)
(1000 4944)
(1500 15017)
(2000 25270)
(2500 41599)
(3000 59155)
(3500 78989)
<arg, killed>
After:
$ racket lists.rkt
(0 4)
(500 108)
(1000 164)
(1500 240)
(2000 328)
(2500 400)
(3000 488)
(3500 612)
(4000 648)
(4500 728)
Parsing as <li>s:
$ racket lists.rkt
(0 4)
(500 192)
(1000 324)
(1500 480)
(2000 640)
(2500 800)
(3000 960)
(3500 1124)
(4000 1328)
(4500 1436)
Might be good to add this as another test?
Resolved with https://github.com/stchang/parsack/issues/39 and https://github.com/greghendershott/markdown/pull/51. Huge thanks to @stchang!
I've been trying to improve the performance of the markdown parser through parsack improvements, but haven't had much luck so far. I think it's just that the bracketed style of the grammar leads to too much backtracking, which my parser doesn't handle well.

That did give me an idea, though: profile some examples (i.e., your tests) to see if I could improve the ordering of any of the <or> options, to cut down on the backtracking. For example, the following changes produce a decent speedup of perf-test.rkt (a rough sketch of the general idea follows the list):

- ... $normal-char.
- ... <or> in source-url.
- In $inline, drop $smart-punctuation down to above $code.
- In $inline, drop the 4+ ... down to above $footnote-ref.
Obvious caveats:

- I don't know how representative perf-test.rkt is.
- The order of the <or> choices matters when the choices are not disjoint, so I don't know if I've actually messed something up. But all the tests do pass.
- These changes slow down random-test.rkt. I guess this is unsurprising since random-test emphasizes special chars?

Some numbers:
Before: non-strict: 3200, strict: 3340
After: non-strict: 2598, strict: 2547
Have you played around with the <or> ordering any?