Closed valencik closed 1 year ago
Oh, that looks like an odd issue and it's the first performance bug report in ages, too.
I think the example is minimal enough and a good starting point to investigate, so thank you very much for providing it.
I also already think I know what is going on (without actually looking at the implementation). The definition of inline HTML is very flexible in Markdown: HTML tags can contain markup, and markup spans can contain HTML. For this reason the two parsers run in the same phase. In your example each lone asterisk triggers the search for an emphasized span, but because that span closes in a different cell, the HTML tag itself never closes (and the parser will verify that to the very end of the input).
I only explain this in more detail to clarify that this might stay open for a bit, as it's not a simple performance bug but something that might require implementing Markdown HTML parsing in a two-phase model, which is difficult for the reasons explained above (as the nesting can also be the other way round).
From a purely pragmatic standpoint, I assume you are unable to use the GitHub Flavour table syntax for your use case? It does not come with any of that complexity, as it already uses proper two-phase parsing, and the speed should be fine.
That's great context. :)
And yeah I think we have a couple options available to us at work, so I'm not too worried about this being fixed soon.
I finally found some time to properly investigate this issue and I'm inclined to close it as "not a bug". There are several workarounds that avoid the problem, and in cases where none of the workarounds are applied, it is a genuinely unusual and difficult-to-parse block where a large number of unclosed spans will be unsuccessfully attempted before falling back to basic text input.
There are also a few issues with the test setup that I initially did not spot when I skimmed over the ticket and there are also subtle oddities in how embedded HTML in Markdown is defined in the original spec, so I'll go over all the aspects one by one.
1. Independent of the discussion below and as previously mentioned, my no.1 recommendation usually is to avoid embedded HTML in Markdown. For tables there is the table syntax as defined in GitHub Flavour, and for other scenarios either directives or renderer overrides might give you the flexibility you need. Embedded HTML in Markdown is usually hard to read and ties the input to an output format. If you really need to use embedded HTML, see all the following points.
2. Embedded HTML is turned off by default in Laika for security reasons and for consistency. In your test setup above you would need to use MarkupParser.of(Markdown).withRawContent.build. Otherwise the result is just a regular Paragraph node. When in doubt it always helps to print out the formatted AST and check it is what you expected.
3. The Markdown spec makes a distinction between HTML blocks and inline HTML. An HTML block is a block that consists entirely of a valid HTML element, with start and end tag and raw HTML in between. In this case Markdown does not allow embedded markup; the content is interpreted solely as raw HTML. Inline HTML, meaning some elements within a normal text paragraph, does allow embedded markup, and this is where the lone asterisks start to kick in and degrade performance. Your use case should not suffer from this, as it is intended to be an HTML block. However, it is not well-formed and as such is not recognised correctly; it is parsed as the other option, a mix of interspersed HTML and markup. If you fix the syntax the performance issue goes away (see corrected source below).
4. Finally, in other cases where you actually need HTML and markup interspersed, you can prevent the excessive backtracking by escaping the asterisks: fullMd.replaceAll("\\*", "\\\\*")
In summary, applying the changes described in 2., together with either 3. or 4., will fix the issue. The cases where none of the fixes are applied look like exotic edge cases that should not occur in practice and might be difficult or even infeasible to fix, as an unescaped asterisk has to be tried as an emphasised span first.
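As an illustration of the escaping workaround in 4., here is a minimal sketch that uses only the Scala standard library. It is a variation on the one-liner above, not the same call: it leaves asterisks that are already preceded by a backslash untouched, so running it twice does not double-escape. The object name, the method name and the sample strings are made up for this example.

```scala
object EscapeAsterisks {
  // Escapes every asterisk that is not directly preceded by a backslash.
  // Regex:       (?<!\\)\*  -> an asterisk without a backslash right before it
  // Replacement: \\*        -> a literal backslash followed by the asterisk
  def escape(md: String): String =
    md.replaceAll("(?<!\\\\)\\*", "\\\\*")

  def main(args: Array[String]): Unit = {
    // Hypothetical input: both asterisks are meant as literals.
    val cell = "<td>2 * 3 * 4</td>"
    println(escape(cell))
    // Already escaped input is returned unchanged.
    println(escape("""<td>2 \* 3</td>"""))
  }
}
```

Note that this also escapes asterisks that were meant as real emphasis, so it only fits input where every asterisk is a literal. The lookbehind is deliberately simple and does not treat an escaped backslash in front of an asterisk as a special case. The escaped string would then be passed to the parser built with withRawContent as described in 2.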
Corrected setup:
def tableGen(n: Int, includeAsterisks: Boolean): String = {
  val pre =
    """|Here is a table
       |
       |<table>
       |<caption> this is a caption </caption>
       |<tbody>
       |""".stripMargin
  val post =
    """|
       |</tbody>
       |</table>
       |""".stripMargin
I'm closing this now as an unfixable edge case where the quadratic complexity is inherent in the markup, caused by a false positive markup character (an asterisk that is meant to be just a literal) in an overlong paragraph.
In practice this scenario should be very rare (in particular, overlong paragraphs are not common). The workaround in such a case would be to escape the asterisk or, in the specific example above, to ensure the block is properly recognised as an HTML block, which will ignore any markup characters.
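To make the quadratic growth more concrete, here is a toy model in plain Scala. It is not Laika's implementation; it only assumes, as described above, that every literal asterisk starts a search for a closing delimiter that is never found and therefore runs to the end of the paragraph. The row content and all names are hypothetical.

```scala
object BacktrackingCost {
  // Toy model: each '*' triggers a scan from the character after it
  // to the end of the paragraph. Returns the total number of characters visited.
  def charsScanned(paragraph: String): Long =
    paragraph.indices
      .filter(i => paragraph.charAt(i) == '*')
      .map(i => (paragraph.length - i - 1).toLong)
      .sum

  def main(args: Array[String]): Unit = {
    // Hypothetical table row containing one literal asterisk.
    val row = "<tr><td>* item</td></tr>\n"
    for (n <- Seq(1, 2, 4, 8, 16)) {
      val scanned = charsScanned(row * n)
      println(s"$n rows -> $scanned characters scanned")
    }
  }
}
```

In this model the number of asterisks and the length of each failed scan both grow with the number of rows, so the total work grows roughly with the square of the paragraph length. Removing or escaping the asterisks brings the count to zero.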
Feel free to reopen this if you think there is still something that could be improved on the library side.
We ran into a performance issue with parsing some HTML table data in a Markdown file. I've tried to minimize the issue here, but unfortunately it is not very minimal after all.
The following code will repeatedly call transformToUnresolvedAST (from Transformer.scala) with progressively more table rows from the raw input.
The output, on my machine, looks like:
This shows that by the time we get to all 27 rows, parsing takes 5 seconds. If you remove the asterisks from the data, parsing goes back to taking about a millisecond, no matter how many rows. You can do this by changing tableGen(n, includeAsterisks=true) to tableGen(n, includeAsterisks=false) below.