typelevel / Laika

Site and E-book Generator and Customizable Text Markup Transformer for sbt, Scala and Scala.js
https://typelevel.org/Laika/
Apache License 2.0
419 stars 46 forks source link

Slow Parsing of Long Markdown-HTML Tables With Asterisks #341

Closed valencik closed 1 year ago

valencik commented 1 year ago

We ran into a performance issue with parsing some html table data in a markdown file. I've tried to minimize the issue here, but unfortunately it is not very minimal afterall.

The following code will repeatedly call transformToUnresolvedAST (from Transformer.scala) with progressively more table rows from the raw input.

The output, on my machine, looks like:

tableGen(10) took 490 ms
tableGen(11) took 8 ms
tableGen(12) took 9 ms
tableGen(13) took 17 ms
tableGen(14) took 32 ms
tableGen(15) took 25 ms
tableGen(16) took 31 ms
tableGen(17) took 52 ms
tableGen(18) took 47 ms
tableGen(19) took 27 ms
tableGen(20) took 52 ms
tableGen(21) took 85 ms
tableGen(22) took 203 ms
tableGen(23) took 354 ms
tableGen(24) took 574 ms
tableGen(25) took 1332 ms
tableGen(26) took 3115 ms
tableGen(27) took 5011 ms

Which shows that by the time we get to all 27 rows parsing takes 5 seconds. If you remove asterisks from the data parsing goes back to taking about a millisecond, no matter how many rows. You can do this by changing tableGen(n, includeAsterisks=true) to tableGen(n, includeAsterisks=false) below.

//> using scala "2.13.10"
//> using lib "org.typelevel::cats-effect:3.4.2"
//> using lib "org.planet42::laika-core:0.19.0"

import cats.effect.{IO, IOApp}
import cats.syntax.all._
import laika.api.{MarkupParser, Renderer}
import laika.format.{AST, Markdown}
import laika.markdown.github.GitHubFlavor
import laika.parse.code.SyntaxHighlighting
import laika.parse.markup.DocumentParser.TransformationError

object LaikaParser {

  val parser = MarkupParser.of(Markdown).using(GitHubFlavor, SyntaxHighlighting).build

  val astRenderer = Renderer.of(AST).build

  def transformToUnresolvedAST(
      input: String
  ): Either[TransformationError, String] =
    parser
      .parseUnresolved(input)
      .flatMap(r => astRenderer.render(r.document.content))
}

object LaikaApp extends IOApp.Simple {

  import LaikaParser._

  val raw = """|<tr>
               |  <td scope="row">AED</td>
               |  <td scope="row">GTQ</td>
               |  <td scope="row">PEN</td>
               |</tr>
               |<tr>
               |  <td scope="row">AFN *</td>
               |  <td scope="row">GYD</td>
               |  <td scope="row">PGK</td>
               |</tr>
               |<tr>
               |  <td scope="row">ALL</td>
               |  <td scope="row">HKD</td>
               |  <td scope="row">PHP</td>
               |</tr>
               |<tr>
               |  <td scope="row">AMD *</td>
               |  <td scope="row">HNL</td>
               |  <td scope="row">PKR *</td>
               |</tr>
               |<tr>
               |  <td scope="row">ANG</td>
               |  <td scope="row">HRK</td>
               |  <td scope="row">PLN</td>
               |</tr>
               |<tr>
               |  <td scope="row">AOA</td>
               |  <td scope="row">HTG *</td>
               |  <td scope="row">PYG</td>
               |</tr>
               |<tr>
               |  <td scope="row">ARS *</td>
               |  <td scope="row">HUF</td>
               |  <td scope="row">QAR</td>
               |</tr>
               |<tr>
               |  <td scope="row">AUD</td>
               |  <td scope="row">IDR</td>
               |  <td scope="row">RON</td>
               |</tr>
               |<tr>
               |  <td scope="row">AWG</td>
               |  <td scope="row">ILS</td>
               |  <td scope="row">RSD</td>
               |</tr>
               |<tr>
               |  <td scope="row">AZN *</td>
               |  <td scope="row">INR *</td>
               |  <td scope="row">RUB</td>
               |</tr>
               |<tr>
               |  <td scope="row">BAM</td>
               |  <td scope="row">ISK</td>
               |  <td scope="row">RWF *</td>
               |</tr>
               |<tr>
               |  <td scope="row">BBD</td>
               |  <td scope="row">JMD</td>
               |  <td scope="row">SAR</td>
               |</tr>
               |<tr>
               |  <td scope="row">BDT *</td>
               |  <td scope="row">JPY</td>
               |  <td scope="row">SBD</td>
               |</tr>
               |<tr>
               |  <td scope="row">BGN</td>
               |  <td scope="row">KES</td>
               |  <td scope="row">SCR</td>
               |</tr>
               |<tr>
               |  <td scope="row">BIF</td>
               |  <td scope="row">KGS *</td>
               |  <td scope="row">SEK</td>
               |</tr>
               |<tr>
               |  <td scope="row">BMD</td>
               |  <td scope="row">KHR *</td>
               |  <td scope="row">SGD</td>
               |</tr>
               |<tr>
               |  <td scope="row">BND</td>
               |  <td scope="row">KMF</td>
               |  <td scope="row">SHP</td>
               |</tr>
               |<tr>
               |  <td scope="row">BOB</td>
               |  <td scope="row">KRW</td>
               |  <td scope="row">SLL</td>
               |</tr>
               |<tr>
               |  <td scope="row">BRL</td>
               |  <td scope="row">KYD</td>
               |  <td scope="row">SRD</td>
               |</tr>
               |<tr>
               |  <td scope="row">BSD</td>
               |  <td scope="row">KZT *</td>
               |  <td scope="row">STD *</td>
               |</tr>
               |<tr>
               |  <td scope="row">BWP *</td>
               |  <td scope="row">LAK</td>
               |  <td scope="row">SZL</td>
               |</tr>
               |<tr>
               |  <td scope="row">BZD *</td>
               |  <td scope="row">LBP *</td>
               |  <td scope="row">THB</td>
               |</tr>
               |<tr>
               |  <td scope="row">CAD</td>
               |  <td scope="row">LKR</td>
               |  <td scope="row">TJS</td>
               |</tr>
               |<tr>
               |  <td scope="row">CDF</td>
               |  <td scope="row">LRD *</td>
               |  <td scope="row">TOP</td>
               |</tr>
               |<tr>
               |  <td scope="row">CHF</td>
               |  <td scope="row">LSL *</td>
               |  <td scope="row">TRY *</td>
               |</tr>
               |<tr>
               |  <td scope="row">CLP</td>
               |  <td scope="row">MAD *</td>
               |  <td scope="row">TTD</td>
               |</tr>
               |<tr>
               |  <td scope="row">CNY</td>
               |  <td scope="row">MDL</td>
               |  <td scope="row">TWD</td>
               |</tr>
               |""".stripMargin

  def tableGen(n: Int, includeAsterisks: Boolean): String = {
    val pre =
      """|Here is a table
         |
         |<table>
         |<caption> this is a caption </caption>
         |<tbody>
         |<tr>
         |""".stripMargin
    val post =
      """|
         |</tr>
         |<table>
         |""".stripMargin

    val rows: List[String] = raw.split("\n").sliding(5, 5).map(ls => ls.mkString("\n")).toList
    val fullMd = rows.take(n).mkString(pre, "\n", post)
    if (includeAsterisks) fullMd else fullMd.replaceAll("\\*", "")
  }

  def doIt(n: Int): IO[Unit] = {
    val md   = tableGen(n, includeAsterisks=true)
    val time = IO.delay(transformToUnresolvedAST(md)).timed._1F
    time.flatMap(t => IO.println(s"tableGen($n) took ${t.toMillis} ms"))
  }
  // How many table rows in the input
  val inputs = List.range(10, 28)
  val run    = inputs.traverse_(doIt)
}
jenshalm commented 1 year ago

Oh, that looks like an odd issue and it's the first performance bug report in ages, too.

I think the example is minimal enough and a good starting point to investigate, so thank you very much for providing it.

I also already think I know what is going on (without actually looking at the implementation). The definition for inline-HTML is very flexible in Markdown, HTML tags can contain markup, and markup spans can contain HTML. For this reason the two parsers run in the same phase and in your example each lone asterisk will trigger the search for an emphasized span, but the way it closes in a different cell then means the HTML tag itself never closes (and it will verify that to the very end of the input).

I only explain this in more detail to clarify that this might be open for a bit as it's not a simple performance bug, but something that might require to implement Markdown HTML parsing in a two-phase model which is difficult for the reasons explained above (as the nesting can also be the other way round).

From a purely pragmatic standpoint, I assume you are unable to use GitHub Flavour table syntax for your use case? Because that does not come with any of that complexity as it's proper 2-phase parsing already and the speed should be fine.

valencik commented 1 year ago

That's great context. :)

And yeah I think we have a couple options available to us at work, so I'm not too worried about this being fixed soon.

jenshalm commented 1 year ago

I finally found some time to properly investigate this issue and I'm inclined to close it as "not a bug" as there are several workarounds to avoid the problem and in cases none of the workarounds are applied, it is a genuinely unusual and difficult to parse block where a large number on unclosed spans will be unsuccessfully attempted before falling back to basic text input.

There are also a few issues with the test setup that I initially did not spot when I skimmed over the ticket and there are also subtle oddities in how embedded HTML in Markdown is defined in the original spec, so I'll go over all the aspects one by one.

  1. Independent of the discussion below and as previously mentioned, my no.1 recommendation usually is to avoid embedded HTML in Markdown. For tables there is the table syntax as defined in GitHub Flavour and for other scenarios either directives or renderer overrides might give you the flexibility you need. Embedded HTML in Markdown is usually hard to read and ties the input to an output format. If you really need to use embedded HTML see all the following points.

  2. Embedded HTML is turned off by default in Laika for security reasons and for consistency. In your test setup above you would need to use MarkupParser.of(Markdown).withRawContent.build. Otherwise the result is just a regular Paragraph node. When in doubt it always helps to print out the formatted AST and check it is what you expected.

  3. The Markdown spec makes a distinction between HTML blocks and inline HTML. In case of an HTML block the entire block is a valid HTML element with start and end tag and raw HTML in between. In this case Markdown does not allow embedded markup, it is interpreted solely as raw HTML. In case of inline HTML, some elements within a normal text paragraph, it allows embedded markup in which case the lone asterisks start to kick in and degrade performance. Your use case should not suffer from this as it is intended to be an HTML block, however, it is not well-formed and as such not recognised correctly and parsed as the other option, a mix of interspersed HTML and markup. If you fix the syntax the performance issue goes away (see corrected source below).

  4. Finally, in other cases where you actually need HTML and markup interspersed, you can prevent the excessive backtracking by escaping the asterisks: fullMd.replaceAll("\\*", "\\\\*")

In summary, applying the changes described in 2., together with either 3. or 4. will fix the issue, and the cases where none of the fixes are applied look like an exotic edge cases that should not occur in practice and might be difficult or even infeasible to fix as an unescaped asterisk has to be tried as an emphasised span first.

Corrected setup:

def tableGen(n: Int, includeAsterisks: Boolean): String = {
  val pre  =
      """|Here is a table
         |
         |<table>
         |<caption> this is a caption </caption>
         |<tbody>
         |""".stripMargin
  val post =
      """|
         |</tbody>
         |</table>
         |""".stripMargin
jenshalm commented 1 year ago

I'm closing this now as an unfixable edge case where the quadratic complexity is inherit in the markup, caused by a false positive markup character (an asterisk that is meant to be just a literal) in an overlong paragraph.

In practice this scenario should be very rare (in particular overlong paragraphs are not common). The workaround in such a case would be to escape the asterisk, or in this specific example above, to ensure the block is properly recognised as inline HTML which will ignore any markup characters.

Feel free to reopen this if you think there is still something that could be improved on the library side.