mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.52k stars 864 forks source link

Fixes regex backtracking bug detecting leading/trailing whitespace #427

Closed Rohland closed 1 year ago

Rohland commented 1 year ago

This attempts to fix performance issues with large blocks of text in <code> blocks.

Attempting to convert HTML to markdown on large code blocks would trigger 100% CPU. I believe it's a regex backtracking issue with the following function:

function edgeWhitespace (string) {
  var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/);
    return {
      leading: m[1], // whole string for whitespace-only strings
      leadingAscii: m[2],
      leadingNonAscii: m[3],
      trailing: m[4], // empty for whitespace-only strings
      trailingNonAscii: m[5],
      trailingAscii: m[6]
    }
}

This PR refactors this function to avoid the non-greedy search for content in the middle.

I added a test to repro the issue, and prove the fix.

pavelhoral commented 1 year ago

Ah, I love regexp performance :)... The main issue is the content selector. So the regexp can be sped up by simply forcing it to match content to start and end with non-whitespace character:

- var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
+ var m = string.match(/^(([ \t\r\n]*)(\s*))(?:(?=\S)[\s\S]*\S)?((\s*?)([ \t\r\n]*))$/)

Also maybe a little bit more readable would be do the matching separately (similar to the code in the PR):

function edgeWhitespace (string) {
    var leadMatch = string.match(/^(([ \t\r\n]*)(\s*))/);
    var textMatch = string.substring(leadMatch[0].length).match(/((?=\S)[\s\S]*\S)?/); 
    var tailMatch = string.substring(leadMatch[0].length + textMatch[0].length).match(/((\s*?)([ \t\r\n]*))$/);
    return {
        leading: leadMatch[1], // whole string for whitespace-only strings
        leadingAscii: leadMatch[2],
        leadingNonAscii: leadMatch[3],
        trailing: tailMatch[1], // empty for whitespace-only strings
        trailingNonAscii: tailMatch[2],
        trailingAscii: tailMatch[3]
    };
}

Both these solutions seem to have the same performance as the code proposed by the PR.

martincizek commented 1 year ago

Thank you guys for your contribution. I've created issue #429 to reference the problem.

I've also created a simplified and hopefully more readable version:

function edgeWhitespace (string) {
  var leadingMatch = string.match(/^([ \t\r\n]*)(\s*)/)
  var tailMatch = string.match(/\S(([^\S \t\r\n]*)([ \t\r\n]*))$/) || ['', '', '', '']

  return {
    leading: leadingMatch[0], // whole string for whitespace-only strings
    leadingAscii: leadingMatch[1],
    leadingNonAscii: leadingMatch[2],
    trailing: tailMatch[1], // empty for whitespace-only strings
    trailingNonAscii: tailMatch[2],
    trailingAscii: tailMatch[3]
  }
}

Remarks:

Feedback appreciated.

martincizek commented 1 year ago
  • The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression [^\S \t\r\n]*. This solution makes two regexp matchings sufficient.

OK, this one has a bug. Will probably use a variant of @pavelhoral's single regexp version, as it will most likely perform better on typical inputs (the function is called for each node - paragraphs, headings, list items, ...).

Rohland commented 1 year ago

Awesome :) Just something to consider is the optimisation for whitespace inputs. We probably want to avoid running the trailing regex if the leading regex matches and the match is the same length as the input string. I don't believe there is value on progressing further and evaluating the trailing regex.

martincizek commented 1 year ago

Released a fix of #429. Thank you for spotting the issue and all the thoughts on the fix.