Closed Rohland closed 1 year ago
Ah, I love regexp performance :)... The main issue is the content selector. So the regexp can be sped up by simply forcing it to match content to start and end with non-whitespace character:
- var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
+ var m = string.match(/^(([ \t\r\n]*)(\s*))(?:(?=\S)[\s\S]*\S)?((\s*?)([ \t\r\n]*))$/)
Also maybe a little bit more readable would be do the matching separately (similar to the code in the PR):
function edgeWhitespace (string) {
var leadMatch = string.match(/^(([ \t\r\n]*)(\s*))/);
var textMatch = string.substring(leadMatch[0].length).match(/((?=\S)[\s\S]*\S)?/);
var tailMatch = string.substring(leadMatch[0].length + textMatch[0].length).match(/((\s*?)([ \t\r\n]*))$/);
return {
leading: leadMatch[1], // whole string for whitespace-only strings
leadingAscii: leadMatch[2],
leadingNonAscii: leadMatch[3],
trailing: tailMatch[1], // empty for whitespace-only strings
trailingNonAscii: tailMatch[2],
trailingAscii: tailMatch[3]
};
}
Both these solutions seem to have the same performance as the code proposed by the PR.
Thank you guys for your contribution. I've created issue #429 to reference the problem.
I've also created a simplified and hopefully more readable version:
function edgeWhitespace (string) {
var leadingMatch = string.match(/^([ \t\r\n]*)(\s*)/)
var tailMatch = string.match(/\S(([^\S \t\r\n]*)([ \t\r\n]*))$/) || ['', '', '', '']
return {
leading: leadingMatch[0], // whole string for whitespace-only strings
leadingAscii: leadingMatch[1],
leadingNonAscii: leadingMatch[2],
trailing: tailMatch[1], // empty for whitespace-only strings
trailingNonAscii: tailMatch[2],
trailingAscii: tailMatch[3]
}
}
Remarks:
[^\S \t\r\n]*
. This solution makes two regexp matchings sufficient.Feedback appreciated.
- The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression
[^\S \t\r\n]*
. This solution makes two regexp matchings sufficient.
OK, this one has a bug. Will probably use a variant of @pavelhoral's single regexp version, as it will most likely perform better on typical inputs (the function is called for each node - paragraphs, headings, list items, ...).
Awesome :) Just something to consider is the optimisation for whitespace inputs. We probably want to avoid running the trailing regex if the leading regex matches and the match is the same length as the input string. I don't believe there is value on progressing further and evaluating the trailing regex.
Released a fix of #429. Thank you for spotting the issue and all the thoughts on the fix.
This attempts to fix performance issues with large blocks of text in
<code>
blocks.Attempting to convert HTML to markdown on large code blocks would trigger 100% CPU. I believe it's a regex backtracking issue with the following function:
This PR refactors this function to avoid the non-greedy search for content in the middle.
I added a test to repro the issue, and prove the fix.