googlearchive / caja

Caja is a tool for safely embedding third party HTML, CSS and JavaScript in your website.
Apache License 2.0
1.13k stars 113 forks source link

csslexer lockup the browser #2037

Open renvieir opened 5 years ago

renvieir commented 5 years ago

The lexCss takes too much time to perform a regex match an thus is blocking the UI.

The code snippet above is from line 239 applied to a font-family property.

const regex = /\uFEFF|U[+][0-9A-F?]{1,6}(?:-[0-9A-F]{1,6})?|url[(][\t\n\f ]*(?:"(?:'|[^'"\n\f\\]|\\[\s\S])*"|'(?:"|[^'"\n\f\\]|\\[\s\S])*'|(?:[\t\x21\x23-\x26\x28-\x5b\x5d-\x7e]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))*)[\t\n\f ]*[)]|(?!url[(])-?(?:[a-zA-Z_]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))(?:[a-zA-Z0-9_-]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))*[(]|(?:@?-?(?:[a-zA-Z_]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))|#)(?:[a-zA-Z0-9_-]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))*|"(?:'|[^'"\n\f\\]|\\[\s\S])*"|'(?:"|[^'"\n\f\\]|\\[\s\S])*'|[-+]?(?:[0-9]+(?:[.][0-9]+)?|[.][0-9]+)(?:%|-?(?:[a-zA-Z_]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))(?:[a-zA-Z0-9_-]|[\u0080-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]|\\(?:[0-9a-fA-F]{1,6}[\t\n\f ]?|[\u0020-\u007e\u0080-\ud7ff\ue000\ufffd]|[\ud800-\udbff][\udc00-\udfff]))*)?|<!--|-->|[\t\n\f ]+|\/(?:[*][^*]*[*]+(?:[^/][^*]*[*]+)*\/|\/[^\n\f]*)|[~|^$*]=|[^"'\\/]|\/(?![/*])/gi;
const str = `font-family: \\00E5\\00BE\\00AE\\00E8\\00BD\\00AF\\00E9\\203A\\2026\\00E9\\00BB\\2018;`;
let m;

const startTime = performance.now();
str.match(regex);
const endTime = performance.now();
console.warn(`It took ${(endTime - startTime)/1000} seconds to execute`);

One single execution can take ~10 seconds

codeworrior commented 4 years ago

We got a customer incident for the same issue (in the context of UI5).

I analysed it and think it is an example of catastrophic backtracking. The regex in the csslexer is composed of several sub-expressions, one of them for FUNCTION:

var FUNCTION = (?!url[(])' + IDENT + '[(]';

For illustration purposes, I've expanded IDENT. Then FUNCTION looks like

var FUNCTION = (?!url[(])' + '-?' + NMSTART + NMCHAR + '*' + '[(]';

The combination of the repetition of NMCHAR* and the succeeding '(' makes this a perfect candidate for backtracking. Plus, NMSTART and NMCHAR both allow alternative interpretations of digits when they occur after a backslash (either as part of a UNICODE escape sequence or as an ASCII char). Together, this builds the ground for catastrophic backtracking with exponential runtime.

Luckily, Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead describes a solution to this known issue of regular expressions.

Applying the proposed pattern to the FUNCTION sub-expression seems to fix the performance problem:

var FUNCTION = (?!url[(])' + '(?=(' + IDENT + '))\\1' + '[(]';

Further testing is needed (e.g. whether the mandatory capturing group causes negative side-effects), and cross-browser support is a topic. The lookahead ?= as well as the back reference \1 might not be supported everywhere.