highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License
23.74k stars 3.6k forks source link

Syntax error in regular expression (in v11.3.1) #3387

Closed pkaminski closed 3 years ago

pkaminski commented 3 years ago

Describe the issue/behavior that seems buggy Since deploying v11.3.1 (we skipped v11.3.0) in production we've been getting regular reports of our workers failing to load with the error "Syntax error in regular expression", in apparently modern browsers.

Sample Code or Instructions to Reproduce You can try visiting https://reviewable.io to see if this error will reproduce for you but it appears to be quite rare. If it breaks it should do so within seconds of loading the page -- no user interaction needed.

Expected behavior HighlightJS loads without errors for all users of modern browsers.

Additional context The most likely culprit would appear to be the newly added use of Unicode property escapes in the Python grammar, which is included in our custom HighlightJS build. However, as you noted in another issue, this feature should be very well supported in modern browsers.

Strangely, Sentry identifies the browser in all these events as Firefox 78.0 on Windows 10. However, CanIUse claims that Unicode property escapes are supported in Firefox 78, and I installed this exact Firefox version on Windows yet was unable to reproduce the error. I guess it's possible that these users are spoofing their user agent but why would they all settle on FF78/Win10?

Sentry also reports very little overlap in IP addresses affected by this issue: out of 135 events so far, only 3 IP addresses have been repeated (twice each). This seems odd since if Reviewable crashes on load you'd expect the user to immediately attempt to reload the page, and fail again as a script parsing error should be quite consistent. Also, no user has complained about the issue yet, which you would expect by now if over 100 (many paying) were affected.

At this point I'm stumped and just fishing for ideas as to what might be going wrong. Is there a way to confirm that the proximate cause is the Unicode regex / escapes? Has anyone else observed similar issues with the new version? Thanks.

joshgoebel commented 3 years ago

Gonna need more detail. All tests are green.

pkaminski commented 3 years ago

Yeah, I understand that this should work and all the tests I can think of running also pass. Yet it would appear that a non-trivial number of users are seeing crashes. Hence my reaching out to see if 1) you've observed this before / elsewhere, and 2) you have any ideas on how to get more useful data.

joshgoebel commented 3 years ago

Did you add all this detail later? On GitHub on my iPhone I saw almost nothing earlier when i asked for more info. :-)

CanIUse claims that Unicode property escapes are supported in Firefox 78

I wonder if the feature is supported but not the syntax?

and I installed this exact Firefox version on Windows yet was unable to reproduce the error.

I assume when you say "reproduce" here you mean visit your own website and see if it breaks?

Is there a way to confirm that the proximate cause is the Unicode regex / escapes?

Build a version of Highlight.js without Python support. (though this is soon to change, but should be good for now I think)

Has anyone else observed similar issues with the new version?

Not that I'm aware of.

pkaminski commented 3 years ago

Did you add all this detail later? On GitHub on my iPhone I saw almost nothing earlier when i asked for more info. :-)

Nope, no edits. Blame GitHub!

I wonder if the feature is supported but not the syntax?

I can successfully evaluate the actual regex expression (/[\p{XID_Start}_]\p{XID_Continue}*/u) in the console in FF78 on Win10, so I don't think that's it. Unless it has different parsing modes and selects the wrong (legacy?) one for the worker source code sometimes?

I assume when you say "reproduce" here you mean visit your own website and see if it breaks?

Yep, tried in both prod and in a dev environment.

Are you loading all the grammars at once or one at a time via some sort of load on demand system?

We have a custom fat build of HighlightJS where we load everything at once (well, almost, terraform support comes in as a separate file):

node tools/build.js actionscript bash c cpp capnproto clojure cmake coffeescript cpp csharp css dart dockerfile elixir elm erlang fsharp go gradle groovy haskell handlebars ini java javascript json julia kotlin latex lisp lua makefile markdown mathematica objectivec ocaml perl php powershell properties protobuf python r ruby rust scala scheme scss sql stylus swift tcl twig typescript vbnet vbscript xml yaml

Build a version of Highlight.js without Python support. (though this is soon to change, but should be good for now I think)

Yeah, problem is that Python is one of our most popular languages, so pushing a version to prod without Python support would be a hard sell. Especially as we'd need to leave it there for an hour or two at least to make sure it's not breaking, as the errors are quite intermittent. I could instead revert to a pre-11.3 build for a bit but would that be too coarse?

joshgoebel commented 3 years ago

I could instead revert to a pre-11.3 build for a bit but would that be too coarse?

I don't know what that would prove (or how it would be helpful)... unless your'e not certain the errors are related to 11.3...

joshgoebel commented 3 years ago

I can successfully evaluate the actual regex expression (/[\p{XIDStart}]\p{XID_Continue}*/u) in the console in FF78 on Win10, so I don't think that's it.

Could be something entirely different of course (we could be barking up entirely the wrong tree), but you'd really need a line number or some other debugging to confirm...

pkaminski commented 3 years ago

Could be something entirely different of course (we could be barking up entirely the wrong tree), but you'd really need a line number or some other debugging to confirm...

Sentry points to this area in the minified code so I think it's a fair bet, and it's the only regex-related change I can think of in (our) release when we started seeing the errors:

{snip} tion(a){var b=a.regex,c=/[\p{XID_Start}_]\p{XID_Continue}*/u,d={$pattern:/[A-Za-z]\w+|__\w+__/,keyword:["and","as","assert","async","await", {snip}

I might try to delay capturing the error until we've resumed the user's session so that I can get some usernames to follow up with.

joshgoebel commented 3 years ago

I think it's a fair bet

Yeah, just covering all the bases... perhaps Firefox 78 shipped different versions though or we're talking beta vs release? Since Firefox 78 is the first supported version easy to believe there are "edge cases" or that the support wasn't "fully baked" somehow...

https://wiki.mozilla.org/Releases/Firefox_78

In any case from our POV if there was some weird issue with Firefox 78 I'd simply say we don't support it and those users need to update... for browsers that support auto-updating generally we expect MOST users to be on the last few releases... the answer to someone 12 or 13 releases behind having issues is definitely: "please upgrade". Plus there are no doubt tons of security issues and reasons to upgrade from a browser that old.

Let us know if you learn more.

pkaminski commented 3 years ago

Yep, I'm with you in general. The only reason I'm even considering supporting FF78 is because it was the 2020 ESR release and the 2021 ESR (FF91) is only ~3 months old still.

I'll see what else I can uncover and follow up here, but this is almost certainly an invalid / won't fix resolution from your side.

joshgoebel commented 3 years ago

it was the 2020 ESR release and the 2021 ESR (FF91) is only ~3 months old still.

Ah, I think I was less aware of these ESR releases, that's interesting to know, but correct, probably a wont fix if this boils down to just the /u flag... right now someone could always use an older 11 release or even the latest release with an older python grammar (that doesn't use /u flag or features)... of course long-term we'll see more UTF-8 regex naturally...

pkaminski commented 3 years ago

After further investigation I think this issue is caused by a stealthed crawler using a broken client and can be safely ignored. The evidence:

  1. Every user agent claims to be FF78/Win10 but the error doesn't reproduce there.
  2. Every event also has a ReferenceError: 'TextEncoder' is undefined error at the beginning of the breadcrumbs. Firefox supported this since v20.
  3. A large sample of IPs all trace back to hosting / infrastructure companies, and none to residential blocks AFAICT.
  4. No client is signed in, and no actual user has complained.
  5. Some of the requests are for a broken URL that indicates somebody parsed a Markdown link incorrectly.

Hopefully recording this here will help someone in the future!

joshgoebel commented 3 years ago

@pkaminski Awesome! Glad its' just some stupid bots and none of your real users!

firasdib commented 2 years ago

This is happening to some of my users running older browsers. The regex that causes this is /[\p{XID_Start}_]\p{XID_Continue}*/u in the Python flavor. Could this regex be replaced with one that is more backwards compatible?

joshgoebel commented 2 years ago

No, sorry. It's supported by all green-field browsers and has been for quite some time:

https://caniuse.com/?search=unicode%20regex

We did wait quite some time before starting to use this feature, but there will always be people with old browsers - they need to upgrade.

joshgoebel commented 2 years ago

Also, it's not just Python, see #2756. It will be more and more grammars as people find the time to patch them to better support UTF-8.

firasdib commented 2 years ago

Understood, thank you for the response. I agree that people should keep their browsers up to date, but June 2020 (Firefox) isn't necessary that old.

joshgoebel commented 2 years ago

Ah, grrr... \p is newer indeed, your are correct. I didn't realize this at the time. Yet Firefox does monthly releases, they are free to download, and most system can and should auto-update, so that is still 18 releases behind at this point, which is forever. Given that this already has been merged and I don't think we want to revert it to support users who are 18 releases behind the latest release.

joshgoebel commented 2 years ago

CC @highlightjs/core Any differing opinions?

I'm also taking as a data point that only 2 people have filed any issues about this...

firasdib commented 2 years ago

@joshgoebel In the past 4 weeks, I've only had ~20 people (that I know of) encounter this issue on my website. So it's not a huge amount affected, but still saddens me that they are unable to utilize the website due to this error.

joshgoebel commented 2 years ago

@pkaminski Could you confirm you're talking about a parse time error vs runtime error, yes?

allejo commented 2 years ago

In the past 4 weeks, I've only had ~20 people (that I know of) encounter this issue on my website.

Do you have the user agent strings for these users? Are they all using FF78 as well?

I'm having an annoyingly hard time finding an official Mozilla doc specifying the EOL for FF ESR 78 but according to endoflife.date/firefox, the EOL is 2021-11-02; it's now officially unsupported by Mozilla. I don't want to increase the maintenance load of unsupported browsers on Josh, who's effectively the sole maintainer of the core library.

+1 for keeping Unicode support with \p. Reverting this would be a step backward for the library IMO.

still saddens me that they are unable to utilize the website due to this error.

But... This sounds like a fatal error and causes the library to not work once a browser hits this. Two options:

  1. Have the website owner revert bb4fb2291bcd7323e0e577b9abc00e89f2e2cf14 locally and compile your own bundle using an older version of Python.
  2. Hypothetically, a patch could do something along the lines of converting the \p usage to a string so that it becomes a runtime error and not a syntax error. Then have a try/catch error somewhere to swallow this runtime error. But this is definitely not something I would want to be committed into core, website owners would have to patch their libraries themselves. This is just a fix so that the library loads but highlighting will still likely break.
joshgoebel commented 2 years ago

Or just patch the IDENT_RE constant with the one that was used before. UNDERSCORE_IDENT_RE from modes.js. A one line patch. Of course as we add more \p in the future (which we plan to) you'd need to deal with those as they come up, etc...


the EOL is 2021-11-02; it's now officially unsupported by Mozilla.

I do think there is something to be said for not encouraging the use of outdated browsers - which quite likely have security vulnerabilities and possibly causing more harm than good in the world. If someone is in the enterprise or has a business use case for supporting them, then they should bear the burden of that ongoing support cost, not the core library/team. Again, I say that with context: Firefox is free to update and can auto-update. This is a problem the user can easily solve themselves if they see their fav website breaking.

For most personal sites I think it's fine to simply not support this tiny (and shrinking) group of users.

firasdib commented 2 years ago

Do you have the user agent strings for these users? Are they all using FF78 as well?

Mix of FF-versions prior to 78 and "Pale Moon" (whatever that is).

But... This sounds like a fatal error and causes the library to not work once a browser hits this. Two options:

This is a parse time error. As soon as the JS engine interprets the regex, it throws an error.

1. Have the website owner revert [bb4fb22](https://github.com/highlightjs/highlight.js/commit/bb4fb2291bcd7323e0e577b9abc00e89f2e2cf14) locally and compile your own bundle using an older version of Python.

Sounds reasonable.

2. Hypothetically, a patch could do something along the lines of converting the `\p` usage to a string so that it becomes a runtime error and not a syntax error. Then have a try/catch error somewhere to swallow this runtime error. But this is definitely not something I would want to be committed into core, website owners would have to patch their libraries themselves. This is just a fix so that the library loads but highlighting will still likely break.

Could be worth protecting these more esoteric expressions in some sort of safe guard, to avoid these types of errors.

But to be clear, you don't necessarily have to do anything about it. For me, regex101 has thousands of visitors every day, and I've only racked up a handful error reports over the past couple of weeks. If there is no easy/simple fix that is not counterproductive to the library/devs, I would urge you to let it be.

pkaminski commented 2 years ago

@pkaminski Could you confirm you're talking about a parse time error vs runtime error, yes?

Yes, this was a parse time error for me.

joshgoebel commented 2 years ago

@pkaminski Curious, what steps - if any - did you take (on your own) to resolve, or are you just telling those users to upgrade?

pkaminski commented 2 years ago

As reported above, I believe all (or nearly all) of the errors were from bots, hence no action was needed on my part. I am seeing occasional errors stemming from old mobile browsers, etc., but I ignore those as 1) mobile is not a primary target for my app and 2) my target audience of developers ought to know that using old browsers will cause sites to break and they should upgrade.

joshgoebel commented 2 years ago

Could be worth protecting these more esoteric expressions in some sort of safe guard,

I gave it a moments thought, but that leads to a rabbit hole of other things to consider - and all to support a tiny shrinking % who really should just upgrade. Other major browsers seems to have supported this since 2018 (early 2020 for Edge). Just seems Firefox was VERY late to the party on this one.

xianshenglu commented 1 year ago

Hi, just share my information which might help someone.

I'm using Angular@15 and highlight.js@11.7.0. I met this

"Invalid regular expression: //[A-Z_a-z\xAA\xB5\xBA\xC0-\x

because of highlighting Python code and updating from Angular@14.

My solution is to change .browserslistrc from

chrome > 60

to

chrome > 63

That's done!

Another reason is the usage of this code in python.js

const IDENT_RE = /[\p{XID_Start}_]\p{XID_Continue}*/u;

It will not work with chrome > 62 in .browserslistrc with Angular > 14.