Closed ZainChen closed 4 years ago
Hello, the problem is solved; it was caused by the wrong character encoding.
In the highlight.pack.js file, the following code does not recognize: [{b:/[a-zA-Zа-яА-я]+[*]?/}
Need to add <meta charset="UTF-8">
Looks like we need to fix the regex to use explicit Unicode code points. Unfortunately, meta charset is an HTML-only thing, not a JavaScript thing, but you are right that it's an encoding problem.
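For illustration, here is a minimal sketch of what "explicit Unicode code points" could look like for the pattern quoted above. It is not necessarily the change that was made in highlight.js; the sample strings and variable names are assumptions chosen for the example. The escaped form contains only ASCII bytes, so it reads the same no matter which charset the file is decoded with.

```javascript
// Pattern as quoted in this thread, with literal Cyrillic characters.
// It only means what the author intended if the file is decoded as UTF-8.
const literal = /[a-zA-Zа-яА-я]+[*]?/;

// Same ranges written as code points: а = U+0430, я = U+044F, А = U+0410.
const escaped = /[a-zA-Z\u0430-\u044F\u0410-\u044F]+[*]?/;

// Hypothetical sample inputs, used only to compare the two patterns.
const samples = ["Процедура", "function*", "abc", "123"];

// For every sample, both patterns should produce the same match (or no match).
const sameResult = samples.map((s) => {
  const a = literal.exec(s);
  const b = escaped.exec(s);
  return (a && a[0]) === (b && b[0]);
});

console.log(sameResult);
```

The trade-off is readability: the escaped ranges are harder to review than the literal letters, so a short comment naming the characters helps.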
I am very happy to have made this accidental discovery. I use highlight.js and like it very much; it is very easy to use. Great work! Here is my result: https://www.cnblogs.com/chenzhiyin/p/zain3.html
Possibly helpful:
% find src | xargs file | grep UTF-8
src/languages/routeros.js: UTF-8 Unicode text, with very long lines
src/languages/cos.js: UTF-8 Unicode text
src/languages/apache.js: UTF-8 Unicode text
src/languages/avrasm.js: UTF-8 Unicode text
src/languages/fsharp.js: UTF-8 Unicode text
src/languages/capnproto.js: UTF-8 Unicode text
src/languages/objectivec.js: UTF-8 Unicode text
src/languages/1c.js: UTF-8 Unicode text
src/languages/nix.js: UTF-8 Unicode text
src/languages/bnf.js: UTF-8 Unicode text
src/languages/julia.js: UTF-8 Unicode text
src/languages/jboss-cli.js: UTF-8 Unicode text
src/languages/n1ql.js: UTF-8 Unicode text
src/languages/less.js: Algol 68 source text, UTF-8 Unicode text
src/languages/arduino.js: Algol 68 source text, UTF-8 Unicode text
src/languages/basic.js: UTF-8 Unicode text
src/languages/css.js: UTF-8 Unicode text
src/languages/sqf.js: UTF-8 Unicode text
src/languages/isbl.js: UTF-8 Unicode text
src/languages/coffeescript.js: UTF-8 Unicode text
src/languages/makefile.js: UTF-8 Unicode text
src/languages/python.js: UTF-8 Unicode text
src/languages/dockerfile.js: UTF-8 Unicode text
Thought this may have false positives since it's going to pick up things like author names:
Author: Raphaël Assénat raph@raphnet.net
Really I think we only need to fix the code... since if we're packaging this up for packed or CDN builds, comments should really be removed anyway.
1c.... oh boy. Maybe there is some sort of tool that could do this for us as part of the build pipeline? Does sound kind of like a computer type of problem.
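As a rough idea of what such a build step could do, here is a minimal sketch that rewrites every non-ASCII UTF-16 code unit in a source string as a `\uXXXX` escape. The function name `escapeNonAscii` and the sample input are hypothetical and not part of the highlight.js build. It assumes the non-ASCII characters sit in string literals, regex literals, identifiers, or comments, where such escapes are valid or harmless; it would change the meaning of raw template literals (`String.raw`), so a real pipeline would more likely rely on a minifier option.

```javascript
// Hypothetical helper: replace each non-ASCII UTF-16 code unit with \uXXXX.
function escapeNonAscii(source) {
  let out = "";
  for (let i = 0; i < source.length; i++) {
    const unit = source.charCodeAt(i);
    out += unit < 0x80
      ? source[i]
      : "\\u" + unit.toString(16).padStart(4, "0");
  }
  return out;
}

// Sample input: a regex with Cyrillic ranges plus a comment with an accented name.
const input = "const re = /[a-zA-Zа-яА-я]+[*]?/; // Author: Raphaël";
const output = escapeNonAscii(input);

console.log(output);
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is still valid in string literals and in regexes without the `u` flag.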
Is anyone actually still having a problem with this issue?
@ZainChen
Closing this because it no longer seems to be an issue.
FYI, I just ran into this issue with the latest version of highlight.js. Fixed it in webpack based on the following article: https://medium.com/@thinkpanda_75045/webpack-with-unicode-7a2c5eb22afd
Ran into it in what context?
I think for a long time our answer has been: "serve it with the proper headers and it just works". Are you having a problem in the usage, or the build pipeline itself? Are you using the npm library?
I installed highlight.js v10.2.1 via npm and then bundled it with the rest of my JS using webpack. The build pipeline worked fine. However, when I then ran my app, I got the following error in my browser:
Uncaught SyntaxError: Invalid regular expression: Range out of order in character class
The browser in this case is an iOS web view (WKWebView) where I serve the JS bundle from a local file. So I don't think I have an opportunity to set a UTF-8 content header anywhere.
My fix was to customize the Terser minifier I use in my build pipeline to output ascii only, ensuring proper encoding. This successfully removed the error at runtime.
> So I don't think I have an opportunity to set a UTF-8 content header anywhere.
In the HTML you're generating for your webview, no? If the BOM field is present at the beginning of the file, does that work - any idea?
> My fix was to customize the Terser minifier I use in my build pipeline to output ascii only
We use terser for our web packaging... if it's just a simple flag we might consider just building our web distributable with that enabled... but that wouldn't help you since you're using NPM... I suppose we could perhaps also terse those (just without all the compression stuff, but I have no experience using terser in this fashion).
I'm also curious what others are doing that this can go 2 years without anyone bumping into it as an issue...
So I attempted to set the charset in the webview html file head like the following, but it didn't do anything:
<head>
<meta charset="UTF-8">
</head>
But is there a way to set a charset on the actual script include? I'm currently using the following:
<script type="text/javascript" src="bundle.js"></script>
As for the Terser plugin, it is just a simple flag. I just added the following to my webpack.prod.config:
optimization: {
  minimize: true,
  minimizer: [
    new TerserPlugin({
      terserOptions: { output: { ascii_only: true } }
    })
  ]
},
I'm not sure about whether you can use Terser for re-encoding without the minification.
It's mentioned briefly in the README:
<script charset="UTF-8"
> As for the Terser plugin, it is just a simple flag.
I just tried it. It increased the gzipped size of our default build by 50%... I'm assuming all that encoded UTF-8 does not compress very well (though I'm not even sure which file is to blame, it does seem high). So I'm not feeling it unless someone wants to do more research. I'm not sure there is anything to fix regarding the NPM modules though (perhaps a doc fix, but I'm not sure what)... UTF-8 is plenty valid inside JS... so the solution is to specify the charset, or if you have a really strange env that couldn't handle UTF-8 at all, I suppose something like what you're doing now with ascii_only... that shouldn't be necessary though if you add the charset directive.
terser: {
  "format": {
    "ascii_only": true,
  },
  "compress": {
    passes: 2,
    unsafe: true,
    warnings: true,
    dead_code: true,
    toplevel: "funcs"
  }
}
Ok, for some reason ascii_only entirely turns off compression, no idea what's up with that...
Sorry, I must have missed that in the docs. Adding <script charset="UTF-8" fixed it for me so I can remove the TerserPlugin ascii_only output!
I agree, looks like there is nothing to do then. Appreciate the quick feedback!
Ha, needed to bump our version... looks like we'll be switching to ascii_only for our browser/CDN builds if this causes no issues. Thanks for bringing this to my attention! :)
Hello, I downloaded it from https://highlightjs.org/download/, and Google Chrome reports an error:
Uncaught SyntaxError: Uncaught SyntaxError: Invalid regular expression: /[a-zA-Z邪-褟袗-褟]+[*]?/: Range out of order in character class
at highlight.pack_all.js:5
at Object.N [as registerLanguage] (highlight.pack_all.js:2)
at highlight.pack_all.js:5
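To make the error above concrete, here is a minimal sketch showing that the intended pattern compiles while the mis-decoded text from the error message does not. The mis-decoded string is copied from that message; the helper name `compiles` is hypothetical. The assumption is that the UTF-8 bytes of the Cyrillic letters were read with a different multi-byte charset, which turns each letter into an unrelated character, and that at least one resulting range now runs from a higher code point to a lower one.

```javascript
// Hypothetical helper: report whether a pattern source compiles as a RegExp.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Pattern as intended by the author (file decoded as UTF-8).
const intended = "[a-zA-Zа-яА-я]+[*]?";

// Pattern as it appears in the browser error message (file decoded with the wrong charset).
const misdecoded = "[a-zA-Z邪-褟袗-褟]+[*]?";

console.log(compiles(intended), compiles(misdecoded));
```

This is why the failure depends on the environment: the bytes on disk are identical, and only the charset used to decode them decides whether the character class is valid.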