highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License

Update all files to use unicode code points rather than literal UTF-8 values #1901

Closed ZainChen closed 4 years ago

ZainChen commented 6 years ago

Hello, I downloaded the package from https://highlightjs.org/download/, and Google Chrome reports an error: Uncaught SyntaxError: Invalid regular expression: /[a-zA-Z邪-褟袗-褟]+[*]?/: Range out of order in character class at highlight.pack_all.js:5 at Object.N [as registerLanguage] (highlight.pack_all.js:2) at highlight.pack_all.js:5

ZainChen commented 6 years ago

Hello, the problem is solved: the cause was the wrong character encoding. In the highlight.pack.js file, the following code was not decoded correctly: [{b:/[a-zA-Zа-яА-я]+[*]?/} The fix is to add <meta charset="UTF-8"> to the page.
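
For readers wondering why a wrong encoding produces "Range out of order": once the UTF-8 bytes of the Cyrillic range are decoded with another charset, the range endpoints turn into unrelated characters, and a range whose start sorts after its end is a syntax error. A minimal sketch, using the garbled pattern copied from the error message above (the helper name compileError is made up for illustration):

```javascript
// Pattern as written in the source file (Cyrillic range, decoded as UTF-8).
const intended = "[a-zA-Zа-яА-я]+[*]?";
// Same pattern as it appears in the reported error (bytes decoded with the wrong charset).
const garbled = "[a-zA-Z邪-褟袗-褟]+[*]?";

// Hypothetical helper: returns null if the pattern compiles, otherwise the error's name.
function compileError(pattern) {
  try {
    new RegExp(pattern);
    return null;
  } catch (err) {
    return err.name;
  }
}

console.log(compileError(intended)); // compiles
console.log(compileError(garbled)); // fails to compile

// Endpoints of the first garbled range; a range is only valid when start <= end.
const start = garbled.codePointAt(7);
const end = garbled.codePointAt(9);
console.log(start.toString(16), end.toString(16));
```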

marcoscaceres commented 6 years ago

Looks like we need to fix the regex to use explicit Unicode code points. Unfortunately, meta charset is a HTML only thing, not a JavaScript thing - but you are right that it’s an encoding problem.
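
As a rough illustration of what "explicit Unicode code points" could look like for the pattern quoted above (a sketch, not the actual patch; the sample strings are made up):

```javascript
// Literal form: only correct if the file is decoded as UTF-8.
const literal = /[a-zA-Zа-яА-я]+[*]?/;
// Escaped form: pure ASCII source, same character class.
// U+0430 = а, U+044F = я, U+0410 = А.
const escaped = /[a-zA-Z\u0430-\u044F\u0410-\u044F]+[*]?/;

// Illustrative inputs only; both patterns should agree on each of them.
const samples = ["Процедура", "abc*", "123", "Ёж"];
for (const s of samples) {
  console.log(s, literal.test(s), escaped.test(s));
}
```

Because the escaped form contains only ASCII bytes, it is read the same way regardless of the charset the browser assumes for the script.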

ZainChen commented 6 years ago

I am very happy to have made this accidental discovery. I am using highlightjs, I like it very much, it is very easy to use, you are too powerful! Here is my result: https://www.cnblogs.com/chenzhiyin/p/zain3.html

joshgoebel commented 5 years ago

Possibly helpful:

% find src | xargs file | grep UTF-8
src/languages/routeros.js:                UTF-8 Unicode text, with very long lines
src/languages/cos.js:                     UTF-8 Unicode text
src/languages/apache.js:                  UTF-8 Unicode text
src/languages/avrasm.js:                  UTF-8 Unicode text
src/languages/fsharp.js:                  UTF-8 Unicode text
src/languages/capnproto.js:               UTF-8 Unicode text
src/languages/objectivec.js:              UTF-8 Unicode text
src/languages/1c.js:                      UTF-8 Unicode text
src/languages/nix.js:                     UTF-8 Unicode text
src/languages/bnf.js:                     UTF-8 Unicode text
src/languages/julia.js:                   UTF-8 Unicode text
src/languages/jboss-cli.js:               UTF-8 Unicode text
src/languages/n1ql.js:                    UTF-8 Unicode text
src/languages/less.js:                    Algol 68 source text, UTF-8 Unicode text
src/languages/arduino.js:                 Algol 68 source text, UTF-8 Unicode text
src/languages/basic.js:                   UTF-8 Unicode text
src/languages/css.js:                     UTF-8 Unicode text
src/languages/sqf.js:                     UTF-8 Unicode text
src/languages/isbl.js:                    UTF-8 Unicode text
src/languages/coffeescript.js:            UTF-8 Unicode text
src/languages/makefile.js:                UTF-8 Unicode text
src/languages/python.js:                  UTF-8 Unicode text
src/languages/dockerfile.js:              UTF-8 Unicode text

joshgoebel commented 5 years ago

Though this may have false positives, since it's going to pick up things like author names:

Author: Raphaël Assénat raph@raphnet.net

Really I think we only need to fix the code... since if we're packaging this up for the packed or CDN builds, comments should really be removed anyway.

joshgoebel commented 5 years ago

1c.... oh boy. Maybe there is some sort of tool that could do this for us as part of the build pipeline? Does sound kind of like a computer type of problem.
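
One way such a build step could work, sketched under heavy assumptions: the helper below (escapeNonAscii is a made-up name, not part of highlight.js or any build tool) rewrites every non-ASCII UTF-16 code unit as a \uXXXX escape. It works on raw text rather than a parsed syntax tree, so it would also rewrite comments; a real pipeline would more likely rely on a minifier option.

```javascript
// Hypothetical helper: replace each non-ASCII UTF-16 code unit with a \uXXXX escape.
// Intended for text inside string and regex literals; it does not parse the source.
function escapeNonAscii(source) {
  return source.replace(/[^\x00-\x7F]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Illustrative input modeled on the pattern from this issue.
const line = "begin: /[a-zA-Zа-яА-я]+[*]?/";
console.log(escapeNonAscii(line));
// prints: begin: /[a-zA-Z\u0430-\u044F\u0410-\u044F]+[*]?/
```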

joshgoebel commented 4 years ago

Is anyone actually still having a problem with this issue?

@ZainChen

joshgoebel commented 4 years ago

Closing this because it no longer seems to be an issue.

sachinrekhi commented 4 years ago

FYI, I just ran into this issue with the latest version of highlight.js. Fixed it in webpack based on the following article: https://medium.com/@thinkpanda_75045/webpack-with-unicode-7a2c5eb22afd

joshgoebel commented 4 years ago

Ran into it in what context?

I think for a long time our answer has been: "serve it with the proper headers and it just works". Are you having a problem in the usage, or the build pipeline itself? Are you using the npm library?

sachinrekhi commented 4 years ago

I installed highlight.js v10.2.1 via npm and then bundled it with the rest of my JS using webpack. The build pipeline worked fine. However, when I then ran my app, I got the following error in my browser:

Uncaught SyntaxError: Invalid regular expression: Range out of order in character class

The browser in this case is an iOS web view (WKWebView) where I serve the JS bundle from a local file. So I don't think I have an opportunity to set a UTF-8 content header anywhere.

My fix was to customize the Terser minifier I use in my build pipeline to output ascii only, ensuring proper encoding. This successfully removed the error at runtime.

joshgoebel commented 4 years ago

So I don't think I have an opportunity to set a UTF-8 content header anywhere.

In the HTML you're generating for your webview, no? If the BOM field is present at the beginning of the file, does that work - any idea?

My fix was to customize the Terser minifier I use in my build pipeline to output ascii only

We use terser for our web packaging... if it's just a simple flag we might consider just building our web distributable with that enabled... but that wouldn't help you since you're using NPM... I suppose we could perhaps also terse those (just without all the compression stuff, but I have no experience using terser in this fashion).

I'm also curious what others are doing that this can go 2 years without anyone bumping into it as an issue...

sachinrekhi commented 4 years ago

So I attempted to set the charset in the webview html file head like the following, but it didn't do anything:

<head>
    <meta charset="UTF-8">
</head>

But is there a way to set a charset on the actual script include? I'm currently using the following:

<script type="text/javascript" src="bundle.js"></script>

As for the Terser plugin, it is just a simple flag. I just added the following to my webpack.prod.config:

    optimization: {
        minimize: true,
        minimizer: [
            new TerserPlugin({
                terserOptions: { output: { ascii_only: true } }
            })
        ]
    },

I'm not sure about whether you can use Terser for re-encoding without the minification.

joshgoebel commented 4 years ago

It's mentioned briefly in the README:

<script charset="UTF-8"

As for the Terser plugin, it is just a simple flag.

I just tried it. It increased the gzipped size of our default build by 50%... I'm assuming all that encoded UTF-8 does not compress very well (though I'm not even sure what file is to blame, it does seem high). So I'm not feeling it unless someone wanted to do more research. I'm not sure there is anything to fix regarding the NPM modules though (perhaps a doc fix, but I'm not sure what)... UTF-8 is plenty valid inside JS... so the solution is to specify charset or if you have a really strange env that couldn't handle UTF-8 at all I support something like you're doing now with ascii_only... that shouldn't be necessary though if you add the charset directive.
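
To give a feel for the size cost, here is a small sketch comparing the raw UTF-8 byte length of a literal character class with its escaped equivalent. It measures uncompressed bytes only and says nothing about gzip behavior; the helper name utf8Bytes is made up for illustration.

```javascript
// The same character class, once with literal Cyrillic and once with \u escapes.
const literal = "[a-zA-Zа-яА-я]";
const escaped = "[a-zA-Z\\u0430-\\u044F\\u0410-\\u044F]";

// Hypothetical helper: number of bytes the text occupies when stored as UTF-8.
const utf8Bytes = (text) => new TextEncoder().encode(text).length;

// Each Cyrillic letter is 2 bytes as a literal and 6 bytes as a \uXXXX escape.
console.log(utf8Bytes(literal), utf8Bytes(escaped));
```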

joshgoebel commented 4 years ago

    terser: {
      "format": {
        "ascii_only": true,
      },
      "compress": {
        passes: 2,
        unsafe: true,
        warnings: true,
        dead_code: true,
        toplevel: "funcs"
      }
    }

OK, for some reason ascii_only entirely turns off compression, no idea what's up with that...

sachinrekhi commented 4 years ago

Sorry, I must have missed that in the docs. Adding <script charset="UTF-8" fixed it for me so I can remove the TerserPlugin ascii_only output!

I agree, looks like there is nothing to do then. Appreciate the quick feedback!

joshgoebel commented 4 years ago

Ha, needed to bump our version... looks like we'll be switching to ascii_only for our browser/CDN builds if this causes no issues. Thanks for bringing this to my attention! :)