mozilla / rhino

Rhino is an open-source implementation of JavaScript written entirely in Java
https://rhino.github.io
Other
4.15k stars 846 forks source link

Encoding error with local .js files after openJDK 17.0.3 #1232

Closed midgleyc closed 3 months ago

midgleyc commented 2 years ago

In https://bugs.openjdk.java.net/browse/JDK-8273655, a change was made so that ".js" files now are seen as having content type "text/javascript".

This leads to a change in the UrlModuleSourceProvider where local ".js" files (or any ".js" files without a Content-Type encoding) are interpreted as having a character encoding of "8859_1" while previously they were interpreted as having a character encoding of "utf-8".

This has lead to a problem with KoLMafia's ability to parse javascript scripts, as discussed in this thread.

I've put together a repo at java-probe-content-type which shows the difference for local files.

I don't know what a good fix to this would look like (as assuming files without an explicit encoding are cp-1252 is as the spec says), but it might be a quicker fix if getCharacterEncoding were protected instead of private static. That way we could override it on our end to expect utf-8 instead of 8859_1 without affecting other uses of this library.

p-bakker commented 2 years ago

One way to solve this locally without any code changes is overiding the content-types.properties file used by setting the content.types.user.table system property to point to a custom file

midgleyc commented 2 years ago

Filed #1233 to make getCharacterEncoding protected. Thinking more I think Rhino can fix the problem on its end: we only have the issue for local files, and you can use url.getProtocol() to determine whether the file is local or not (if local, the protocol is file). The text/ adjustment should likely only be done if the file comes over the HTTP protocol, so

if (contentType != null && contentType.startsWith("text/")) {
    return "8859_1";
}

could be replaced with

String protocol = url.getProtocol();
boolean isHttp = protocol != null && (protocol.equals("http") || protocol.equals("https"));
if (isHttp && contentType != null && contentType.startsWith("text/")) {
    return "8859_1";
}

Does this seem reasonable?

p-bakker commented 2 years ago

I haven't found any documentation that local files in CommonJS ought to be stored in utf-8. However, it seems to me that NodeJS expects this.

@botic would you have more insight, given that RingoJS supports CommonJS as well?

So imho I'd think that defaulting to utf-8 for local files might be the easiest.

Note that the defaulting to 8859_1 is basically wrong as well: that used to be the default in http, but has been removed from the spec

And nowadays the BOM ought to be checked (even for utf8 where a BOM is optional)

botic commented 2 years ago

In Ringo the module loading process uses the charset specified in the Ringo config. Every module load will enforce this charset. Default is utf-8 for all text-based resources.

alinbrici46 commented 1 year ago

Anyone have any idea when this will be fixed?

midgleyc commented 3 months ago

Fix (workaround) released with 1.7.15.