nahsra / antisamy

a library for performing fast, configurable cleansing of HTML coming from untrusted sources
BSD 3-Clause "New" or "Revised" License
186 stars 89 forks source link

fix(#456): fix default charset problem #457

Open GodMeowIceSun opened 3 months ago

GodMeowIceSun commented 3 months ago

the fix of this issue

456

davewichers commented 3 months ago

@spassarop - Have you looked at this issue/proposed fix yet? If there is a real problem here it would be nice to fix this and include it in the release we are close to getting done related to the neko-html fix.

kwwall commented 3 months ago

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI . You are receiving this because your review was requested.Message ID: @.***>

spassarop commented 3 months ago

I agree with Kevin. Probably the default Windows encoding was used on policy XML because the content of that file was unlikely to have characters beyond that encoding.

I am not sure why there were error messages in Chinese but were not tested too see if the text would show correctly. I mean, if the display was tested it would look obviously broken.

Probably if an environment does not support UTF-8, AntiSamy usage should not be the main issue.

Then, I share Kevin suggestion to update the encoding to UTF-8 on this code change and test it.

On Thu, 6 Jun 2024 at 02:23 Kevin W. Wall @.***> wrote:

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See

https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI>

. You are receiving this because your review was requested.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#issuecomment-2151435228, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMPHIOCU4ZAWHBAEX73ZF7W35AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJRGQZTKMRSHA . You are receiving this because you were mentioned.Message ID: @.***>

GodMeowIceSun commented 3 months ago

I agree with Kevin. Probably the default Windows encoding was used on policy XML because the content of that file was unlikely to have characters beyond that encoding.

I am not sure why there were error messages in Chinese but were not tested too see if the text would show correctly. I mean, if the display was tested it would look obviously broken.

Probably if an environment does not support UTF-8, AntiSamy usage should not be the main issue.

Then, I share Kevin suggestion to update the encoding to UTF-8 on this code change and test it.

On Thu, 6 Jun 2024 at 02:23 Kevin W. Wall @.***> wrote:

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See

https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI>

. You are receiving this because your review was requested.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#issuecomment-2151435228, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMPHIOCU4ZAWHBAEX73ZF7W35AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJRGQZTKMRSHA . You are receiving this because you were mentioned.Message ID: @.***>

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

kwwall commented 3 months ago

@GodMeowIceSun wrote:

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

I did read through this. I think the worst case scenario here is that someone may have to convert the ResourceBundle from ISO-8859-1 to UTF-8. But every Java instance that I've ever worked with, dating back to JDK 1.1, has supported UTF-8, so I don't think this will be a problem, and it's only a 1 line change to check it. If it's a problem, we can convert the encoding for the ResourceBundle. But if it's all in ASCII, we shouldn't even need to do that. I certainly don't think a different minimal JDK version needs to be updated to handle this though.

kwwall commented 2 months ago

I'm not recommending it, but I suppose if you really wanted to, you could tweak the code to use a character set based on a system property and just have it default to UTF-8 if not specified. The one thing I do know if UTF-8 will always be available, regardless of whatever language preferences. I suspect that ISO-8859-1 usually would be, but I'm not sure that is 100% the case if you were using Chinese or Kanji, etc. But UTF-8 is used for other internal encodings. E.g., it's often used with cryptography related code, otherwise going between Windows and *nix often causes problems. For example if shows up in these classes:

$ cd $JAVA_HOME
$ grep -r '"UTF-8"' src | grep /crypto/
com/sun/crypto/provider/PBKDF2KeyImpl.java: *    to bytes using UTF-8 character encoding.
com/sun/crypto/provider/PBKDF2KeyImpl.java:        Charset utf8 = Charset.forName("UTF-8");
sun/security/krb5/internal/crypto/dk/AesDkCrypto.java:            saltUtf8 = salt.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:// String.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:        Charset utf8 = Charset.forName("UTF-8");
javax/crypto/CryptoPermissions.java:        parser.read(new BufferedReader(new InputStreamReader(in, "UTF-8")));

I think it is used when stuff needs to be serialized / deserialized so it will work across operating systems, but I can't swear to that as it's been 15-20 years since I looked at that code.

That said, ISO-8859-1 would almost certainly work, but it's just the natural choice that UTF-8 is. Just my $.02 though. In ESAPI, we use UTF-8 for everything, but this is your choice. I just think UTF-8 is more neutral whereas ISO-8859-1 has a Western European slant.