marylinh / owasp-esapi-java

Automatically exported from code.google.com/p/owasp-esapi-java

HTMLEntityCodec destroys 32-bit CJK (Chinese, Japanese and Korean) characters #297

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Escape "𡘾𦴩𥻂" with org.owasp.esapi.Encoder#encodeForHTML
2. View the result in a browser

What is the expected output? What do you see instead?
Expected: 𡘾𦴩𥻂
Current: ������

What version of the product are you using? On what operating system?
2.0.1 on Mac OS X 10.8.3

Does this issue affect only a specified browser or set of browsers?
It's the same in Chrome, Firefox and IE.

Please provide any additional information below.
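The reason is that these characters lie outside the Basic Multilingual Plane, so they do not fit in a single 16-bit Java char/Character; each one is stored as a surrogate pair of two chars. Here is some code to illustrate it: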

String s = "𡘾𦴩𥻂";
// Wrong: charAt() returns UTF-16 code units, so each surrogate half is escaped on its own
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    sb.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(sb); // &#xd845;&#xde3e;&#xd85b;&#xdd29;&#xd857;&#xdec2;

// Correct: codePointAt() reads the whole code point, and charCount() skips both halves of a pair
sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);
    sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
    i += Character.charCount(codePoint);
}
System.out.println(sb); // &#x2163e;&#x26d29;&#x25ec2;
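
For anyone who wants to check the surrogate arithmetic, here is a minimal self-contained sketch. It is not ESAPI code: the class CodePointEntityDemo and the helpers encodeByCodePoint and toSurrogates are hypothetical names used only for illustration, and it assumes Java 8 or later for String.codePoints(). The test string is written with \u escapes so the result does not depend on the source file encoding.

```java
class CodePointEntityDemo {

    // Hypothetical helper (not part of ESAPI): writes every code point of the
    // input as one hexadecimal numeric character reference.
    public static String encodeByCodePoint(String input) {
        StringBuilder sb = new StringBuilder();
        // codePoints() yields one int per code point, combining surrogate pairs
        input.codePoints().forEach(cp ->
                sb.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return sb.toString();
    }

    // Splits a supplementary code point (above U+FFFF) into its UTF-16
    // surrogate pair, to show where values such as d845/de3e come from.
    public static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // The three characters from the report, as UTF-16 escapes
        String s = "\uD845\uDE3E\uD85B\uDD29\uD857\uDEC2";

        System.out.println(s.length());                      // 6 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(encodeByCodePoint(s));            // &#x2163e;&#x26d29;&#x25ec2;

        char[] pair = toSurrogates(0x2163E);
        System.out.println(Integer.toHexString(pair[0]) + " "
                + Integer.toHexString(pair[1]));             // d845 de3e
    }
}
```

The length of 6 against a code point count of 3 is the mismatch that the char-by-char loop trips over: a browser receives six references to lone surrogates, which are not valid characters on their own.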

Original issue reported on code.google.com by ri.j...@gmail.com on 4 Apr 2013 at 12:36

GoogleCodeExporter commented 9 years ago
Duplicate of issue 294, reported over a year ago. Seems this project is dead.

Original comment by julian.r...@googlemail.com on 14 Jun 2014 at 7:08