jruby / jcodings

Java-based codings helper classes for Joni and JRuby
MIT License

Econv behaviour for gb18030 #39

Closed LillianZ closed 4 years ago

LillianZ commented 4 years ago

Should the last two lines be equal? Thanks!

EConv econv = TranscoderDB.open("UTF-8", "gb18030", 0);

byte[] src = "Lašas".getBytes("UTF-8");
byte[] dest = new byte["Lašas".getBytes("gb18030").length];

econv.convert(src, new Ptr(0), 6, dest, new Ptr(0), dest.length, 0);
System.out.println(Arrays.toString(dest)); 
// [76, 97, -127, 48, 18, 56, 97, 115]
System.out.println(Arrays.toString("Lašas".getBytes("gb18030"))); 
// [76, 97, -127, 48, -108, 56, 97, 115]

headius commented 4 years ago

Assuming the encoding of these characters is the same in Ruby, the GB18030 bytes appear to match:

[] ~/projects/jruby $ ruby -e 'p "Lašas".encode("gb18030").bytes'
[76, 97, 129, 48, 148, 56, 97, 115]

I do not have an explanation for why the Java GB18030 encoder produces different output.
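One thing worth ruling out when comparing the two listings: Java's `byte` is signed, so `Arrays.toString` prints values above 127 as negatives, while Ruby's `bytes` prints them as 0..255. A minimal sketch of the conversion follows; `UnsignedView` and `toUnsigned` are hypothetical names for illustration, not part of jcodings, and the byte values are copied from the report above rather than recomputed.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of jcodings.
class UnsignedView {
    // Map each signed Java byte (-128..127) to its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // Values copied from the String.getBytes("gb18030") line in the report.
        byte[] javaBytes = {76, 97, -127, 48, -108, 56, 97, 115};
        System.out.println(Arrays.toString(toUnsigned(javaBytes)));
        // prints [76, 97, 129, 48, 148, 56, 97, 115]
    }
}
```

Read this way, the `getBytes("gb18030")` line from the report is the same sequence as the Ruby output; only the `econv.convert` line differs.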

headius commented 4 years ago

Based on this online converter, we also match:

$ ruby -e 'p "Lašas".encode("gb18030").bytes.map{|i| i.to_s(16)}'
["4c", "61", "81", "30", "94", "38", "61", "73"]

I would say the Java encoder is in error here.

headius commented 4 years ago

Actually, now I see that the Java getBytes output matches Ruby, but the manually transcoded result in your example is not correct.
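To pin down where the two reported arrays diverge, a small stdlib-only sketch can print both in hex and locate the first differing index. `ByteDiff` and `toHex` are hypothetical names for illustration, the byte values are copied from the report, and `Arrays.mismatch` assumes Java 9 or later.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of jcodings.
class ByteDiff {
    // Render bytes as space-separated two-digit lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reported output of econv.convert (older jcodings).
        byte[] transcoded = {76, 97, -127, 48, 18, 56, 97, 115};
        // Reported output of String.getBytes("gb18030").
        byte[] expected = {76, 97, -127, 48, -108, 56, 97, 115};

        System.out.println(toHex(transcoded)); // 4c 61 81 30 12 38 61 73
        System.out.println(toHex(expected));   // 4c 61 81 30 94 38 61 73

        // Arrays.mismatch returns the first differing index, or -1 if none.
        System.out.println("first mismatch at index " + Arrays.mismatch(transcoded, expected));
        // prints: first mismatch at index 4
    }
}
```

Index 4 is the third byte of the four-byte sequence `81 30 94 38` that encodes "š", so only that one byte of the reported transcoder output is off (`12` instead of `94`).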

I made this into a test class and I believe the latest jcodings should match. Perhaps you are running against an old version?

$ java -cp ../jcodings/target/jcodings.jar:. Blah
[76, 97, -127, 48, -108, 56, 97, 115]
[76, 97, -127, 48, -108, 56, 97, 115]

import org.jcodings.*;
import org.jcodings.transcode.*;
import java.util.*;

public class Blah {
    public static void main(String[] args) throws Throwable {
        EConv econv = TranscoderDB.open("UTF-8", "gb18030", 0);

        byte[] src = "Lašas".getBytes("UTF-8");
        byte[] dest = new byte["Lašas".getBytes("gb18030").length];

        // 6 is the UTF-8 length of "Lašas" ("š" takes two bytes)
        econv.convert(src, new Ptr(0), 6, dest, new Ptr(0), dest.length, 0);
        System.out.println(Arrays.toString(dest));
        // reported with the older jcodings: [76, 97, -127, 48, 18, 56, 97, 115]
        // with the build above:             [76, 97, -127, 48, -108, 56, 97, 115]
        System.out.println(Arrays.toString("Lašas".getBytes("gb18030")));
        // [76, 97, -127, 48, -108, 56, 97, 115]
    }
}

headius commented 4 years ago

Possibly fixed by @k77ch7 in 408210ce852febb2959f2bcdc460f2c91c195117. In any case, it's no longer broken.

LillianZ commented 4 years ago

Yes, I was using an old version. Thanks for helping me debug!