jruby / jcodings

Java-based codings helper classes for Joni and JRuby
MIT License

ArrayIndexOutOfBoundsExceptions when transcoding UTF8-SoftBank=>SJIS-KDDI or CP51932=>CP50220 #42

Open djoooooe opened 3 years ago

djoooooe commented 3 years ago

The following unit tests crash in org.jcodings.transcode.Transcoding:

@Test
public void test1() {
    byte[] src = {0, -19, -97, -65, -18, -128, -128, -12, -113, -65, -65};
    byte[] dst = new byte[8];
    Ptr srcPtr = new Ptr(7);
    Ptr dstPtr = new Ptr(0);
    EConv econv = TranscoderDB.open("UTF8-SoftBank", "SJIS-KDDI", 0);
    econv.convert(src, srcPtr, src.length, dst, dstPtr, dst.length, 0);
}

java.lang.ArrayIndexOutOfBoundsException: 3
    at org.jcodings.transcode.Transcoding.transcodeRestartable0(Transcoding.java:172)
    at org.jcodings.transcode.Transcoding.transcodeRestartable(Transcoding.java:105)
    at org.jcodings.transcode.Transcoding.convert(Transcoding.java:86)
    at org.jcodings.transcode.EConv.transSweep(EConv.java:236)
    at org.jcodings.transcode.EConv.transConvNeedReport(EConv.java:300)
    at org.jcodings.transcode.EConv.transConv(EConv.java:294)
    at org.jcodings.transcode.EConv.convertInternal(EConv.java:410)
    at org.jcodings.transcode.EConv.convert(EConv.java:452)
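
For context on test1: srcPtr starts at 7, so (assuming the Ptr constructor argument is the starting index into src, as it is used above) only the last four bytes of src are handed to the converter. The sketch below decodes those bytes with the JDK's own UTF-8 decoder, not jcodings, to show which character is involved. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Test1Input {
    // Same input as test1 above.
    static final byte[] SRC = {0, -19, -97, -65, -18, -128, -128, -12, -113, -65, -65};

    // Decode SRC from `offset` to the end as standard UTF-8 and return the
    // first code point. This uses the JDK decoder only; it does not model
    // the UTF8-SoftBank specific mappings.
    static int codePointFrom(int offset) {
        String s = new String(SRC, offset, SRC.length - offset, StandardCharsets.UTF_8);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Bytes 7..10 are F4 8F BF BF.
        System.out.printf("U+%X%n", codePointFrom(7)); // prints U+10FFFF
    }
}
```

So the failing input in test1 appears to be the single four-byte sequence for U+10FFFF, the highest Unicode code point.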
@Test
public void test2() {
    byte[] src = {0, 127, -114, -95, -114, -2, -95, -95, -95, -2, -94, -95, -94, -2, -93, -95, -93, -2, -92, -95, -92, -2, -91, -95, -91, -2, -90, -95, -90, -2, -89, -95, -89, -2, -88, -95, -88,
                    -2, -87, -95, -87, -2, -86, -95, -86, -2, -85, -95, -85, -2, -84, -95, -84, -2, -83, -95, -83, -2, -82, -95, -82, -2, -81, -95, -81, -2, -80, -95, -80, -2, -79, -95, -79, -2,
                    -78, -95, -78, -2, -77, -95, -77, -2, -76, -95, -76, -2, -75, -95, -75, -2, -74, -95, -74, -2, -73, -95, -73, -2, -72, -95};
    byte[] dst = new byte[100];
    Ptr srcPtr = new Ptr(0);
    Ptr dstPtr = new Ptr(0);
    EConv econv = TranscoderDB.open("CP51932", "CP50220", 0);
    econv.convert(src, srcPtr, src.length, dst, dstPtr, dst.length, 0);
}

java.lang.ArrayIndexOutOfBoundsException: 186
    at org.jcodings.transcode.TranscodeFunctions.funSoCp50220Encoder(TranscodeFunctions.java:528)
    at org.jcodings.transcode.specific.Cp50220_encoder_Transcoder.startToOutput(Cp50220_encoder_Transcoder.java:45)
    at org.jcodings.transcode.Transcoding.transcodeRestartable0(Transcoding.java:307)
    at org.jcodings.transcode.Transcoding.transcodeRestartable(Transcoding.java:105)
    at org.jcodings.transcode.Transcoding.convert(Transcoding.java:86)
    at org.jcodings.transcode.EConv.transSweep(EConv.java:236)
    at org.jcodings.transcode.EConv.transConvNeedReport(EConv.java:300)
    at org.jcodings.transcode.EConv.transConv(EConv.java:294)
    at org.jcodings.transcode.EConv.convertInternal(EConv.java:406)
    at org.jcodings.transcode.EConv.convert(EConv.java:452)

headius commented 3 years ago

Good find! Could be a bug in the transcoder (these were loose ports from the C code in Ruby) or a bad/old Unicode table.

@lopex got a chance to look at this? Maybe it's another one-character fix. 😀

headius commented 3 years ago

I took a look into the CP50220 issue and read through the relevant functions (org.jcodings.transcode.TranscodeFunctions#funSoCp50220Encoder, org.jcodings.transcode.TranscodeFunctions#funSoCp5022xEncoder) and the data table used here (org.jcodings.transcode.TranscodeFunctions#tbl0208) and everything appears to match the C implementation.

A reduced case can use {0, 127, -114, -95, -114, -2}, because it blows up on the first -2. To reproduce it in Ruby, you can use the following snippet of code:

"\x00\x7f\x8e\xa1\x8e\xfe\xa1\xa1\xa1\xfe".force_encoding("CP51932").encode("CP50220")

It blows up in JRuby and works in CRuby.
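
To relate the signed Java bytes in the reduced case to the Ruby string literal, here is a small sketch that prints a byte array as Ruby-style \xNN escapes. The class and method names are hypothetical helpers, not part of jcodings.

```java
public class ReducedCaseBytes {
    // Render signed Java bytes as Ruby-style \xNN escapes.
    // Masking with 0xff maps e.g. -2 to 0xfe and -114 to 0x8e.
    static String toRubyEscapes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] reduced = {0, 127, -114, -95, -114, -2};
        System.out.println(toRubyEscapes(reduced)); // prints \x00\x7f\x8e\xa1\x8e\xfe
    }
}
```

The six reduced bytes match the start of the Ruby literal; the literal's remaining four bytes (\xa1\xa1\xa1\xfe) are the next four bytes of src in test2 (-95, -95, -95, -2).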