uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.
Multi-byte characters are not taken into account? #436
The problem occurs when reading fixed-length data that contains a mixture of multibyte and single-byte characters.
Specifying the encoding does not seem to make any difference.
Example code:
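import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.fixed.*;

// Record layout: 1-wide lookahead flag, 6-wide string, 8-wide date
FixedWidthFields fields = new FixedWidthFields();
fields.addField("lookahead", 1);
fields.addField("dataString", 6);
fields.addField("date", 8);
FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.addFormatForLookahead("1", fields);
FixedWidthParser parser = new FixedWidthParser(settings);
// Encode the test line as MS932 and parse it with the same charset
byte[] ms932Bytes = "1あああ20201218".getBytes(Charset.forName("MS932"));
ByteArrayInputStream bais = new ByteArrayInputStream(ms932Bytes);
parser.beginParsing(bais, Charset.forName("MS932"));
Record record = parser.parseNextRecord();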
Expected
1, あああ, 20201218
Actual
1, あああ202, 01218
My guess is that 'あ' takes 2 bytes in MS932 but is being counted as 1 when the field widths are applied.
Would it be possible to count multibyte characters correctly?
(I'm sorry, this is a machine-translated sentence, so it may sound strange.)
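To make the guess above concrete, here is a minimal sketch in plain Java, independent of uniVocity-parsers. It shows that the sample line is 12 characters but 15 bytes in MS932, and that cutting on byte widths of 1, 6 and 8 yields the expected fields. Cutting on character widths instead gives the reported "あああ202" and "01218". The class `ByteWidthSlicer` and the method `sliceByBytes` are hypothetical names invented for illustration, not library API. The sketch also assumes every field boundary falls between whole characters and that the JDK ships the MS932 charset.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ByteWidthSlicer {

    // Hypothetical helper, not part of uniVocity-parsers: cuts an encoded
    // record into fields whose widths are given in BYTES, then decodes each
    // slice on its own. Assumes no width splits a multibyte character.
    public static List<String> sliceByBytes(byte[] record, Charset charset, int... byteWidths) {
        List<String> fields = new ArrayList<>();
        int offset = 0;
        for (int width : byteWidths) {
            // Clamp the slice so a record shorter than the layout does not throw.
            int length = Math.max(0, Math.min(width, record.length - offset));
            fields.add(new String(record, offset, length, charset));
            offset += length;
        }
        return fields;
    }

    public static void main(String[] args) {
        Charset ms932 = Charset.forName("MS932");
        // "1" + three HIRAGANA LETTER A (U+3042) + "20201218", written with
        // Unicode escapes so the source compiles under any file encoding.
        String line = "1\u3042\u3042\u304220201218";
        byte[] bytes = line.getBytes(ms932);

        System.out.println(line.length());  // 12 characters
        System.out.println(bytes.length);   // 15 bytes: 1 + 3 * 2 + 8

        List<String> fields = sliceByBytes(bytes, ms932, 1, 6, 8);
        System.out.println(fields.size());  // 3
        System.out.println(fields.get(2));  // 20201218
    }
}
```

A pre-processing step like this could serve as a workaround for byte-oriented layouts until the parser handles them, at the cost of bypassing the parser's own field handling.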