uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Multi-byte characters are not taken into account? #436

Closed: YusukeD closed this issue 3 years ago

YusukeD commented 3 years ago

The problem occurs when reading fixed-length data that contains a mixture of multibyte and single-byte characters. Specifying the encoding does not seem to make any difference.

Below is example code that reproduces it.

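import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.fixed.FixedWidthFields;
import com.univocity.parsers.fixed.FixedWidthParser;
import com.univocity.parsers.fixed.FixedWidthParserSettings;

// Field widths, intended here as byte lengths
FixedWidthFields fields = new FixedWidthFields();
fields.addField("lookahead", 1);
fields.addField("dataString", 6); // 'あ' is 2 bytes in MS932, so 6 was meant to cover あああ
fields.addField("date", 8);

// Apply this layout to records that start with "1"
FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.addFormatForLookahead("1", fields);
FixedWidthParser parser = new FixedWidthParser(settings);

// Encode the sample record as MS932 and parse it with the same charset
byte[] ms932Bytes = "1あああ20201218".getBytes(Charset.forName("MS932"));
ByteArrayInputStream bais = new ByteArrayInputStream(ms932Bytes);
parser.beginParsing(bais, Charset.forName("MS932"));

Record record = parser.parseNextRecord();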

Expected

1, あああ, 20201218

Actual

1, あああ202, 01218

My guess is that 'あ' is 2 bytes in MS932 but is being counted as 1, so a width of 6 consumes six characters instead of six bytes. Would it be possible to count multibyte characters correctly?

(I'm sorry, this is a machine-translated sentence, so it may sound strange.)
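The guess above can be checked with plain JDK calls, independent of the parser. The following is a minimal sketch; the class and method names (CharVsByteLength, byteLength, charLength) are hypothetical and are not part of univocity-parsers. It assumes the MS932 charset is available in the running JDK.

```java
import java.nio.charset.Charset;

final class CharVsByteLength {
    // Number of bytes the text occupies once encoded with the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Number of Java chars (UTF-16 code units); equals the character count
    // for text in the Basic Multilingual Plane, such as hiragana.
    static int charLength(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        Charset ms932 = Charset.forName("MS932");
        String data = "\u3042\u3042\u3042"; // あああ
        String line = "1" + data + "20201218";

        System.out.println("chars in dataString: " + charLength(data));
        System.out.println("MS932 bytes in dataString: " + byteLength(data, ms932));
        System.out.println("chars in record: " + charLength(line));
        System.out.println("MS932 bytes in record: " + byteLength(line, ms932));
    }
}
```

For this record the data field is 3 characters but 6 bytes in MS932, and the whole record is 12 characters but 15 bytes, which matches the difference between the expected and actual results.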

jbax commented 3 years ago

The parser counts characters, not bytes. For the input you have, defining fields.addField("dataString", 3); would produce 1, あああ, 20201218.

Hope this helps
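If a layout is specified in bytes and a field can hold a varying mix of single-byte and multibyte characters, a fixed character width will not always line up. One possible workaround, sketched here with the JDK only, is to slice the raw bytes by byte width and decode each slice. ByteWidthSplitter and splitByBytes are hypothetical names and are not part of univocity-parsers; the sketch assumes each record is already available as a byte array and that field boundaries never fall inside a multibyte character.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class ByteWidthSplitter {
    // Splits one record into fields whose widths are given in bytes,
    // decoding each slice with the supplied charset.
    static List<String> splitByBytes(byte[] record, Charset charset, int... widths) {
        List<String> fields = new ArrayList<>();
        int offset = 0;
        for (int width : widths) {
            if (offset + width > record.length) {
                throw new IllegalArgumentException("Record is shorter than the declared widths");
            }
            fields.add(new String(record, offset, width, charset));
            offset += width;
        }
        return fields;
    }

    public static void main(String[] args) {
        Charset ms932 = Charset.forName("MS932");
        // "1あああ20201218" written with Unicode escapes
        byte[] record = "1\u3042\u3042\u304220201218".getBytes(ms932);
        // Byte widths from the original report: 1, 6, 8
        System.out.println(splitByBytes(record, ms932, 1, 6, 8));
    }
}
```

With the byte widths 1, 6, 8 from the report, the three fields come out as 1, あああ and 20201218.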