uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Unicode Special Character at the Beginning is Corrupted #491

Open justdoit-amazon opened 2 years ago

justdoit-amazon commented 2 years ago

When parsing a file that starts with a Unicode special character, that character is replaced with the Unicode replacement character (U+FFFD).

For example, a UTF-8 file without a BOM containing のTESTING will be parsed as ��TESTING.

This appears to be a result of the BOM detection logic that was added to the parser.
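To make the example concrete, here is a minimal sketch (standard library only) that prints the leading bytes of のTESTING in UTF-8 and checks them against the UTF-8 BOM. It does not reproduce the parser's own BOM handling; the names `LeadingBytes`, `hex` and `startsWithUtf8Bom` are made up for illustration. It only shows that the input does not begin with a BOM, so BOM detection should leave those bytes untouched.

```scala
import java.nio.charset.StandardCharsets

object LeadingBytes {
  // The UTF-8 byte-order mark is the byte sequence EF BB BF.
  val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // Render bytes as upper-case hex pairs, e.g. "E3 81 AE".
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString(" ")

  // True only when the input begins with the UTF-8 BOM.
  def startsWithUtf8Bom(bytes: Array[Byte]): Boolean =
    bytes.length >= 3 && bytes.take(3).sameElements(Utf8Bom)

  def main(args: Array[String]): Unit = {
    // "\u306e" is the character の from the example in this report.
    val bytes = "\u306eTESTING".getBytes(StandardCharsets.UTF_8)
    println(hex(bytes.take(3)))        // prints "E3 81 AE"
    println(startsWithUtf8Bom(bytes))  // prints "false"
  }
}
```

The first character occupies three bytes, none of which match the BOM, so a correct BOM check has nothing to strip here.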

Sample code:

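import java.io.PushbackInputStream
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings, UnescapedQuoteHandling}

// `data` is an InputStream over the BOM-less UTF-8 file described above
val settings = new CsvParserSettings
settings.getFormat.setQuoteEscape('\u0000')
settings.getFormat.setQuote('\u0000')
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
settings.setQuoteDetectionEnabled(false)
val parser = new CsvParser(settings)
val peekableData = new PushbackInputStream(data)
parser.beginParsing(peekableData) // no charset is passed here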

Explicitly passing the charsetName of "UTF-8" into beginParsing is a workaround for the issue.
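A minimal sketch of that workaround, assuming univocity-parsers is on the classpath and using default parser settings rather than the ones above. The object name `ExplicitCharsetWorkaround` and the helper `firstRow` are hypothetical, and an in-memory byte array stands in for the real file.

```scala
import java.io.{ByteArrayInputStream, PushbackInputStream}
import java.nio.charset.StandardCharsets
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

object ExplicitCharsetWorkaround {
  // Parse the first row of `bytes`, naming the charset explicitly
  // instead of leaving the encoding for the parser to work out.
  def firstRow(bytes: Array[Byte]): Array[String] = {
    val parser = new CsvParser(new CsvParserSettings)
    val input = new PushbackInputStream(new ByteArrayInputStream(bytes))
    parser.beginParsing(input, "UTF-8")
    try parser.parseNext()
    finally parser.stopParsing()
  }

  def main(args: Array[String]): Unit = {
    // "\u306e" is the character の; the bytes carry no BOM.
    val bytes = "\u306eTESTING".getBytes(StandardCharsets.UTF_8)
    println(firstRow(bytes).mkString(","))
  }
}
```

The only change from the failing call is the second argument to `beginParsing`; the rest of the setup can stay as in the sample code.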