uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Incorrect parsing of escape character and quotation mark in csv data #495

Open 16AnishV opened 2 years ago

16AnishV commented 2 years ago

I've set the escape character to a backslash (\) and expect it to escape both backslashes and quotation marks, but the parser behaves inconsistently. Consider this example:

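// Imports assumed by this snippet (IOUtils is from Apache Commons IO).
import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import org.apache.commons.io.IOUtils;
import org.junit.Test; // assumption: the test framework is not stated in the report

import java.nio.charset.StandardCharsets;

/*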
    Source input data:
        text1,text2  // header
        text,"01"
        text\,"02"
        "text\","03"
        ""text\"","04"
        """text\""","05"
        """text\\""","06"
 */
private static final String INPUT =
        "text1,text2\n" +
        "text,\"01\"\n" +
        "text\\,\"02\"\n" +
        "\"text\\\",\"03\"\n" +
        "\"\"text\\\"\",\"04\"\n" +
        "\"\"\"text\\\"\"\",\"05\"\n" +
        "\"\"\"text\\\\\"\"\",\"06\"\n";

@Test
public void testParser() {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.detectFormatAutomatically();
    parserSettings.setHeaderExtractionEnabled(true);
    parserSettings.setIgnoreLeadingWhitespaces(true);
    parserSettings.setIgnoreTrailingWhitespaces(true);
    parserSettings.setSkipEmptyLines(true);
    parserSettings.getFormat().setQuoteEscape('\\');
    parserSettings.getFormat().setCharToEscapeQuoteEscaping('\\');

    RowListProcessor rowProcessor = new RowListProcessor();
    parserSettings.setProcessor(rowProcessor);

    CsvParser parser = new CsvParser(parserSettings);

    parser.iterateRecords(IOUtils.toInputStream(INPUT, StandardCharsets.UTF_8))
            .forEach(record -> System.out.printf("record:[ %s ] --- #cols: %s%n", record, record.getValues().length));
}

I expected the output to match the source data, but this is what I get:

record:[ text, 01 ] --- #cols: 2
record:[ text\, 02 ] --- #cols: 2
record:[ text\, 03 ] --- #cols: 2
record:[ ""text\"", 04 ] --- #cols: 2
record:[ ""text"","05" ] --- #cols: 1
record:[ ""text"","06" ] --- #cols: 1

The number of quotation marks is incorrect, and the backslash escape character seems to be ignored in some cases. Crucially, the two columns in rows 5 and 6 are concatenated into a single column instead of staying separate as they are in the source data. The data is a little odd, but the real problem is that the behavior is not consistent.

Rows 5 and 6 should not have the same-shaped output; I expect row 6 to contain a backslash. It seems that parserSettings.getFormat().setCharToEscapeQuoteEscaping('\\'); doesn't work here.

Rows 3, 4, and 5 just don't honor the escape character. Granted, I could understand some odd behavior there, since honoring the escape would result in mismatched quotation marks.
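
To make the expectation concrete, here is a minimal stand-alone sketch of one possible reading of these escape rules. It is not univocity's implementation and does not use the library at all: the class `EscapeSketch`, its `split` method, and the exact rule set (backslash escapes plus doubled quotes inside a quoted field) are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeSketch {

    // Splits one CSV line under a hypothetical, lenient rule set (NOT univocity's):
    //  - a field that starts with '"' is quoted
    //  - inside quotes, '\' followed by '\' or '"' yields that character literally
    //  - inside quotes, a doubled '""' yields one literal '"'
    //  - any other '"' inside quotes closes the quoted section
    //  - outside quotes, ',' ends the field and every other character is literal
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStart = true;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            char next = i + 1 < line.length() ? line.charAt(i + 1) : '\0';

            if (inQuotes) {
                if (c == '\\' && (next == '\\' || next == '"')) {
                    field.append(next); // escaped backslash or quote
                    i++;
                } else if (c == '"' && next == '"') {
                    field.append('"');  // doubled quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;   // closing quote
                } else {
                    field.append(c);
                }
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
                fieldStart = true;
                continue;               // next character starts a new field
            } else if (c == '"' && fieldStart) {
                inQuotes = true;        // opening quote
            } else {
                field.append(c);
            }
            fieldStart = false;
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The six data rows from the report, without the header line.
        String[] rows = {
            "text,\"01\"",
            "text\\,\"02\"",
            "\"text\\\",\"03\"",
            "\"\"text\\\"\",\"04\"",
            "\"\"\"text\\\"\"\",\"05\"",
            "\"\"\"text\\\\\"\"\",\"06\""
        };
        for (String row : rows) {
            List<String> fields = split(row);
            System.out.printf("%s -> %s --- #cols: %d%n", row, fields, fields.size());
        }
    }
}
```

Under these assumed rules, rows 3 and 5 collapse into a single field because the escaped quote leaves the quotation marks mismatched, while row 6 splits into two fields and the first one (`"text\"`) keeps a backslash. That is the difference between rows 5 and 6 that the parser output above does not show.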

I've tried various CsvParserSettings options and found nothing that outputs 2 columns for rows 5 and 6. Could I please get an explanation or some help?

16AnishV commented 2 years ago

@jbax would you or another contributor be able to take a look at this issue please?