uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Incorrect parsing with auto-detected "\r" line endings when normalizeLineEndingsWithinQuotes=false #499

Open eirikbakke opened 2 years ago

eirikbakke commented 2 years ago

The following CSV file, with "\r" style line endings...

colA,colB,colC
a,A,"x"
b,B,k

...should parse as [[colA, colB, colC], [a, A, x], [b, B, k]]. However, when lineSeparatorDetectionEnabled=true and normalizeLineEndingsWithinQuotes=false, I instead get [[colA, colB, colC], [a, A, "x"\rb, B, k]].

Here is a complete test case, which fails with Univocity 2.9.1 on Windows 11 and Java 17:

import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class UnivocityLineEndingBugTest {
  private static final boolean TRIGGER_BUG = true;

  private static CsvParserSettings createUnivocitySettings() {
    final CsvParserSettings settings = new CsvParserSettings();
    final CsvFormat format = settings.getFormat();
    settings.setDelimiterDetectionEnabled(false);
    format.setDelimiter(',');
    settings.setQuoteDetectionEnabled(false);
    format.setQuote('\"');
    format.setQuoteEscape('\"');
    settings.setKeepEscapeSequences(false);
    settings.setKeepQuotes(false);

    // Setting this to true will also cause the bug to go away.
    settings.setNormalizeLineEndingsWithinQuotes(false);
    //format.setNormalizedNewline('\n');
    if (TRIGGER_BUG) {
      settings.setLineSeparatorDetectionEnabled(true);
    } else {
      settings.setLineSeparatorDetectionEnabled(false);
      format.setLineSeparator("\r");
    }
    return settings;
  }

  @Test
  public void testBug() throws IOException {
    String csvFile =
        "colA,colB,colC\r" +
        "a,A,\"x\"\r" +
        "b,B,k\r";
    CsvParserSettings settings = createUnivocitySettings();
    List<List<String>> result = new ArrayList<>();
    try (Reader reader = new StringReader(csvFile)) {
      CsvParser parser = new CsvParser(settings);
      parser.beginParsing(reader);
      while (true) {
        String[] row = parser.parseNext();
        if (row == null)
          break;
        // System.out.println(Arrays.toString(row));
        result.add(new ArrayList<>(Arrays.asList(row)));
      }
    }
    System.out.println(result.toString());
    Assert.assertEquals("[[colA, colB, colC], [a, A, x], [b, B, k]]", result.toString());
  }
}
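
Until this is fixed, one possible workaround is to skip the built-in detection and set the line separator explicitly, since the test above suggests the bug does not occur on that path. Below is a minimal sketch of such a detector using only the JDK. The class name LineSeparatorSniffer and the method detect are hypothetical and not part of Univocity; the sketch assumes '"' is both the quote and the quote-escape character, as in the settings above, and that the sample passed in starts at the beginning of the file.

```java
public class LineSeparatorSniffer {
  /**
   * Hypothetical helper, not part of Univocity: returns the first line
   * separator found outside of quotes ("\r\n", "\n", or "\r"), or null
   * if the sample contains no unquoted line break.
   */
  public static String detect(String sample) {
    boolean inQuotes = false;
    for (int i = 0; i < sample.length(); i++) {
      char c = sample.charAt(i);
      if (c == '"') {
        // An escaped quote ("") toggles twice, so the state stays correct.
        inQuotes = !inQuotes;
      } else if (!inQuotes) {
        if (c == '\n') {
          return "\n";
        }
        if (c == '\r') {
          // A '\r' at the very end of a truncated sample is ambiguous;
          // this sketch treats it as a bare "\r".
          boolean followedByLf =
              i + 1 < sample.length() && sample.charAt(i + 1) == '\n';
          return followedByLf ? "\r\n" : "\r";
        }
      }
    }
    return null;
  }
}
```

The result, when non-null, could then be passed to format.setLineSeparator(...) with settings.setLineSeparatorDetectionEnabled(false), which mirrors the TRIGGER_BUG = false branch of the test case. I have not tested this beyond the inputs in this report.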

Thank you for your work on the excellent Univocity library! I am using it for Ultorg and am in the process of writing unit tests, which is how I found the bug above...

eirikbakke commented 2 years ago

Also note that the Javadoc and parameter name for CharInputReader.enableNormalizeLineEndings(escaping) seem to describe the reverse of the method's actual behavior, as assumed by callers and implemented in AbstractCharInputReader. In fact, in the overriding method in AbstractCharInputReader, the parameter has been renamed to normalizeLineEndings, which seems like the more accurate name.