uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Incorrect column pruning #529

Open MaxGekk opened 10 months ago

MaxGekk commented 10 months ago

On the following file:

"1","DE","","Yes"
"5",",","",","
"3","SA","","No"
"10","abcd""efgh"" \ndef","",""

When I select only index 0, I expect the values 1, 5, 3, 10 but get 1, 5, 10: the row starting with 3 is missing.
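To make the expectation concrete, here is a minimal stdlib-only sketch that reads the same four records and returns the first field of each. It is not uniVocity's implementation and does not use its API. The class name `FirstColumnReference`, the method `firstColumn` and the constant `SAMPLE` are made up for illustration, and the sketch assumes that `\n` inside the last record is a literal backslash followed by `n`, as the printed output further down suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference reader: illustrates the expected result only.
final class FirstColumnReference {

    // The four records from the issue, with '"' as both quote and quote escape.
    static final String SAMPLE =
          "\"1\",\"DE\",\"\",\"Yes\"\n"
        + "\"5\",\",\",\"\",\",\"\n"
        + "\"3\",\"SA\",\"\",\"No\"\n"
        + "\"10\",\"abcd\"\"efgh\"\" \\ndef\",\"\",\"\"\n";

    // Returns the first field of every record in the given CSV text.
    static List<String> firstColumn(String csv) {
        List<String> result = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean sawContent = false; // true once the current record has any character
        int column = 0;             // index of the field currently being read

        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);           // delimiters are plain text inside quotes
                } else if (i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                    field.append('"');         // doubled quote is an escaped quote
                    i++;
                } else {
                    inQuotes = false;          // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;
                sawContent = true;
            } else if (c == ',') {
                if (column == 0) result.add(field.toString());
                field.setLength(0);
                column++;
                sawContent = true;
            } else if (c == '\n') {
                // A record with a single field ends here without having seen a comma.
                if (sawContent && column == 0) result.add(field.toString());
                field.setLength(0);
                column = 0;
                sawContent = false;
            } else if (c != '\r') {
                field.append(c);
                sawContent = true;
            }
        }
        // Last record when the input does not end with a newline.
        if (sawContent && column == 0) result.add(field.toString());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(firstColumn(SAMPLE)); // prints [1, 5, 3, 10]
    }
}
```

Every record contributes exactly one value, including the one that follows the row whose fields contain only commas, so a parser restricted to index 0 should return four rows.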

Here is the example which reproduces the issue:

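    // Imports needed by the snippet below
    import com.univocity.parsers.csv.CsvFormat;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;

    CsvParserSettings settings = new CsvParserSettings();
    CsvFormat format = settings.getFormat();
    format.setQuoteEscape('"');
    // Select only the first column; this triggers the problem
    settings.selectIndexes(0);

    CsvParser parser = new CsvParser(settings);
    File initialFile = new File("test.csv");
    InputStream inputStream = new FileInputStream(initialFile);
    List<String[]> allLines = parser.parseAll(inputStream);

    // Print every parsed row with its selected values
    int count = 0;
    for(String[] line : allLines){
      System.out.println("Line " + ++count);
      for(String element : line){
        System.out.println("\t" + element);
      }
      System.out.println();
    }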

The output is:

Line 1
    1

Line 2
    5

Line 3
    10

However, when I select at least three indexes (0, 1, 2) or remove the settings.selectIndexes(0) call, the output is correct. With

    settings.selectIndexes(0, 1, 2);

the output is:

Line 1
    1
    DE
    null

Line 2
    5
    ,
    null

Line 3
    3
    SA
    null

Line 4
    10
    abcd"efgh" \ndef
    null

MaxGekk commented 10 months ago

We ran into this issue in Apache Spark, where the column pruning feature is enabled by default in the CSV datasource. It would be nice to fix the issue in uniVocity rather than disable the feature by default. cc @cloud-fan @HyukjinKwon