uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Values between "a quoted and escaped quote" and "a quoted value, that starts with the delimiter" are skipped #508

Open kasgilpofi opened 2 years ago

Version of Univocity

<dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>2.9.1</version>
</dependency>

Problem

I am parsing a valid (RFC 4180) CSV file that contains "a quoted and escaped quote" ("""") and, further on, "a quoted value that starts with the delimiter" (e.g. ";abc").

With the settings

selectFields(...) and setNormalizeLineEndingsWithinQuotes(false)

all values between the "quoted and escaped quote" and the "quoted value that starts with the delimiter" are skipped.

The problem does not occur with NormalizeLineEndingsWithinQuotes=true.

The problem appears to be caused by

AbstractCharInputReader.skipQuotedString(char quote, char escape, char stop1, char stop2)

which doesn't seem to properly handle "quoted and escaped quotes"
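To illustrate the behaviour I would expect when a quoted value is skipped, here is a minimal, self-contained sketch. It is not the univocity implementation: the class QuotedSkipSketch and the helpers skipQuoted and selectColumns are hypothetical names made up for this example. It only assumes RFC 4180 style quoting, where the escape character is followed by a quote, and it does not handle an escaped escape character when the escape differs from the quote.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedSkipSketch {

    // Hypothetical helper (not univocity code): returns the index just past the
    // closing quote of the quoted value that opens at position `start`.
    static int skipQuoted(String s, int start, char quote, char escape) {
        int i = start + 1; // step over the opening quote
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == escape && i + 1 < s.length() && s.charAt(i + 1) == quote) {
                i += 2; // escaped quote: consume both characters, stay inside the value
            } else if (c == quote) {
                return i + 1; // unescaped quote closes the value
            } else {
                i++; // ordinary character, including a delimiter inside the quotes
            }
        }
        return s.length(); // unterminated value: consume the rest of the input
    }

    // Hypothetical helper: returns the raw text of the columns flagged in `keep`;
    // all other columns are skipped without being parsed.
    static List<String> selectColumns(String line, char delimiter, char quote,
                                      char escape, boolean[] keep) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        int col = 0;
        while (pos <= line.length()) {
            int end;
            if (pos < line.length() && line.charAt(pos) == quote) {
                end = skipQuoted(line, pos, quote, escape);
            } else {
                end = line.indexOf(delimiter, pos);
                if (end < 0) {
                    end = line.length();
                }
            }
            if (col < keep.length && keep[col]) {
                out.add(line.substring(pos, end));
            }
            pos = end + 1; // step over the delimiter
            col++;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "1;\"\"\"\";100\n2;abc;101\n10;\";abc\";200";
        boolean[] keep = {true, false, true}; // columns A and C
        for (String line : text.split("\n")) {
            System.out.println(selectColumns(line, ';', '"', '"', keep));
        }
        // prints:
        // [1, 100]
        // [2, 101]
        // [10, 200]
    }
}
```

The point of the sketch is the order of the checks in skipQuoted: because the quote and the escape are the same character here, the escaped pair has to be tested before the closing quote. Otherwise the second quote of """" would be taken as the end of the value and the skip would run on into the following lines.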

CSV-Data

One line contains a value that is a single quote character (quoted, and escaped with a quote). The next quoted value in the file starts with the delimiter.

e.g.

1;"""";100
2;abc;101
10;";abc";200

Example

import com.univocity.parsers.common.Context;
import com.univocity.parsers.common.processor.core.Processor;
import com.univocity.parsers.csv.Csv;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CsvParserTesting {

    public static void main(String[] args) {
        try {
            CsvParserSettings settings = Csv.parseRfc4180();

            settings.getFormat().setDelimiter(";");
            settings.getFormat().setLineSeparator("\n");
            settings.getFormat().setQuote('"');
            settings.getFormat().setQuoteEscape('"');
            settings.getFormat().setComment('#');

            settings.setMaxColumns(300);
            settings.setMaxCharsPerColumn(-1);
            settings.setEmptyValue("");
            settings.setNullValue("");
            settings.setIgnoreTrailingWhitespaces(true);
            settings.setIgnoreLeadingWhitespaces(true);
            settings.setReadInputOnSeparateThread(false);
            settings.setSkipEmptyLines(true);
            settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
            settings.setErrorContentLength(1000);
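            settings.setHeaders("A", "B", "C");
            // Only A and C are selected, so column B (the quoted values) is not returned.
            settings.selectFields("A", "C");

            // The problem only occurs with false; with true the output is as expected.
            settings.setNormalizeLineEndingsWithinQuotes(false);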

            settings.setProcessor(new Processor<Context>() {
                @Override
                public void processStarted(Context context) {
                    System.out.println("processStarted");
                }

                @Override
                public void rowProcessed(String[] row, Context context) {
                    System.out.println(Arrays.toString(row));
                }

                @Override
                public void processEnded(Context context) {
                    System.out.println("processEnded");
                }
            });

            CsvParser csvParser = new CsvParser(settings);

            String text = "";

            text += "1;\"\"\"\";100";
            text += "\n2;abc;101";
            text += "\n10;\";abc\";200";
            ByteArrayInputStream is = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));

            csvParser.parse(is);

        } catch (Throwable th) {
            th.printStackTrace();
        }
    }
}

Expected output

processStarted
[1, 100]
[2, 101]
[10, 200]
processEnded

Actual output

processStarted
[1, abc"]
processEnded