ArrayIndexOutOfBoundsException in AbstractCharInputReader when trim = true

icassina commented 3 years ago

After upgrading from Spark 2.4 to Spark 3.0.1, we experienced a regression in our tests.

Reading CSV files worked fine before, but now it sometimes triggers an ArrayIndexOutOfBoundsException in AbstractCharInputReader:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at com.univocity.parsers.common.input.AbstractCharInputReader.getString(AbstractCharInputReader.java:482)
        at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:186)
        at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)
        at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:577)

Here's the full error message, including the parser configuration:

Job aborted due to stage failure: Task 0 in stage 20799.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20799.0 (TID 71037, ip-10-0-1-172.ec2.internal, executor driver): com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - -1
Hint: Number of characters processed may have exceeded limit of -1 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
    Auto configuration enabled=true
    Auto-closing enabled=true
    Autodetect column delimiter=false
    Autodetect quotes=false
    Column reordering enabled=true
    Delimiters for detection=null
    Empty value=
    Escape unquoted values=false
    Header extraction enabled=null
    Headers=null
    Ignore leading whitespaces=false
    Ignore leading whitespaces in quotes=false
    Ignore trailing whitespaces=true
    Ignore trailing whitespaces in quotes=false
    Input buffer size=128
    Input reading on separate thread=false
    Keep escape sequences=false
    Keep quotes=false
    Length of content displayed on error=1000
    Line separator detection enabled=true
    Maximum number of characters per column=-1
    Maximum number of columns=20480
    Normalize escaped line separators=true
    Null value=
    Number of records to read=all
    Processor=none
    Restricting data in exceptions=false
    RowProcessor error handler=null
    Selected fields=none
    Skip bits as whitespace=true
    Skip empty lines=true
    Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
    CsvFormat:
        Comment character=#
        Field delimiter=\t
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character="
        Quote escape escape character=null
Internal state when error was thrown: line=1, column=11, record=1, charIndex=257, headers=[/* redacted for privacy */]

The input file is a tab-separated (\t) CSV with \r\n line endings. The code was executed on a Linux machine.
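
To illustrate the mismatch between the \r\n input and the \n line separator in the configuration above, here is a minimal sketch that normalises line endings before the text reaches a parser. It assumes the whole input fits in memory as a String. The names LineEndings and toLf are hypothetical and are not part of Spark or univocity-parsers, and nothing in this thread confirms that this avoids the exception.

```java
public class LineEndings {
    // Hypothetical helper: converts Windows (\r\n) line endings to Unix (\n).
    // A lone \r that is not followed by \n is left untouched.
    public static String toLf(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        // Two tab-separated records terminated by \r\n, like the reported input.
        String sample = "a\tb\tc\r\n1\t2\t3\r\n";
        String normalized = toLf(sample);
        // Report whether any carriage return survived the normalisation.
        System.out.println("contains \\r: " + normalized.contains("\r"));
    }
}
```

Normalising up front keeps the parser settings unchanged, at the cost of copying the input once.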

icassina commented 3 years ago

The exception does not happen when "Ignore trailing whitespaces" is set to false, but the parser then produces an extra column for every row, containing ^M (\r).
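
If trimming has to stay disabled for now, the stray \r could be removed after parsing. The following is only a sketch under that assumption: TrailingCr and clean are hypothetical names, not part of univocity-parsers, and the helper assumes each parsed row is available as a String[].

```java
import java.util.Arrays;

public class TrailingCr {
    // Hypothetical helper: removes a stray carriage return from the end of a parsed row.
    // If the last column is exactly "\r" it is dropped; if it merely ends with "\r",
    // that character is stripped. Other rows are returned unchanged.
    public static String[] clean(String[] row) {
        if (row.length == 0) {
            return row;
        }
        String last = row[row.length - 1];
        if (last == null) {
            return row; // empty columns may come back as null
        }
        if (last.equals("\r")) {
            return Arrays.copyOf(row, row.length - 1);
        }
        if (last.endsWith("\r")) {
            String[] copy = row.clone();
            copy[copy.length - 1] = last.substring(0, last.length() - 1);
            return copy;
        }
        return row;
    }

    public static void main(String[] args) {
        String[] cleaned = clean(new String[] {"a", "b", "\r"});
        System.out.println(cleaned.length + " columns after cleaning");
    }
}
```

The helper never modifies the array it receives, so it can be applied to rows that are still referenced elsewhere.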

HyukjinKwon commented 3 years ago

@jbax, FYI this is the self-contained reproducer:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

class Issue449 {
    public static void main(String[] args) {
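        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter("|");
        // Keep leading whitespace; trailing-whitespace trimming stays at its default (enabled).
        settings.setIgnoreLeadingWhitespaces(false);
        // Same input buffer size as in the reported parser configuration.
        settings.setInputBufferSize(128);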

        CsvParser parser = new CsvParser(settings);
        String line = "XX   |XXX-XXXX            |XXXXXX              " +
                "|XXXXXXXX|XXXXX               |XXXXXX              " +
                "|X|XXXXXXX|XXXXXXXX|XXXX|XXXXXXXXXXXXXXX     |XXXXXXXXXXX" +
                "|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX|XXXXXX              " +
                "|XXXXXXXXXXXXXX|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX" +
                "|XXXXXX              |XXXXXXXXXXXXXXXXXXXXXX|XXXXXX              " +
                "|XXXXXXXXX|XXXXXX              |XXXXXXX|                    " +
                "||                    ||                    " +
                "||                    ||XXXX-XX-XX 00:00:00.0000000" +
                "||XXXXX.XXXXXXXXXXXXXXX|XXXXX.XXXXXXXXXXXXXX" +
                "|XXXXX.XXXXXXXXXXXXXXX|X|XXXXXX              |X";
        parser.parseLine(line);
    }
}

jbax commented 3 years ago

Fixed, I'll release version 2.9.2 tomorrow with the adjustment.

HyukjinKwon commented 3 years ago

Thanks guys!

rupeshbhujbal41184 commented 2 years ago

When will version 2.9.2 be available for download?

I can't see it in the download section yet: https://www.univocity.com/pages/univocity_parsers_download
