kestra-io / plugin-serdes

https://kestra.io/plugins/plugin-serdes/
Apache License 2.0

CsvReader throws: `arraycopy: last source index 8193 out of bounds for char[8192]` #78

Closed: anna-geller closed this issue 10 months ago

anna-geller commented 10 months ago

Feature description

Full stacktrace:

arraycopy: last source index 8193 out of bounds for char[8192]
2023-12-18 18:32:46.097 java.lang.ArrayIndexOutOfBoundsException: arraycopy: last source index 8193 out of bounds for char[8192]
    at java.base/java.lang.System.arraycopy(Native Method)
    at de.siegmar.fastcsv.reader.ReusableStringBuilder.append(ReusableStringBuilder.java:65)
    at de.siegmar.fastcsv.reader.RowReader.readLine(RowReader.java:75)
    at de.siegmar.fastcsv.reader.CsvParser.nextRow(CsvParser.java:85)
    at io.kestra.plugin.serdes.csv.CsvReader.lambda$nextRow$3(CsvReader.java:146)
    at io.reactivex.internal.operators.flowable.FlowableCreate.subscribeActual(FlowableCreate.java:71)
    at io.reactivex.Flowable.subscribe(Flowable.java:14935)
    at io.reactivex.internal.operators.flowable.FlowableFilter.subscribeActual(FlowableFilter.java:37)
    at io.reactivex.Flowable.subscribe(Flowable.java:14935)
    at io.reactivex.internal.operators.flowable.FlowableMap.subscribeActual(FlowableMap.java:37)
    at io.reactivex.Flowable.subscribe(Flowable.java:14935)
    at io.reactivex.internal.operators.flowable.FlowableDoOnEach.subscribeActual(FlowableDoOnEach.java:50)
    at io.reactivex.Flowable.subscribe(Flowable.java:14935)
    at io.reactivex.internal.operators.flowable.FlowableCountSingle.subscribeActual(FlowableCountSingle.java:34)
    at io.reactivex.Single.subscribe(Single.java:3666)
    at io.reactivex.Single.blockingGet(Single.java:2869)
    at io.kestra.plugin.serdes.csv.CsvReader.run(CsvReader.java:122)
    at io.kestra.plugin.serdes.csv.CsvReader.run(CsvReader.java:28)
    at io.kestra.core.runners.Worker$WorkerThread.run(Worker.java:711)

Reproducer:

id: hello-world
namespace: company.team
tasks:
  - id: get
    type: io.kestra.plugin.fs.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/resolve/main/csv/USvideos.csv?download=true

  - id: csv_reader
    type: io.kestra.plugin.serdes.csv.CsvReader
    from: "{{ outputs.get.uri }}"
    header: true
    charset: ASCII # UTF-8

Reading the file in pandas works fine.
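A minimal sketch of that check (assuming the same USvideos.csv from the reproducer has been downloaded locally; the path is only an example):

import pandas as pd

# the same file that makes CsvReader fail reads cleanly here
df = pd.read_csv("USvideos.csv")
print(len(df), "rows read without errors")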


"I reduced the csv file to only 500 rows, still the same error."

Skraye commented 10 months ago

I guess the issue is due to your file; for example, if I remove line 47 (iPhone X vs Makeup Transformation (Face ID TEST)), the error happens on another line later

I took the first 50 lines, removed the one mentioned above, and it works fine

anna-geller commented 10 months ago

@Skraye we still need a proper solution for it, e.g. a property to decide what to do with bad lines

In pandas, there is an "on_bad_lines" option:

on_bad_lines : {'error', 'warn', 'skip'} or Callable, default 'error'
Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:
'error', raise an Exception when a bad line is encountered.
'warn', raise a warning when a bad line is encountered and skip that line.
'skip', skip bad lines without raising or warning when they are encountered.
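For reference, a usage sketch of that pandas option (the file path is only an example):

import pandas as pd

# skip malformed rows instead of raising; "warn" would emit a warning per bad line
df = pd.read_csv("USvideos.csv", on_bad_lines="skip")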

It seems OK to do the same as pandas, e.g. adding an enum property onBadLines with options ERROR, WARN, or SKIP

Skraye commented 10 months ago

Sounds good to me! Should we also output the bad row with the error message to help the user debug?

anna-geller commented 10 months ago

Good idea, in the ERROR case yes 👍

For WARN, perhaps output all bad rows as a single file in internal storage, in case there are many of them (imagine a large file with many bad rows)

For SKIP, no need to output bad rows
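Roughly, the proposed semantics as an illustrative Python sketch only (the function name and signature are hypothetical; the real implementation would live in the Java CsvReader):

def handle_bad_row(mode, row, bad_rows):
    # hypothetical helper, only to illustrate the proposed onBadLines behavior
    if mode == "ERROR":
        # fail the task and surface the offending row in the error message
        raise ValueError(f"bad CSV row: {row!r}")
    elif mode == "WARN":
        # warn and collect the row; all bad rows would be written as one file
        # in internal storage at the end of the run
        print(f"WARN: skipping bad row: {row!r}")
        bad_rows.append(row)
    elif mode == "SKIP":
        # drop the row silently, nothing stored
        pass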

loicmathieu commented 10 months ago

This is indeed a bug and is fixed in the latest version of the CSV library we use.

However, offering a way to manage corrupted rows is a good idea; if we do this, we should do it for all serdes readers, not only CSV. Please open an issue describing it.

I'll close this issue with a fix so this file is correctly handled.