Possible encoding issue when processing CSV files

Describe the bug

In a production environment that processes large volumes of CSV records, we occasionally see an IOException with the message: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

When this happens, Anonimatron stops writing to the output file and leaves it incomplete (the input file is approx. 50 MB; the output file is left at 2.2 MB).

To Reproduce

Read CSV files from customers. We are not yet sure what exactly causes this; it happens irregularly.
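To help narrow down whether encoding plays a role, a strict decoding check could be run against a suspect input file. The following is only a sketch: it assumes the files are expected to be UTF-8 (the report does not state the intended encoding), and `EncodingCheck` and `isValidUtf8` are hypothetical names that are not part of Anonimatron.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Anonimatron.
class EncodingCheck {

    /** Returns true if the bytes decode as UTF-8 without malformed or unmappable input. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a 50 MB file, calling `EncodingCheck.isValidUtf8(Files.readAllBytes(path))` should be feasible; a file that fails this check was probably written in another encoding, such as ISO-8859-1 or Windows-1252.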

Expected behavior

There are a few things we expect:

Anonimatron should correctly handle file encodings.
If the file encoding is correct but the contents are malformed, the error should be more informative about the problem.
When writing to the output file stops because of an error, the output file should be deleted and a non-zero exit code should indicate that anonymization failed.
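The last expectation could be met along the following lines. This is a minimal sketch and not Anonimatron's actual implementation: `SafeOutput`, `RecordWriter`, and `writeOrCleanUp` are hypothetical names, and the `.tmp` suffix and the exit codes 0 and 1 are assumptions. The idea is to write to a temporary file next to the target and move it into place only when everything succeeded, so that a failed run never leaves a partial output file behind.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Anonimatron.
class SafeOutput {

    /** Hypothetical callback standing in for the actual anonymization step. */
    interface RecordWriter {
        void writeTo(BufferedWriter out) throws IOException;
    }

    /**
     * Writes to a temporary sibling file and moves it to the target only on success.
     * Returns 0 on success and 1 on failure (assumed exit codes).
     */
    static int writeOrCleanUp(Path target, RecordWriter writer) {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                writer.writeTo(out);
            }
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            return 0;
        } catch (IOException | RuntimeException e) {
            // UncheckedIOException (as in the stack trace below) is a RuntimeException.
            System.err.println("Anonymization failed for " + target + ": " + e.getMessage());
            try {
                Files.deleteIfExists(tmp); // remove the incomplete output
            } catch (IOException cleanupFailure) {
                e.addSuppressed(cleanupFailure);
            }
            return 1;
        }
    }
}
```

A caller could pass the returned value to `System.exit(...)`, so that a wrapper script can tell a failed run from a successful one.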

Logs, screenshots

Error in log during file 5xxxxxxxxxx0000105353
Anonymizing from /efs/ccv/backup/5/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
              to /efs/ccv/anonymized/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
Exception in thread "main" java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:650)
        at xxx.anonymize.csv.CSVReader.parseLine(CSVReader.java:85)
        at xxx.anonymize.csv.CSVReader.read(CSVReader.java:46)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:183)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:87)
        at com.rolfje.anonimatron.Anonimatron.anonymize(Anonimatron.java:103)
        at com.rolfje.anonimatron.Anonimatron.main(Anonimatron.java:67)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369)
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290)
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148)
        ... 15 more
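
As far as we can tell from the Commons CSV lexer, this message is raised when the closing quote of a quoted field is followed by a character that is not the delimiter, a line break, or whitespace (for example `"7" inch,ok`). The "(line 1)" part may not refer to the first line of the input file: the stack trace suggests that `CSVReader.parseLine` parses one line at a time, so the parser would always report line 1. The sketch below is a rough pre-check for locating such a character in a line. It only approximates the lexer's rule, does not use the Commons CSV API, and `QuoteCheck` and `firstInvalidCharAfterQuote` are hypothetical names.

```java
// Hypothetical helper, not part of Anonimatron or Commons CSV.
class QuoteCheck {

    /**
     * Returns the index of the first character that follows a closing quote but is
     * neither the delimiter, a line break, nor whitespace; returns -1 if there is none.
     */
    static int firstInvalidCharAfterQuote(String text, char delimiter, char quote) {
        boolean inQuotes = false;
        boolean afterClosingQuote = false;
        boolean atFieldStart = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < text.length() && text.charAt(i + 1) == quote) {
                        i++; // a doubled quote is an escaped quote inside the field
                    } else {
                        inQuotes = false;
                        afterClosingQuote = true;
                    }
                }
            } else if (afterClosingQuote) {
                if (c == delimiter || c == '\n' || c == '\r') {
                    afterClosingQuote = false;
                    atFieldStart = true;
                } else if (!Character.isWhitespace(c)) {
                    return i; // stray character between closing quote and delimiter
                }
            } else if (c == delimiter || c == '\n' || c == '\r') {
                atFieldStart = true;
            } else if (c == quote && atFieldStart) {
                inQuotes = true;
                atFieldStart = false;
            } else {
                atFieldStart = false;
            }
        }
        return -1;
    }
}
```

Running this over each line of a failing input file, with the delimiter and quote character from the actual CSV configuration, could help pinpoint the offending record without sharing customer data.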

Further details were sent by private email because they may contain sensitive data or customer information.

Desktop (please complete the following information):

OS: CentOS
Java version: OpenJDK 17
Anonimatron: v1.15

Additional context

Please let us know when this problem is fixed, so that we can fix and re-process the incomplete files.

realrolfje / anonimatron

Possible encoding issue when processing CSV files #235