Describe the bug
In a production environment processing large numbers of CSV records, we sometimes see an IO exception with the text: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.
When this happens, Anonimatron stops writing to the output file, leaving it incomplete (the input file is approx. 50 MB; the output file is left at 2.2 MB).
To Reproduce
Read CSV files from customers. We are not sure yet what exactly causes this; it happens irregularly.
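For what it's worth, Commons CSV throws this exact message when a stray character follows the closing quote of a quoted field, so malformed quoting in a customer file is a likely trigger. A minimal sketch that reproduces the message (the class name and input are ours, not taken from our actual data):

import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

public class ReproduceInvalidChar {
    public static void main(String[] args) throws IOException {
        // Stray character 'x' directly after the closing quote of an encapsulated token:
        String malformed = "\"first\"x,second";
        try (CSVParser parser = CSVFormat.DEFAULT.parse(new StringReader(malformed))) {
            // Throws: (line 1) invalid char between encapsulated token and delimiter
            parser.getRecords();
        }
    }
}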
Expected behavior
There are a few things we expect:
- Anonimatron should correctly handle file encodings.
- If the file encoding is correct but the contents are broken, the error message should be more informative about the problem.
- When writing to the output file stops because of an error, the output file should be deleted and an exit code should indicate that anonymization failed (see the sketch after this list).
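As a rough sketch of what we mean by these points (the names and wiring here are ours, not Anonimatron's actual code): open the files with an explicit charset, and on any parse error remove the partial output and exit non-zero:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SafeAnonymizeSketch {
    static void anonymize(Path input, Path output) throws IOException {
        try (var reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);   // explicit encoding
             var writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            // ... parse, anonymize, and write records here ...
        } catch (IOException | UncheckedIOException e) {
            Files.deleteIfExists(output);  // don't leave a truncated output file behind
            System.err.println("Anonymizing " + input + " failed: " + e.getMessage());
            System.exit(1);                // non-zero exit code tells the caller the run failed
        }
    }
}

With something like this in place, a monitoring script can detect the failure from the exit code instead of discovering a truncated output file later.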
Logs, screenshots
Error in log during processing of file 5xxxxxxxxxx0000105353
Anonymizing from /efs/ccv/backup/5/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
to /efs/ccv/anonymized/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
Exception in thread "main" java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160)
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:650)
at xxx.anonymize.csv.CSVReader.parseLine(CSVReader.java:85)
at xxx.anonymize.csv.CSVReader.read(CSVReader.java:46)
at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:183)
at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:87)
at com.rolfje.anonimatron.Anonimatron.anonymize(Anonimatron.java:103)
at com.rolfje.anonimatron.Anonimatron.main(Anonimatron.java:67)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148)
... 15 more
Further details have been sent by personal mail because of possibly sensitive data or customer information.
Desktop (please complete the following information):
- OS: CentOS
- Java version: OpenJDK 17
- Anonimatron: v1.15
Additional context
Please let us know when this problem is fixed, so that we can (fix and) re-process the incomplete files.