realrolfje / anonimatron

Anonimatron. Providing GDPR compliance since 2010.
https://realrolfje.github.io/anonimatron/
MIT License
105 stars, 51 forks

Possible encoding issue when processing CSV files #235

Open realrolfje opened 2 months ago

realrolfje commented 2 months ago

Describe the bug

In a production situation, while processing large numbers of CSV records, we sometimes see an IOException with the text: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

When this happens, Anonimatron stops writing to the output file and leaves it incomplete (the input file is approx. 50 MB; the output file is left at 2.2 MB).
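One way to avoid leaving a truncated output file behind when parsing fails partway is to write to a temporary file and move it into place only after the whole input has been processed. The following is a minimal sketch of that idea using only the JDK. `SafeOutput`, `WriterAction` and `writeThenMove` are hypothetical names for illustration and are not part of Anonimatron.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: the target file only appears if writing succeeded. */
final class SafeOutput {

    /** Callback that produces the file content (hypothetical interface). */
    interface WriterAction {
        void writeTo(BufferedWriter out) throws IOException;
    }

    static void writeThenMove(Path target, WriterAction action) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Create the temporary file next to the target so the move stays
        // on the same file system.
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".partial");
        try {
            try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                action.writeTo(out);
            }
            // Reached only if the action completed without throwing.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Removes the partial file after a failure; no-op after a successful move.
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this pattern a run that fails with an exception leaves no file at the target path, so incomplete results cannot be mistaken for finished ones. Whether that is preferable to keeping the partial output for inspection is a design choice.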

To Reproduce

Read CSV files from customers. We are not yet sure what exactly causes this; it happens irregularly.
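To help narrow down which input lines trigger the error, a small scanner can flag the situation the message describes: a quoted field whose closing quote is followed by a non-whitespace character before the next delimiter (for example an unescaped quote inside a quoted value). The sketch below is a simplified model of that rule, assuming `,` as delimiter, `"` as quote character and no escape character. `QuoteCheck` and `firstStrayCharAfterQuote` are hypothetical names, and the code does not reproduce the Commons CSV lexer exactly.

```java
import java.util.OptionalInt;

/** Hypothetical diagnostic helper, not part of Anonimatron or Commons CSV. */
final class QuoteCheck {

    /**
     * Returns the index of the first non-whitespace character found between
     * a closing quote and the next delimiter, or empty if there is none.
     */
    static OptionalInt firstStrayCharAfterQuote(String line, char delimiter, char quote) {
        int i = 0;
        int n = line.length();
        while (i < n) {
            if (line.charAt(i) == quote) {
                i++; // skip the opening quote
                boolean closed = false;
                while (i < n) {
                    if (line.charAt(i) == quote) {
                        if (i + 1 < n && line.charAt(i + 1) == quote) {
                            i += 2; // doubled quote is a literal quote
                            continue;
                        }
                        closed = true;
                        i++; // skip the closing quote
                        break;
                    }
                    i++;
                }
                if (!closed) {
                    // Unterminated quote: a different error, not reported here.
                    return OptionalInt.empty();
                }
                // Only whitespace may appear before the next delimiter.
                while (i < n && line.charAt(i) != delimiter) {
                    if (!Character.isWhitespace(line.charAt(i))) {
                        return OptionalInt.of(i);
                    }
                    i++;
                }
            } else {
                // Unquoted field: read up to the next delimiter.
                while (i < n && line.charAt(i) != delimiter) {
                    i++;
                }
            }
            i++; // skip the delimiter
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Made-up sample line with an unescaped quote inside a quoted value.
        System.out.println(firstStrayCharAfterQuote("\"5\" monitor\",x", ',', '"'));
    }
}
```

Running this over each line of a failing input file and logging the lines for which a position is returned could show whether the failures come from malformed quoting rather than from the character encoding.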

Expected behavior

There are a few things we expect:

Logs, screenshots

Error in log during file 5xxxxxxxxxx0000105353
Anonymizing from /efs/ccv/backup/5/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
              to /efs/ccv/anonymized/5_xxxxxxx0000105353.20240425T040655Z.361E5E026D6F0A07FC611B35C2FEF093.complete
Exception in thread "main" java.io.UncheckedIOException: IOException reading next record: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:150)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.hasNext(CSVParser.java:160)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:650)
        at xxx.anonymize.csv.CSVReader.parseLine(CSVReader.java:85)
        at xxx.anonymize.csv.CSVReader.read(CSVReader.java:46)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:183)
        at com.rolfje.anonimatron.file.FileAnonymizerService.anonymize(FileAnonymizerService.java:87)
        at com.rolfje.anonimatron.Anonimatron.anonymize(Anonimatron.java:103)
        at com.rolfje.anonimatron.Anonimatron.main(Anonimatron.java:67)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:369)
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:290)
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:770)
        at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:148)
        ... 15 more

Further details are in personal mail because of possibly sensitive data or customer information.

Additional context

Please let us know when this problem is fixed, so that we can (fix and) re-process the incomplete files.