openpreserve / odf-validator

Open source Open Document Format (ODF) validation
http://odf.openpreservation.org/
BSD 3-Clause "New" or "Revised" License

Out of memory for large batch of .ods files: all result messages lost #155

Open RvanVeenendaal opened 2 months ago

RvanVeenendaal commented 2 months ago

When running the ODF Validator against a large set of .ods files - in my case 1056 (copies of copies, etc.) - it crashes with an OutOfMemoryError.

With -Xmx4000m, -Xmx6000m and even -Xmx8000m it runs longer, but still crashes eventually. With -Xmx6000m the process was using 4.2 GB of RAM just before crashing; with -Xmx8000m, 5.6 GB. Tested on a Windows 10 laptop with 8 GB RAM and a 3.10 GHz Intel Xeon Gold 6346 CPU. Corpus: ESCO dataset - v1.1.2, NL (copied several times over).

It seems that validation messages are kept in memory for printing after all files have been validated. Would it help, and still be considered good practice, to validate each .ods file, output its results and free the memory before moving on to the next file? That should prevent OutOfMemoryErrors with large batches, and with smaller batches that produce many messages per .ods file.
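To make the proposal concrete, here is a minimal sketch of the per-file approach in Java. It is not the ODF Validator's actual code: `StreamingBatchSketch`, `FileValidator` and `validateBatch` are hypothetical names invented for illustration, and the real validator's API may look quite different. The only point is that each file's messages are written and flushed before the next file is touched, so nothing accumulates across the batch.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Path;
import java.util.List;

public class StreamingBatchSketch {

    /** Hypothetical stand-in for the real validator: one file in, its messages out. */
    interface FileValidator {
        List<String> validate(Path file) throws IOException;
    }

    /**
     * Validates files one at a time and writes each file's messages immediately.
     * Returns the number of files whose validation threw an exception.
     */
    static int validateBatch(Iterable<Path> files, FileValidator validator, PrintWriter out) {
        int failures = 0;
        for (Path file : files) {
            try {
                // The message list is local to this iteration, so it becomes
                // eligible for garbage collection as soon as it has been printed.
                List<String> messages = validator.validate(file);
                out.println(file + ": " + messages.size() + " message(s)");
                for (String message : messages) {
                    out.println("  " + message);
                }
            } catch (Exception e) {
                // One bad file is reported and skipped; the batch carries on.
                failures++;
                out.println(file + ": FAILED (" + e + ")");
            }
            // Flush per file: output already written survives a later crash,
            // including an OutOfMemoryError, which this catch block does not handle.
            out.flush();
        }
        return failures;
    }
}
```

With this shape, a crash on file 751 would still leave the results for the first 750 files in the output, and only the remaining files would need to be re-run.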

This might also spare users the irritation of the last file in a batch crashing the ODF Validator for some reason and taking the in-memory validation results for the rest of the batch with it. In my -Xmx8000m case the ODF Validator had processed about 750 of the .ods files at the time of crashing. Because of the crash I saw no results and would have to re-run the process for all files.

It also seems a 'greener' solution: you would not have to re-run the ODF Validator against the full batch, only against that last file, saving computing time and quite a lot of RAM activity and swap disk I/O.

The exception:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOf(Arrays.java:3540)
        at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
        at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:132)
        at java.base/java.io.InputStream.transferTo(InputStream.java:797)
        at org.openpreservation.format.zip.ZipFileProcessor.getEntryInputStream(ZipFileProcessor.java:118)
        at org.openpreservation.odf.pkg.PackageParserImpl.processEntry(PackageParserImpl.java:129)
        at org.openpreservation.odf.pkg.PackageParserImpl.processZipEntries(PackageParserImpl.java:109)
        at org.openpreservation.odf.pkg.PackageParserImpl.parsePackage(PackageParserImpl.java:100)
        at org.openpreservation.odf.pkg.PackageParserImpl.parsePackage(PackageParserImpl.java:70)
        at org.openpreservation.odf.validation.ValidatingParserImpl.parsePackage(ValidatingParserImpl.java:74)
        at org.openpreservation.odf.validation.Validator.validatePackage(Validator.java:107)
        at org.openpreservation.odf.validation.Validator.validate(Validator.java:83)
        at org.openpreservation.odf.apps.CliValidator.validatePath(CliValidator.java:68)
        at org.openpreservation.odf.apps.CliValidator.call(CliValidator.java:60)
        at org.openpreservation.odf.apps.CliValidator.call(CliValidator.java:35)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
        at picocli.CommandLine.execute(CommandLine.java:2170)
        at org.openpreservation.odf.apps.CliValidator.main(CliValidator.java:87)

carlwilson commented 2 months ago

You're absolutely right, @RvanVeenendaal, and I'll get a refactored fix out shortly. Keeping the results in memory was handy when debugging early versions, but it is a problem in a production version.