sunmingtao / sample-code

PDFBox java.lang.OutOfMemoryError: Java heap space when image is too large #348

Status: Closed (sunmingtao closed 4 months ago)

sunmingtao commented 4 months ago

I am getting the error below, even though I have set the VM args -Xms512m -Xmx4g, and I am pretty sure I didn't even use 2 GB of memory.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
    at java.base/java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:211)
    at org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory$PredictorEncoder.preparePredictorPDImage(LosslessFactory.java:603)
    at org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory$PredictorEncoder.encode(LosslessFactory.java:499)
    at org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory.createFromImage(LosslessFactory.java:90)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.createFromFileByExtension(PDImageXObject.java:251)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.createFromFile(PDImageXObject.java:193)
    at au.gov.nla.banjo.services.PdfBoxPlayground.addPageToDocument(PdfBoxPlayground.java:126)
    at au.gov.nla.banjo.services.PdfBoxPlayground.generatePDFs(PdfBoxPlayground.java:60)
    at au.gov.nla.banjo.services.PdfBoxPlayground.main(PdfBoxPlayground.java:36)
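
The trace fails while copying an encoded buffer inside LosslessFactory, which means the image had already been decoded into memory by then. A large image can need far more heap than its file size suggests, so it may help to check two things: whether the -Xmx value actually reached the JVM, and roughly how big the decoded pixels are. The sketch below is only an illustration. The class and method names are hypothetical, and the 4 bytes per pixel figure is an assumption (ARGB); the real figure depends on the image type.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper, not part of PDFBox.
public class ImageHeapEstimate {

    /** Reads width and height from the image header without decoding the pixel data. */
    public static long[] readDimensions(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for " + file);
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the header is needed
                reader.setInput(in, true, true);
                return new long[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        }
    }

    /** Rough size of the decoded raster, assuming a fixed number of bytes per pixel. */
    public static long estimateDecodedBytes(long width, long height, int bytesPerPixel) {
        return width * height * bytesPerPixel;
    }

    public static void main(String[] args) throws IOException {
        // Confirms what the JVM really got from -Xmx
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MiB%n", maxHeap / (1024 * 1024));

        for (String path : args) {
            long[] dim = readDimensions(new File(path));
            long bytes = estimateDecodedBytes(dim[0], dim[1], 4); // assumption: 4 bytes per pixel
            System.out.printf("%s: %d x %d px, about %d MiB decoded%n",
                    path, dim[0], dim[1], bytes / (1024 * 1024));
        }
    }
}
```

Run it with the same VM args as the failing program and pass the image paths as arguments. The estimate covers the decoded pixels only; any encoded buffers and their copies come on top of that.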
sunmingtao commented 4 months ago

Switching to PDImageXObject.createFromFileByContent(file, document) fixed the issue. Unlike createFromFile, which picks the image handler from the file name suffix, this method determines the format from the file content, so the same file can take a different code path. Possible reasons why the change resolved the memory issue:

Streamed Content: createFromFileByContent may handle the image data more economically, possibly by embedding the file's already-compressed bytes in the PDF instead of decoding every pixel onto the heap and re-encoding it, which is what the LosslessFactory path in the trace does. For large images that difference can be substantial.

Reduced Memory Footprint: If the entire decoded image does not have to be held in memory, peak heap usage drops. This matters most for very large images, where the uncompressed pixel data is far bigger than the file on disk, and when many files are processed in one run.

Garbage Collection Efficiency: The trace fails in ByteArrayOutputStream.toByteArray, which copies the whole encoded buffer, so at that moment the decoded image, the buffer, and its copy all have to fit in the heap together. The garbage collector cannot reclaim objects that are still referenced, so a path that allocates fewer large temporary arrays lowers that peak and makes an OutOfMemoryError less likely.

Optimized Internal Handling: Apache PDFBox might also manage buffers or compression differently on the createFromFileByContent path, which would further reduce memory pressure.
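
Since createFromFileByContent looks at the content and not the file name, one way to see whether the two methods could have routed a file differently is to compare its extension with its leading "magic" bytes. The sketch below is a stand-alone illustration of that idea, not PDFBox's own detection code. The class and method names are hypothetical, and only a few common signatures are checked.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox.
public class ImageTypeSniffer {

    /** Guesses the image format from the leading bytes of a file. */
    public static String sniff(byte[] head) {
        if (startsWith(head, 0xFF, 0xD8, 0xFF)) return "jpeg";
        if (startsWith(head, 0x89, 'P', 'N', 'G')) return "png";
        if (startsWith(head, 'G', 'I', 'F', '8')) return "gif";
        if (startsWith(head, 'B', 'M')) return "bmp";
        if (startsWith(head, 'I', 'I', 0x2A, 0x00)
                || startsWith(head, 'M', 'M', 0x00, 0x2A)) return "tiff";
        return "unknown";
    }

    /** Reads only the first few bytes, so the cost does not depend on the image size. */
    public static String sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(head, 0, head.length);
            return sniff(Arrays.copyOf(head, n));
        }
    }

    /** Lower-cased text after the last dot, or "" if there is none. */
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.printf("%s: extension=%s, content=%s%n",
                    arg, extensionOf(arg), sniff(Paths.get(arg)));
        }
    }
}
```

If the extension and the detected content disagree for the images that triggered the error, that would be consistent with the two methods taking different paths. If they agree, the explanation has to be sought elsewhere.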