mkl-public / testarea-pdfbox2

Test area for public PDFBox v2 issues on stackoverflow etc
Apache License 2.0

cosstream has been closed and cannot be read. perhaps its enclosing pddocument has been closed #13

Status: Closed (closed by MattNot 2 years ago)

MattNot commented 2 years ago

Hi!

Thanks for the tool. I have an issue when calling the merge method: sometimes (not always, but quite frequently) it throws an error saying "cosstream has been closed and cannot be read. perhaps its enclosing pddocument has been closed" on line 50 of DenseMerge, and the same happens with VeryDenseTool.

My usage is this:

PdfVeryDenseMergeTool pdfDenseMergeTool = new PdfVeryDenseMergeTool(PDRectangle.A4, dim1, dim2, dim3);

ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

pdfDenseMergeTool.merge(byteArrayOutputStream, listOfDocuments);

Any idea of what is happening SOMETIMES? It feels like a problem of synch between the streams but idk

The problem is that it is not deterministic: with exactly the same input, it may or may not throw the error.

mkl-public commented 2 years ago

"Any idea of what is happening SOMETIMES? It feels like a problem of synch between the streams but idk"

That error "cosstream has been closed and cannot be read. perhaps its enclosing pddocument has been closed" during a document.save call usually means that, for some stream (often a page content stream), the document it originally came from has been closed. If that only happens sometimes, chances are that the document was not closed by an explicit close call but implicitly by garbage collection.

In the context at hand this might have happened if one of the documents in listOfDocuments has been created by a method like this:

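PDDocument getDocument() throws IOException {
    // load some existing PDF into a new PDDocument (the file name is only a placeholder)
    PDDocument source = PDDocument.load(new File("source.pdf"));
    PDDocument result = new PDDocument();
    // ... add one or more pages from source to result
    return result; // source is neither closed nor referenced after this point
}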

After that method has returned, the PDDocument in source is no longer referenced and may be found and processed by garbage collection at any time.

Thus, if pdfDenseMergeTool.merge is executed before garbage collection processes that PDDocument, all is well. If it is executed afterwards, the content stream data of the pages taken from that document can no longer be accessed, so the save must fail.
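To make the dependency concrete, here is a toy model in plain Java. It is not PDFBox code: ToyDocument, ToyPage and their methods are hypothetical stand-ins, and the model throws an IllegalStateException where PDFBox reports an IOException. It only illustrates the assumption behind the explanation above, namely that a page added to another document still reads its data from the document that originally owned it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only (not PDFBox code): pages keep reading from the document that created them. */
final class SharedStreamDemo {

    /** Stands in for a document that owns the backing storage of its pages. */
    static final class ToyDocument implements AutoCloseable {
        private boolean closed;
        private final List<ToyPage> pages = new ArrayList<>();

        /** Creates a page whose data is owned by this document. */
        ToyPage newPage(String content) {
            ToyPage page = new ToyPage(this, content);
            pages.add(page);
            return page;
        }

        /** Adds a page by reference; nothing is copied. */
        void addPage(ToyPage page) {
            pages.add(page);
        }

        /** Reads every page; fails if the owner of any page has been closed. */
        String save() {
            StringBuilder out = new StringBuilder();
            for (ToyPage page : pages) {
                out.append(page.read());
            }
            return out.toString();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Stands in for a page whose content stream lives in its owning document. */
    static final class ToyPage {
        private final ToyDocument owner;
        private final String content;

        ToyPage(ToyDocument owner, String content) {
            this.owner = owner;
            this.content = content;
        }

        String read() {
            if (owner.closed) {
                throw new IllegalStateException(
                        "stream has been closed; perhaps its enclosing document has been closed");
            }
            return content;
        }
    }
}
```

In this model, saving the result works as long as the source is open and fails as soon as the source has been closed, whether that close is explicit or, as suspected in this issue, triggered by garbage collection.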

To prevent this, you can keep references to all such interim PDDocument instances yourself, so that garbage collection cannot process them, and release that collection only after the merge call.
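One way to organize this could look like the following minimal sketch. SourceKeeper is a hypothetical helper, not part of PDFBox or of this repository; it is written against java.io.Closeable (which PDDocument implements) so that it compiles without PDFBox on the classpath. It goes one step beyond merely holding references by also closing the sources once they are no longer needed.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: keeps source documents strongly reachable
 * (so garbage collection cannot process them) until close() is called.
 */
final class SourceKeeper implements AutoCloseable {
    private final List<Closeable> sources = new ArrayList<>();

    /** Registers a source (for example a PDDocument) and returns it for further use. */
    <T extends Closeable> T keep(T source) {
        sources.add(source);
        return source;
    }

    /** Number of sources currently held. */
    int size() {
        return sources.size();
    }

    /** Closes all registered sources; call this only after the merged result has been saved. */
    @Override
    public void close() {
        IOException first = null;
        for (Closeable source : sources) {
            try {
                source.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        sources.clear();
        if (first != null) {
            throw new UncheckedIOException(first);
        }
    }
}
```

A getDocument-style method would then pass every loaded source through keeper.keep(...), and the caller would close the keeper only after pdfDenseMergeTool.merge has returned, for instance by wrapping the whole sequence in a try-with-resources block on the keeper.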

Alternatively you can try to use importPage instead of addPage; the code of that method looks a bit incomplete to me, though.

MattNot commented 2 years ago

Hi! Thank you for your response.

You were right, the Garbage Collector was closing the documents!

Solved by creating a "dummy" list of the document instances, just to keep them referenced until I call the merge method.

Thank you very much, the issue can be closed for me.