unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

Memory management for large PDFs #19

Open plaisted opened 6 years ago

plaisted commented 6 years ago

Currently it isn't possible to parse/edit/write PDFs if their size approaches the available memory on the computer. Unidoc parses the object tree and stores it in memory. Given the architecture of a PDF file, we should be able to parse (and, for example, extract text from) arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this, but it would require pages/objects to be freeable from memory when no longer needed.

Additionally, when writing large PDFs, something would need to be implemented to either cache objects to disk before writing the completed PDF, or stream the PDF out as each page is completed, releasing that page's memory afterwards. Shared objects would make this tricky.

gunnsth commented 6 years ago

I think unidoc/#128 (lazy loading) would help with this. We could also put a size limit on the object cache, probably freeing the objects with the oldest load timestamp first, then just loading them again when needed.
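The size-limited cache described above could look roughly like this. This is an illustrative sketch, not unipdf code: `objectCache` and the `load` callback are hypothetical names, and eviction is by oldest load sequence, standing in for the load timestamp:

```go
package main

import "fmt"

// objectCache caps the number of cached objects; once full, the entry with
// the oldest load sequence is evicted and reloaded on its next access.
type objectCache struct {
	maxObjects int
	loadSeq    int64
	entries    map[int]*cacheEntry
	load       func(objNum int) []byte // reloads an object from the file
}

type cacheEntry struct {
	data []byte
	seq  int64 // load order; oldest is evicted first
}

func newObjectCache(max int, load func(int) []byte) *objectCache {
	return &objectCache{maxObjects: max, entries: map[int]*cacheEntry{}, load: load}
}

func (c *objectCache) Get(objNum int) []byte {
	if e, ok := c.entries[objNum]; ok {
		return e.data
	}
	if len(c.entries) >= c.maxObjects {
		c.evictOldest()
	}
	c.loadSeq++
	e := &cacheEntry{data: c.load(objNum), seq: c.loadSeq}
	c.entries[objNum] = e
	return e.data
}

func (c *objectCache) evictOldest() {
	oldestNum, oldestSeq := -1, int64(1<<62)
	for num, e := range c.entries {
		if e.seq < oldestSeq {
			oldestNum, oldestSeq = num, e.seq
		}
	}
	delete(c.entries, oldestNum)
}

func main() {
	cache := newObjectCache(2, func(n int) []byte { return []byte{byte(n)} })
	cache.Get(1)
	cache.Get(2)
	cache.Get(3) // evicts object 1, the oldest load
	fmt.Println(len(cache.entries)) // 2
}
```

A real implementation would also need to pin dirty (modified) objects so they are never evicted before being written out.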

Note: Lazy loading is supported in v3 with NewPdfReaderLazy().

gunnsth commented 5 years ago

Presumably most of the memory consumed comes from large stream objects (this would be good to verify against a large corpus). A potential fix would be to change PdfObjectStream to unexport the Stream []byte and instead provide a higher-level StreamAccessor / StreamCache that can return the []byte directly if it is loaded in memory, or load it from the file when needed.

The tricky part might be knowing whether the data is encoded and applying the right encoding. If the stream has been changed (e.g. re-encoded), it needs to stay in memory, or be stored in some temporary storage where it could be re-accessed without staying in memory the entire time (such as boltdb or something similar).

seankungrubrik commented 3 months ago

Is there a solution for writing large PDFs whose content doesn't fit in memory?

ipod4g commented 3 weeks ago

@seankungrubrik

We're currently working on a solution for handling very large PDFs, including features like lazy writing. In the meantime, you can use methods such as NewPdfReaderLazy(), or lazy mode for large images with SetLazy(), to manage large PDFs more efficiently.

If you have any questions or need further assistance, please let us know.

Thanks