loading a large PDF (around 100mb) is very slow

jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.

```python
import typing

from borb.pdf import Document
from borb.pdf import PDF


def main():
    # read the Document
    print("1")
    doc: typing.Optional[Document] = None
    print("2")
    with open("Calculus.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)

    # check whether we have read a Document
    assert doc is not None
    print("finish")


if __name__ == "__main__":
    main()
```

Your PDF is corrupted in some way. There is an online tool to check whether your PDF adheres to the standard. You can find their website here and the online tool here.

Your PDF crashed their tool.

borb seems to spend its time attempting to read the cross-reference table. The cross-reference table (abbreviated to xref) is sort of the general lookup for a PDF. If a page needs something (like a Font or an Image), it can simply say "I need object 38", and the xref will reply with "object 38 starts at byte 39403 in this file".
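To make the lookup idea concrete, here is a minimal sketch of reading a classic (uncompressed) xref section into an object-number-to-offset map. It is an illustration only, not borb's actual parser: the function name `parse_xref_section` and the sample data `EXAMPLE_XREF` are made up for this example, and it assumes a well-formed, text-based xref section rather than an xref stream.

```python
from typing import Dict

# Hypothetical sample: one free entry for object 0, then objects 38 and 39.
EXAMPLE_XREF = """xref
0 1
0000000000 65535 f
38 2
0000039403 00000 n
0000040112 00000 n
trailer
"""


def parse_xref_section(section: str) -> Dict[int, int]:
    """Map object number -> byte offset for the in-use ("n") entries."""
    lines = [ln.strip() for ln in section.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("section does not start with the 'xref' keyword")

    offsets: Dict[int, int] = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: "<first object number> <number of entries>"
        first_obj, count = (int(x) for x in lines[i].split())
        i += 1
        for obj_nr in range(first_obj, first_obj + count):
            # Entry: "<10-digit offset> <5-digit generation> <n|f>"
            offset, _generation, kind = lines[i].split()
            if kind == "n":  # "f" marks a free (unused) entry
                offsets[obj_nr] = int(offset)
            i += 1
    return offsets


offsets = parse_xref_section(EXAMPLE_XREF)
print(offsets[38])  # the byte offset at which object 38 starts
```

A reader that trusts this table jumps straight to the returned offset, which is why a wrong offset or a wrong entry count can send a parser somewhere unexpected.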

In short, the xref is extremely important. And in the case of this document, probably corrupt.

Other PDF tools might handle this in different ways. They might decide to abandon the xref and attempt to rebuild it. And that logic is present in borb as well. But it doesn't get triggered. So it seems like your xref is valid enough not to trigger a rebuild, but also corrupt enough to mess with borb and cause an infinite loop somewhere.
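For illustration, a common way to rebuild an xref is to ignore the table entirely and scan the raw bytes for object headers of the form `<number> <generation> obj`. The sketch below shows that general technique under simplifying assumptions; it is not borb's rebuild logic, and the names `rebuild_xref` and `SAMPLE_PDF_BYTES` are hypothetical. A real implementation would also need to deal with false matches inside binary streams and with objects stored in object streams.

```python
import re
from typing import Dict

# Matches "<object number> <generation> obj" without starting mid-number.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")

# Hypothetical, heavily truncated stand-in for the bytes of a PDF file.
SAMPLE_PDF_BYTES = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)


def rebuild_xref(data: bytes) -> Dict[int, int]:
    """Map object number -> byte offset by scanning for object headers."""
    offsets: Dict[int, int] = {}
    for match in OBJ_HEADER.finditer(data):
        # A later definition overwrites an earlier one, since incremental
        # updates append newer versions of an object at the end of the file.
        offsets[int(match.group(1))] = match.start(1)
    return offsets


rebuilt = rebuild_xref(SAMPLE_PDF_BYTES)
print(rebuilt)
```

The scan is linear in the file size, so for a 100 MB file it is slow but it always terminates, unlike a parser that keeps following bad offsets.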

Either way, I am closing this issue. As the proverb goes, garbage in --> garbage out.

Kind regards, Joris Schellekens

loading a large PDF (around 100mb) is very slow #169