jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

loading a large PDF (around 100mb) is very slow #169

Closed pig123123 closed 1 year ago

pig123123 commented 1 year ago

Describe the bug Loading a large PDF (around 100 MB) is very slow.

To Reproduce I just want to load a PDF and then split it.


import typing
from borb.pdf import Document
from borb.pdf import PDF

def main():

    # read the Document
    print("1")
    doc: typing.Optional[Document] = None
    print("2")
    with open("Calculus.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)

    # check whether we have read a Document
    assert doc is not None

    print("finish")

if __name__ == "__main__":
    main()

Expected behaviour This should be an easy task for borb, but I've waited nearly half an hour and it still hasn't finished loading.

Screenshots Screenshot 2023-07-04, 8:22:49 PM


jorisschellekens commented 1 year ago

Your PDF is corrupted in some way. There is an online tool to check whether your PDF adheres to the standard. You can find their website here and the online tool here.

Your PDF crashed their tool.

borb seems to spend its time attempting to read the cross-reference table. The cross-reference table is the general lookup for a PDF. If a page needs something (like a Font, or an Image) it can simply say "I need object 38", and the cross-reference table (abbreviated to xref) will reply with "object 38 starts at byte 39403 in this file".
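To make the lookup concrete, here is a minimal sketch of reading a classic (non-stream) xref section. It is not borb's parser; the function name `parse_xref_section` and the hand-made `XREF_SECTION` bytes are assumptions for illustration only, and real files can also use compressed xref streams, which this does not handle.

```python
# A hand-made classic xref section (illustrative, not taken from the reported file).
# Each subsection starts with "<first object number> <count>", followed by one
# entry per object: 10-digit byte offset, 5-digit generation, 'n' (in use) or 'f' (free).
XREF_SECTION = (
    b"xref\n"
    b"0 1\n"
    b"0000000000 65535 f \n"
    b"38 1\n"
    b"0000039403 00000 n \n"
)


def parse_xref_section(data: bytes) -> dict[int, int]:
    """Return a mapping of object number -> byte offset for in-use entries."""
    lines = data.split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref section")
    table: dict[int, int] = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()
        if len(header) != 2:
            break  # end of the subsections (e.g. the 'trailer' keyword or EOF)
        first, count = int(header[0]), int(header[1])
        for k in range(count):
            offset, _generation, kind = lines[i + 1 + k].split()
            if kind == b"n":
                table[first + k] = int(offset)
        i += 1 + count
    return table


offsets = parse_xref_section(XREF_SECTION)
print(offsets[38])  # 39403: "object 38 starts at byte 39403"
```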

In short, the xref is extremely important. And in the case of this document, probably corrupt.
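One cheap sanity check, sketched below under assumptions, is to see whether the `startxref` offset near the end of the file actually points at something that looks like an xref. The helper name `startxref_points_at_xref` and the 1024-byte tail window are choices made for this example, not part of borb, and passing this check does not prove the table itself is intact.

```python
import re


def startxref_points_at_xref(data: bytes) -> bool:
    """Check that the last 'startxref' offset lands on an xref table or xref stream."""
    tail = data[-1024:]  # the keyword is expected near the end of the file
    found = re.findall(rb"startxref\s+(\d+)", tail)
    if not found:
        return False
    offset = int(found[-1])  # use the last occurrence
    if offset >= len(data):
        return False
    target = data[offset:offset + 32]
    # Classic table: the 'xref' keyword. Xref stream: an object header like '12 0 obj'.
    return target.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", target) is not None
```

Usage would be something like `startxref_points_at_xref(open("Calculus.pdf", "rb").read())`; a `False` result suggests the file needs repair before any library can load it reliably.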

Other PDF tools might handle this in different ways. They might decide to abandon the xref and attempt to rebuild it. And that logic is present in borb as well. But it doesn't get triggered. So it seems like your xref is valid enough not to trigger a rebuild, but also corrupt enough to mess with borb and cause an infinite loop somewhere.
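For illustration, rebuilding an xref usually means ignoring the stored table and scanning the raw bytes for object headers. The sketch below shows that idea only; it is not borb's rebuild logic, `rebuild_xref` and `SAMPLE` are hypothetical names, and a naive scan like this can be fooled by binary stream data and cannot see objects packed inside object streams.

```python
import re

# Matches object headers such as "38 0 obj" that do not start mid-number.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(data: bytes) -> dict[int, int]:
    """Scan the whole file and map each object number to the offset of its header."""
    table: dict[int, int] = {}
    for match in OBJ_HEADER.finditer(data):
        # If an object is defined more than once, the later definition wins,
        # which mirrors how incremental updates override earlier objects.
        table[int(match.group(1))] = match.start(1)
    return table


# Hand-made fragment, for demonstration only.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)

print(rebuild_xref(SAMPLE))
```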

Either way, I am closing this issue. As the proverb goes, garbage in --> garbage out.

Kind regards, Joris Schellekens