jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in Python.
https://borbpdf.com/

Loading corrupted PDF files is incredibly slow #151

Closed PisonJay closed 1 year ago

PisonJay commented 1 year ago

PDF File: https://in2core.com/qtake-docs-v2.0/pdf/QTAKE_Pro_User_Manual_2_0.pdf

Code to reproduce:

from borb.pdf import PDF
with open('QTAKE_Pro_User_Manual_2_0.pdf','rb') as f:
    doc = PDF.loads(f)
with open('repaired.pdf', 'wb') as f:
    PDF.dumps(f, doc)

Traceback after interrupting with Ctrl-C:

^CTraceback (most recent call last):
  File "/home/user/main.py", line 3, in <module>
    doc = PDF.loads(f)
  File "/home/user/.local/lib/python3.10/site-packages/borb/pdf/pdf.py", line 54, in loads
    document: Document = ReadAnyObjectTransformer().transform(
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/reference/xref_transformer.py", line 92, in transform
    self._read_xref(context)
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/reference/xref_transformer.py", line 294, in _read_xref
    self._read_xref(context, initial_offset=prev)
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/reference/xref_transformer.py", line 294, in _read_xref
    self._read_xref(context, initial_offset=prev)
  File "/home/user/.local/lib/python3.10/site-packages/borb/io/read/reference/xref_transformer.py", line 271, in _read_xref
    most_recent_xref.read(src, tok)
  File "/home/user/.local/lib/python3.10/site-packages/borb/pdf/xref/rebuilt_xref.py", line 142, in read
    bytes_in_pdf[i] == 116  # 't'
KeyboardInterrupt

It seems borb needs to speed up the xref rebuilding process.
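
To illustrate the kind of speed-up meant here: the traceback stops on a per-byte comparison (`bytes_in_pdf[i] == 116`), which suggests the rebuild walks the file one byte at a time in Python. The sketch below is not borb's code. It is a hypothetical alternative that lets the `re` module and `bytes.rfind` do the scanning; the names `OBJ_HEADER`, `rebuild_xref_offsets` and `find_last_trailer` are made up for illustration.

```python
import re

# Matches an indirect object header such as b"12 0 obj".
# The lookbehind keeps a match from starting in the middle of a longer number.
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")


def rebuild_xref_offsets(pdf_bytes: bytes) -> dict:
    """Map (object number, generation number) to the byte offset of its header.

    Later definitions overwrite earlier ones, since incremental updates
    append newer versions of an object towards the end of the file.
    """
    offsets = {}
    for match in OBJ_HEADER.finditer(pdf_bytes):
        key = (int(match.group(1)), int(match.group(2)))
        offsets[key] = match.start()
    return offsets


def find_last_trailer(pdf_bytes: bytes) -> int:
    """Return the offset of the last b"trailer" keyword, or -1 if absent."""
    return pdf_bytes.rfind(b"trailer")
```

A real implementation would also have to cope with byte sequences inside binary streams that merely look like object headers, so this should be read as a starting point for discussion only.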

jorisschellekens commented 1 year ago

I already have a test related to timing/speed in the tests folder.

I am aware that borb is not always the fastest at reading a PDF. I can imagine the problem gets compounded when the PDF is also corrupt.

Do you have any concrete solutions? You can time the execution with a profiler and check where borb spends most of its time.

I look forward to hearing from you.

Kind regards, Joris Schellekens
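
For anyone following up on the profiling suggestion above, one possible way to do it with the standard library is sketched here. `profile_call` and `profile_borb_load` are hypothetical helper names, not part of borb; the only borb call used is `PDF.loads`, taken from the report above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def profile_borb_load(path):
    """Profile PDF.loads on the file at `path` (requires borb to be installed)."""
    from borb.pdf import PDF  # imported lazily so profile_call works without borb

    with open(path, "rb") as f:
        return profile_call(PDF.loads, f)
```

Calling `profile_borb_load('QTAKE_Pro_User_Manual_2_0.pdf')` and printing the second element of the returned tuple should show which functions account for most of the time. Running `python -m cProfile -s cumulative main.py` from a shell gives a similar report without modifying the script.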

jorisschellekens commented 1 year ago

I haven't heard from you in 2 weeks. I am closing this issue.

Kind regards, Joris Schellekens