py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Is there a way to only load the PDF metadata, instead of the full document? #1643

Open hf-kklein opened 1 year ago

hf-kklein commented 1 year ago

Explanation

I'm using pypdf to extract metadata from PDF documents. For my use case I'm only interested in the document's metadata, not the content itself.

Code Example

import io
from pypdf import PdfReader
# bytes_of_pdf_file = requests.get("https://my_url.inv/link/of/pdf_file.pdf").content
reader = PdfReader(io.BytesIO(bytes_of_pdf_file)) # this takes a while for large documents
metadata = reader.metadata # <-- only interested in the metadata, don't care about the rest

Performance profiling of my application shows that it spends significant time reading the PDF documents, which may be several hundred pages long. That seems reasonable, but in my case I'm not interested in the pages of the document at all. I wonder if there is a shortcut to access the metadata without reading the entire document, for example something like this:

import io
from pypdf import PdfReader
# lazy_loading is the requested (hypothetical) parameter, not an existing pypdf argument
reader = PdfReader(io.BytesIO(bytes_of_pdf_file), lazy_loading=True) # initializes the reader but doesn't read the entire file yet -> super fast
metadata = reader.metadata

MartinThoma commented 1 year ago

When the PdfReader is initialized, we do this:

        if isinstance(stream, (str, Path)):
            with open(stream, "rb") as fh:
                stream = BytesIO(fh.read())
        self.read(stream)
        self.stream = stream

For getting the metadata, we need:

  1. The trailer
  2. To get the trailer, we need the xref table
  3. To get the xref table, we should only need to read the last 3 lines of the PDF document. As we chose a buffer size of 8 kB, that would be the last 8 kB of the file.
  4. Then we need to read whatever it takes to reach the metadata.
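
For illustration, steps 1 to 3 could be sketched with the standard library alone. This is a minimal sketch and not pypdf's implementation: `find_startxref` and `TAIL_SIZE` are hypothetical names, and it assumes the file ends with the usual `startxref <offset> %%EOF` marker somewhere in the last 8 kB.

```python
import io
import re

TAIL_SIZE = 8 * 1024  # mirrors the 8 kB buffer mentioned above (an assumption of this sketch)


def find_startxref(stream, tail_size=TAIL_SIZE):
    """Return the byte offset written after the last ``startxref`` keyword."""
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    # Read only the tail of the file, not the whole document.
    stream.seek(max(0, file_size - tail_size))
    tail = stream.read()
    # The last match wins, because incrementally updated files
    # can contain several startxref sections.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF marker found in the file tail")
    return int(matches[-1])
```

The returned offset only points at the xref section. Reaching the metadata (step 4) still means seeking to that offset, reading the trailer, and following its references, which is where the jumping around comes from.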

So I guess we will still jump around wildly in the file, simply because of the PDF structure. That essentially means we need to read the whole document if we want to get the metadata.

However, it should be possible to read the metadata without parsing the whole document.

It's possible and rather easy if you write code that does only that. Within the remaining functions of pypdf it's not so straightforward.

pubpub-zz commented 1 year ago

@hf-kklein To add to @MartinThoma's answer: I did some checks on Windows using Sysinternals ProcMon, and I can confirm that the whole document is not read (there are just some jumps). I think what you may be talking about is Linearized PDFs (not all PDFs are linearized). However, the metadata is not reported to be part of that priority data, and reading it may need a big buffer. Can you confirm my analysis?
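
A similar check can be approximated without ProcMon by wrapping the in-memory stream and counting accesses. The sketch below is only an illustration: `CountingStream` is a hypothetical helper, not part of pypdf, and it counts only direct `read()` and `seek()` calls (reads made through `readline()` or similar methods are not counted).

```python
import io


class CountingStream(io.BytesIO):
    """BytesIO that records how many bytes read() returned and how often seek() was called."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0
        self.seek_calls = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk

    def seek(self, offset, whence=io.SEEK_SET):
        self.seek_calls += 1
        return super().seek(offset, whence)
```

One could pass `CountingStream(bytes_of_pdf_file)` to `PdfReader`, access `reader.metadata`, and then compare `bytes_read` with the file size. Note that the counter adds up repeated reads of the same region, so it measures read traffic rather than the number of distinct bytes touched.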