Open hf-kklein opened 1 year ago
When the PdfReader is initialized, we do this:
    if isinstance(stream, (str, Path)):
        with open(stream, "rb") as fh:
            stream = BytesIO(fh.read())
    self.read(stream)
    self.stream = stream
For getting the metadata, we need:
So I guess we will still jump around wildly in the file, simply because of the PDF structure. That essentially means we need to read the whole document if we want to get the metadata.
However, it should be possible to read the metadata without parsing the whole document.
It's possible and rather easy if you write just that code. Within the remaining functions of pypdf it's not so straightforward.
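As a rough illustration of what "just that code" could start with, here is a minimal sketch of the first step: locating the cross-reference section by reading only the tail of the file. The function name `find_startxref` and the 1024-byte `tail_size` default are assumptions made for this example, not part of pypdf. It relies only on the fact that a PDF ends with a `startxref` keyword followed by a byte offset and `%%EOF`. Resolving the trailer and the /Info dictionary from that offset is not shown, and files that use cross-reference streams would need more work.

```python
import io
import re


def find_startxref(stream, tail_size=1024):
    """Return the byte offset that follows the last 'startxref' keyword.

    Hypothetical helper: only the last `tail_size` bytes of the stream
    are read, so the page content is never touched.
    """
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()

    # Jump close to the end of the file and read only the tail.
    stream.seek(max(0, file_size - tail_size))
    tail = stream.read()

    # Use the last occurrence, since incrementally updated files
    # can contain several startxref sections.
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword in the last %d bytes" % tail_size)

    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))
```

The returned offset is where a reader would seek next to find the cross-reference data and, from there, the trailer that points to the metadata.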
@hf-kklein to complement @MartinThoma's answer: I did some checks on Windows using Sysinternals ProcMon, and I can confirm that the whole document is not read (just some jumps). I think what you may be talking about is linearized PDFs (not all PDFs are linearized). However, metadata is not reported to be part of that priority data, and it may need a big buffer. Can you confirm my analysis?
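For anyone who wants to repeat this kind of check without ProcMon, a small wrapper around a binary stream can count the bytes that are actually requested. This is a minimal sketch: the class name `CountingReader` and its attributes are made up for this example, and whether a given parser accepts the wrapper depends on which stream methods it calls.

```python
import io


class CountingReader:
    """Hypothetical wrapper that records how much of a stream is read."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0   # total bytes returned by read()
        self.seek_count = 0   # number of seek() calls ("jumps")

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        self.seek_count += 1
        return self._raw.seek(offset, whence)

    def __getattr__(self, name):
        # Delegate everything else (tell, close, ...) to the wrapped stream.
        return getattr(self._raw, name)
```

Comparing `bytes_read` with the file size after the parser has run gives a rough picture of how much was read. Bytes that are read twice are counted twice, and reads done through delegated methods such as `readline` are not counted, so the number is only an estimate.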
Explanation
I'm using PyPDF2 to extract metadata from PDF documents. For my use case I'm only interested in the document's metadata, not the content itself.
Code Example
Performance profiling of my application shows that it spends significant time reading the PDF documents, which might be several hundred pages long. That seems reasonable in general, but in my case I'm not interested in the pages of the document at all. I wonder if there is any shortcut to access the metadata without reading the entire document.