Question / Comment: Metadata extraction

andrei-volkau commented 3 years ago

I was able to pull PDF metadata using Tika-Python.

import tika

tika.initVM()
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])

Question 1. I am wondering whether it is possible to pull PDF metadata using PyMuPDF as well.

The extracted metadata contains information about the author, the title, etc. Example 1: PDF Document ___ Extracted metadata

Example 2: PDF Document ___ Extracted metadata

Question 2. Where can I read more about the kinds of fields that might be included in a PDF's metadata? Specifically, I am interested in whether the author and title fields are always included. I tested just 2 docs; in both, the author and title fields were present and had correct values. Does that mean the author and title can always be extracted in the case of textbooks/papers?

Thank you in advance for any info!

JorjMcKie commented 3 years ago

Metadata are readily accessible in PyMuPDF - not only for PDF but for all supported document types. Independently from the doc type, these metadata are all presented via the same dictionary:

>>> import fitz  # PyMuPDF
>>> from pprint import pprint
>>> doc1 = fitz.open("Design+challenges+and+misconceptions+in+named+entity+recognition.pdf")
>>> pprint(doc1.metadata)
{'author': 'Lev Ratinov ; Dan Roth',
 'creationDate': "D:20090514230638-06'00'",
 'creator': 'LaTeX with hyperref package',
 'encryption': None,
 'format': 'PDF 1.4',
 'keywords': '',
 'modDate': "D:20090514230638-06'00'",
 'producer': 'pdfTeX-1.40.9',
 'subject': 'CoNLL 2009',
 'title': 'Design Challenges and Misconceptions in Named Entity Recognition',
 'trapped': ''}
>>> doc2=fitz.open("Michael+J.+Sandel+-+Justice_+What's+the+Right+Thing+to+Do_+(2009,+Allen+Lane)+-+libgen.lc.pdf")
>>> pprint(doc2.metadata)
{'author': 'Michael Sandel',
 'creationDate': "D:20200825105425+00'00'",
 'creator': '',
 'encryption': None,
 'format': 'PDF 1.4',
 'keywords': '',
 'modDate': "D:20200825125426+02'00'",
 'producer': '',
 'subject': '',
 'title': 'Justice',
 'trapped': ''}
>>>
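
The creationDate and modDate values above are PDF date strings of the form D:YYYYMMDDHHmmSS followed by a UTC offset. As a minimal sketch, assuming only the standard library, such a string could be turned into a datetime as follows. The helper name parse_pdf_date is hypothetical and not part of PyMuPDF; it returns None for anything that does not look like a PDF date.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches strings such as "D:20090514230638-06'00'".
# Everything after the four-digit year is treated as optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it cannot be parsed.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        tzinfo = None
        if sign in ("Z", "z"):
            tzinfo = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            tzinfo = timezone(-offset if sign == "-" else offset)
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range parts such as month 13 or an offset of 99 hours.
        return None
```

For the first document above, parse_pdf_date(doc1.metadata["creationDate"]).isoformat() would give '2009-05-14T23:06:38-06:00'. An empty string, which is what a missing date comes back as, yields None.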

The dict keys include every info type defined in PDF spec section "10.2.1 Document Information Dictionary" on page 843 in https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf. Since the latest version, I am returning empty strings for missing information (except for encryption). With a few exceptions, these fields can be filled with new information and stored back to the PDF. There is no guarantee or check that they will be present, or that they contain meaningful information when they are not empty. I have seen date fields filled with garbage.

As is typical for an old format, PDF has introduced alternatives to several of its original concepts. So there is now also the option to store XML-formatted metadata, not only at the document level but also for other objects like pages. I support extraction and storage of document-level (only) XML metadata, treating it as a simple string both ways. Because of my policy of not introducing dependencies on outside packages, there is nothing in PyMuPDF to interpret or modify extracted XML data, and nothing prevents storing just anything, even something that is not valid XML. But you can certainly use e.g. lxml to interpret and modify the data and then store back the result.
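
To illustrate interpreting the XML metadata string outside of PyMuPDF, here is a rough sketch that pulls Dublin Core values (such as title or creator) out of an XMP packet. It uses the standard library's xml.etree.ElementTree instead of lxml, purely to stay dependency-free. The function name xmp_dc_values is hypothetical, and the sketch assumes the common XMP layout in which dc:* elements wrap their values in rdf:li items. Because the stored string is not guaranteed to be XML at all, unparsable input gives an empty list.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used in XMP packets (Dublin Core inside RDF).
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def xmp_dc_values(xml_text, field):
    """Return all Dublin Core values for `field`, e.g. "title" or "creator".

    Hypothetical helper for illustration; not a PyMuPDF function.
    Returns an empty list if the text is empty or not well-formed XML.
    """
    if not xml_text or not xml_text.strip():
        return []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    values = []
    # Find every <dc:field> element and collect its non-empty <rdf:li> items.
    for element in root.iter("{%s}%s" % (DC_NS, field)):
        for item in element.iter("{%s}li" % RDF_NS):
            if item.text and item.text.strip():
                values.append(item.text.strip())
    return values
```

In recent PyMuPDF versions the input string should be obtainable via doc.get_xml_metadata(); treat that call name as an assumption to verify against the version you have installed.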

Other stuff I have seen from the TIKA output is general info on the PDF - not strictly metadata:

permission: a bitfield stored as an integer, doc.permissions, which encodes the document's access permissions.
encryption: full support
characters per page: obviously easy to determine via page.get_text() (page.getText() in older versions).
optional content: exists if doc.get_ocgs() returns a non-empty dict, which then contains details about each optional content group.
marked content: currently not supported
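
As a sketch of how the permission bitfield mentioned in the list above could be unpacked: the bit values below follow the user access permission flags in the PDF specification (bits 3 to 6 and 9 to 12). The names PERMISSION_BITS and decode_permissions are hypothetical; PyMuPDF is assumed to expose equivalent constants of its own, which would be preferable in real code.

```python
# Bit values of the PDF permission flags (PDF spec, standard security handler).
# The dictionary keys are illustrative labels, not PyMuPDF identifiers.
PERMISSION_BITS = {
    "print": 1 << 2,                # bit 3
    "modify": 1 << 3,               # bit 4
    "copy": 1 << 4,                 # bit 5
    "annotate": 1 << 5,             # bit 6
    "fill_forms": 1 << 8,           # bit 9
    "accessibility": 1 << 9,        # bit 10
    "assemble": 1 << 10,            # bit 11
    "print_high_quality": 1 << 11,  # bit 12
}


def decode_permissions(flags):
    """Map each permission label to True/False for the integer `flags`.

    Negative integers work too, because Python applies `&` to them
    as if they were two's-complement numbers of unlimited width.
    """
    return {name: bool(flags & bit) for name, bit in PERMISSION_BITS.items()}
```

For example, decode_permissions(doc.permissions) would give a dict of booleans; a value of -4 means every permission is granted.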

Other interesting document-level info might be whether JavaScript is used at all, or whether there are document-level embedded files. The latter is fully supported. The presence of JavaScript can be detected by scanning the (compressed) source of all PDF xrefs and checking whether any of them contains the string "/Type/JavaScript".
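
The xref scan described above could look roughly like the following. This is a sketch, not a tested detector: the function names are hypothetical, the marker string is simply the one quoted above, and the PyMuPDF calls doc.xref_length() and doc.xref_object(xref, compressed=True) are assumed to be available as in recent versions. The matching logic is kept separate from the PDF access so it can be tried on plain strings.

```python
def contains_marker(sources, marker="/Type/JavaScript"):
    """Return True if any of the object source strings contains `marker`."""
    return any(marker in source for source in sources if source)


def pdf_object_sources(path):
    """Yield the compressed source of every object in the PDF at `path`.

    Assumes a recent PyMuPDF; older releases used different method names.
    """
    import fitz  # PyMuPDF, imported here so contains_marker() works without it

    doc = fitz.open(path)
    try:
        # xref 0 is reserved, so valid object numbers start at 1.
        for xref in range(1, doc.xref_length()):
            yield doc.xref_object(xref, compressed=True)
    finally:
        doc.close()
```

A call such as contains_marker(pdf_object_sources("sample.pdf")) would then report whether any object source contains the marker.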

andrei-volkau commented 3 years ago

Hi @JorjMcKie , many thanks for the detailed comments! I am closing the question.

Question / Comment: Metadata extraction #738