py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Fails to convert date to date object if not in correct ISO format #2908

Closed jojo2357 closed 1 month ago

jojo2357 commented 1 month ago

Environment

$ python -m platform
Linux-6.9.3-76060903-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.1, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('main.pdf')
reader.metadata.creation_date

The original PDF in which I discovered this stupid format contains sensitive information and is large, so I re-created it with the LaTeX below instead. I verified that the data is exactly the same between the original problem file and the generated one.

\documentclass[11pt]{article}
\pdfinfo{
   /Author (jojo2357)
   /Title  (borked)
   /CreationDate (10/2/2024 01:48:09)
}
\begin{document}
    Hello
\end{document}

main.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_doc_common.py", line 212, in creation_date
    return parse_iso8824_date(self._get_text(DI.CREATION_DATE))
  File "/usr/local/lib/python3.10/dist-packages/pypdf/_utils.py", line 105, in parse_iso8824_date
    raise ValueError(f"Can not convert date: {orgtext}")
ValueError: Can not convert date: 10/2/2024 01:48:09

Thoughts

While this date is obviously not in the correct format, it would be nice if pypdf tried other formats automatically, just in case someone sends me a borked PDF.

stefan6419846 commented 1 month ago

Thanks for your report. According to section 7.9.4 of the PDF 2.0 specification, your datetime format does not follow the standard (ISO 8824-1).
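For comparison, a conforming PDF date string has the general shape D:YYYYMMDDHHmmSSOHH'mm, where everything after the year is optional. The check below is a minimal sketch of that layout, not pypdf's own parser; the name looks_like_pdf_date and the simplified regular expression are assumptions made for illustration.

```python
import re

# Simplified approximation of the PDF date layout (an assumption for
# illustration, not pypdf's internal pattern): "D:" + four-digit year,
# up to five optional two-digit fields, and an optional UTC offset.
PDF_DATE = re.compile(
    r"^D:\d{4}"                      # prefix and four-digit year
    r"(\d{2}){0,5}"                  # optional MM, DD, HH, mm, SS
    r"(Z|[+-]\d{2}'?(\d{2}'?)?)?$"   # optional offset, apostrophes tolerated
)


def looks_like_pdf_date(text: str) -> bool:
    """Return True if text roughly matches the PDF date layout."""
    return PDF_DATE.match(text) is not None
```

Under this check, a string such as D:20241002014809 passes, while the reported 10/2/2024 01:48:09 does not.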

I am not sure whether providing support for any sort of date format really makes sense - there are tons of variations.

As a first step, you might want to get in touch with the author/creator of the PDF file to inform them of the standard violation. Otherwise, you should still be able to implement your own logic based upon creation_date_raw if the exception is raised.

jojo2357 commented 1 month ago

Perhaps adding a parameter like "fallback datetime formats" would work? That way I don't need to completely re-implement the whole method if I know I might have a bad format.

In the meantime I did just copy the source and am using the raw with my extra format.
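The "fallback datetime formats" idea can be approximated in user code without touching pypdf. The sketch below is hypothetical: parse_with_fallbacks and FALLBACK_FORMATS are illustrative names, not part of pypdf, and the month-first reading of 10/2/2024 is an assumption about the producer of the file.

```python
from datetime import datetime
from typing import Iterable, Optional

# Candidate formats to try in order. Month-first is an assumption here;
# the raw value "10/2/2024" could equally be read day-first.
FALLBACK_FORMATS = ("%m/%d/%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S")


def parse_with_fallbacks(raw: str, formats: Iterable[str]) -> Optional[datetime]:
    """Try each strptime format in turn; return None if none of them match."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None
```

The raw string would come from reader.metadata.creation_date_raw, so only the list of formats needs to be maintained.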

pubpub-zz commented 1 month ago

You could just use dateutil: after import dateutil.parser, call dateutil.parser.parse(reader.metadata.creation_date_raw)

I'm personally not inclined to add a dependency on this library to cope with invalid formats. For the same reason, adding a format parameter does not seem like a good idea, as you already have an easy solution for your issue.
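A short sketch of the dateutil route, assuming the third-party python-dateutil package is installed. The raw string is hard-coded here in place of reader.metadata.creation_date_raw. Note that a value like 10/2/2024 is ambiguous: dateutil reads it month-first by default, and the dayfirst parameter switches the interpretation.

```python
from dateutil import parser

# Stand-in for reader.metadata.creation_date_raw from the report above.
raw = "10/2/2024 01:48:09"

# Default interpretation: month first (October 2).
month_first = parser.parse(raw)

# Alternative interpretation: day first (February 10).
day_first = parser.parse(raw, dayfirst=True)

print(month_first.isoformat())
print(day_first.isoformat())
```

Which reading is correct depends on the software that wrote the file, so the choice has to be made by the caller.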

stefan6419846 commented 1 month ago

> In the meantime I did just copy the source and am using the raw with my extra format.

Catching the exception is still a valid approach which does not require copying the whole function. As the function is internal, simply adding a fallback datetime format does not really work.
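Catching the exception might look like the sketch below. It relies only on the creation_date and creation_date_raw properties mentioned in this thread; the helper name creation_date_or_fallback and the default month-first format are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional


def creation_date_or_fallback(
    metadata, fallback_format: str = "%m/%d/%Y %H:%M:%S"
) -> Optional[datetime]:
    """Return metadata.creation_date, or parse the raw string on failure.

    metadata is any object exposing creation_date and creation_date_raw,
    such as reader.metadata. The default format is an assumption that
    matches the value reported in this issue, read month-first.
    """
    try:
        return metadata.creation_date
    except ValueError:
        raw = metadata.creation_date_raw
        if raw is None:
            return None
        # Raises ValueError again if the raw value matches neither form.
        return datetime.strptime(raw, fallback_format)
```

With a real file this would be called as creation_date_or_fallback(PdfReader('main.pdf').metadata).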

As already mentioned, the format violates the specification to quite some extent and simple workarounds are already possible here, thus I am going to close this issue as not planned.