Couldn't extract title from a PDF with first page image

metebalci / pdftitle

a utility to extract the title from a PDF file

GNU General Public License v3.0

Couldn't extract title from a PDF with first page image #26

Open dufferzafar opened 2 years ago

dufferzafar commented 2 years ago

❯ pdftitle -p .\Downloads\test.pdf

Traceback (most recent call last):
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 476, in get_title_from_io
    dev.recover_last_paragraph()
  File "C:\Users\duffe\.local\pipx\venvs\pdftitle\lib\site-packages\pdftitle.py", line 341, in recover_last_paragraph
    raise Exception("current block is None, this might be a bug. " +
Exception: current block is None, this might be a bug. please report it together with the pdf file

# Using pdfminer's pdf2txt
➜ pdf2txt .\Downloads\test.pdf

C++/CLI in Action

# Using poppler/xpdf's pdftotext
➜ pdftotext .\Downloads\test.pdf -

C++/CLI in Action

Here is the file: test.pdf

metebalci commented 2 years ago

You have to use the --page-number argument. pdftitle does not scan the whole file; it checks a single page only (the first page by default).

$ pdftitle -p test.pdf --page-number 2
C++/CLI in Action

dufferzafar commented 2 years ago

@metebalci It can't be known beforehand which PDFs will have the title on the first page.

Don't you think a better option would be to specify the last page that is checked? By default --last-page-number would be 1, so only the first page would be checked. But I could set --last-page-number to something like 2 or 3, so that the title would be looked for in the first 2 or 3 pages.

BTW, I use pdftitle in a script that renames PDFs with their titles: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
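Until such an option exists, a wrapper script could approximate it by calling pdftitle once per page. The following is only a sketch: it relies on the -p and --page-number flags shown above, and it assumes that a failed extraction yields a non-zero exit code or empty standard output, which has not been verified against pdftitle itself. The names run_pdftitle and first_title are hypothetical helpers, not part of pdftitle.

```python
import subprocess
from typing import Callable, Optional


def run_pdftitle(pdf_path: str, page_number: int) -> Optional[str]:
    """Run `pdftitle -p <pdf> --page-number <n>` and return the title, or None.

    Assumption: a failed extraction gives a non-zero exit code or prints
    nothing to stdout. Both cases are treated as "no title on this page".
    """
    try:
        result = subprocess.run(
            ["pdftitle", "-p", pdf_path, "--page-number", str(page_number)],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # pdftitle is not installed or not on PATH.
        return None
    title = result.stdout.strip()
    if result.returncode != 0 or not title:
        return None
    return title


def first_title(
    pdf_path: str,
    last_page: int,
    extract: Callable[[str, int], Optional[str]] = run_pdftitle,
) -> Optional[str]:
    """Try pages 1..last_page in order and return the first title found."""
    for page in range(1, last_page + 1):
        title = extract(pdf_path, page)
        if title:
            return title
    return None
```

A rename script could then call first_title(path, 3) and skip the file when the result is None. The extract parameter exists only so the loop can be exercised without pdftitle installed.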

metebalci commented 2 years ago

For an ultimate tool that extracts a title from anywhere in a PDF file, this would be correct, but I think it is pretty difficult to do with traditional methods (I mean without using something smarter, from gestalt theory etc.). The main purpose of the tool is to extract titles of (peer-reviewed) articles, which do not have a cover page and usually have a simple layout. On the other hand, I am not 100% sure, but what you suggest might not be difficult to implement and it might have some use. So I am reopening the issue and will look at it the next time I do some implementation work. The changes could be:

deprecate but do not remove --page-number, defaults to 1
introduce --first-page-number, defaults to --page-number
introduce --last-page-number (inclusive), defaults to --first-page-number. If --last-page-number is set and the document has fewer pages than that, I guess it makes sense to stop silently at the end of the document.
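The defaulting chain in this proposal can be sketched with argparse. This is not pdftitle's actual code: the three page options are the ones proposed above, the --pdf long name for -p is an assumption, and build_parser and resolve_page_range are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "pdftitle-sketch" is a placeholder program name for this illustration.
    parser = argparse.ArgumentParser(prog="pdftitle-sketch")
    parser.add_argument("-p", "--pdf", required=True)
    # Deprecated but still accepted; defaults to 1.
    parser.add_argument("--page-number", type=int, default=1)
    # None means "not given", so the fallback can be resolved afterwards.
    parser.add_argument("--first-page-number", type=int, default=None)
    parser.add_argument("--last-page-number", type=int, default=None)
    return parser


def resolve_page_range(args: argparse.Namespace, total_pages: int) -> range:
    """Return the 1-based pages to check, following the proposed defaults."""
    first = args.first_page_number
    if first is None:
        first = args.page_number  # --first-page-number defaults to --page-number
    last = args.last_page_number
    if last is None:
        last = first  # --last-page-number defaults to --first-page-number
    # Stop silently at the end of the document instead of raising an error.
    last = min(last, total_pages)
    return range(first, last + 1)
```

With these defaults, giving no page options checks only page 1, so existing invocations behave as before. If the first page lies beyond the end of the document, the resulting range is empty and nothing is checked.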