metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

"TypeError: 'NoneType' object is not subscriptable" almost every time I make inference #21

Closed nicolas-gervais closed 3 years ago

nicolas-gervais commented 3 years ago
Traceback (most recent call last):
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 669, in run
    title = get_title_from_file(args.pdf)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 557, in get_title_from_file
    return get_title_from_io(raw_file)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 452, in get_title_from_io
    dev.recover_last_paragraph()
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 340, in recover_last_paragraph
    if len(self.current_block[4]) > 0:
TypeError: 'NoneType' object is not subscriptable

metebalci commented 3 years ago

Can you share how you are using it and with which PDF file?

nicolas-gervais commented 3 years ago

I found that if one of the pages doesn't contain text, this occurs. I'm able to avoid these documents like this:

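import pdftotext


def contains_text(filename):
    """Return True only if every page of the PDF yields some text."""
    with open(filename, "rb") as f:
        pdf = pdftotext.PDF(f)
    # A page whose extracted text is only a form feed ('\x0c') is treated as empty.
    return all(text != '\x0c' for text in pdf)

metebalci commented 3 years ago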

I cannot reproduce the error. I checked both an empty PDF and a PDF with an empty page between non-empty pages. If you can share the PDF, I can check that one. Maybe there is a different problem.

seamustuohy commented 3 years ago

Here are some example files which fail. From some very cursory debugging, it looks like the error was introduced when the eliot algorithm was added. TextOnlyDevices had some changes made to it that fail when certain assumptions are not met. Specifically, the files below are constructed in a manner where process_string never gets run while the PDF is being parsed. This in turn means that draw_cid never sets self.current_block. That leads recover_last_paragraph to fail when it tries to read the fifth item (index 4) from self.current_block, which is still None.
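
The failure mode described above can be sketched in isolation. This is only an illustration, not pdftitle's actual code: the class name BlockTracker, the five-item list layout, and the early-return guard are all assumptions made for the example. It shows how a block that stays None until the first string is processed can be checked before it is indexed.

```python
class BlockTracker:
    """Hypothetical stand-in for a text-collecting device (not pdftitle's code)."""

    def __init__(self):
        # Stays None until the first string is processed.
        self.current_block = None
        self.blocks = []

    def process_string(self, text):
        # Assumed layout: index 4 holds the accumulated text of the block.
        if self.current_block is None:
            self.current_block = [0.0, 0.0, 0.0, 0.0, ""]
        self.current_block[4] += text

    def recover_last_paragraph(self):
        # Guard: if process_string never ran, current_block is still None,
        # so there is nothing to recover and nothing to index.
        if self.current_block is None:
            return
        if len(self.current_block[4]) > 0:
            self.blocks.append(self.current_block)
            self.current_block = None


# A document that never triggers process_string no longer raises TypeError.
empty = BlockTracker()
empty.recover_last_paragraph()

# A document with text is collected as before.
filled = BlockTracker()
filled.process_string("A Title")
filled.recover_last_paragraph()
```

Whether returning silently is the right behavior, as opposed to raising a clearer error, depends on how the caller treats a PDF with no extractable text.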

SYSTEM V - application binary interface.pdf

Taking_The_Pulse_Of_Hacking-A_Risk_Basis_For_Security_Research_2018.pdf

anti-reverse-engineering-linux.pdf

metebalci commented 3 years ago

Thanks for the pdfs. There are different issues with each of these.

metebalci commented 3 years ago

This is a note to myself: I will try to improve the error logging a bit in the next version, so the messages are a little more human-friendly when errors happen.

seamustuohy commented 3 years ago

I have been processing hundreds of PDF files recently and have come across a large number of these. If you would like, I can provide more. There is also a range of Unicode and other CID encoding errors that I have not reported, since they seem to come from the underlying PDFMiner library. But I'd be happy to share a range of PDFs that cause the library to fail in different ways if you would like.

metebalci commented 3 years ago

Thanks, I will get in touch if I need more examples.

Meanwhile, I will create two separate issues for the cases above and close this issue, as it was originally opened for empty PDFs, which I could not reproduce.

metebalci commented 3 years ago

@seamustuohy update about the files you mentioned: