"TypeError: 'NoneType' object is not subscriptable" almost every time I make inference

nicolas-gervais commented 3 years ago

Traceback (most recent call last):
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 669, in run
    title = get_title_from_file(args.pdf)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 557, in get_title_from_file
    return get_title_from_io(raw_file)
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 452, in get_title_from_io
    dev.recover_last_paragraph()
  File "c:\users\ngervais\anaconda3\envs\montrium\lib\site-packages\pdftitle.py", line 340, in recover_last_paragraph
    if len(self.current_block[4]) > 0:
TypeError: 'NoneType' object is not subscriptable

metebalci commented 3 years ago

Can you share how you are using it, and with which PDF file?

nicolas-gervais commented 3 years ago

I found that if one of the pages doesn't contain text, this occurs. I'm able to avoid these documents like this:

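import pdftotext


def contains_text(filename):
    """Return True only if every page of the PDF yields some text."""
    with open(filename, "rb") as f:
        pdf = pdftotext.PDF(f)
    # A page whose extracted text is just a form feed ('\x0c') is treated as empty.
    return all(text != '\x0c' for text in pdf)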

metebalci commented 3 years ago

I cannot reproduce the error. I checked both an empty PDF and a PDF with an empty page between non-empty pages. If you can share the PDF, I can check that one. Maybe there is a different problem.

seamustuohy commented 3 years ago

Here are some example files which fail. From some very cursory debugging it looks like the error was introduced when the eliot algorithm was added. TextOnlyDevices had some changes made to it that fail when certain assumptions are not met. Specifically, the below files are constructed in a manner where process_string never gets run when the PDF is being parsed. This in turn means that draw_cid never sets self.current_block. That leads recover_last_paragraph to fail when it tries to pull the fifth item from self.current_block, which is still set as None.

SYSTEM V - application binary interface.pdf

Taking_The_Pulse_Of_Hacking-A_Risk_Basis_For_Security_Research_2018.pdf

anti-reverse-engineering-linux.pdf
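
To make the failure mode above concrete, here is a minimal sketch of a guard in recover_last_paragraph. SketchDevice is a hypothetical stand-in, not pdftitle's actual class, and the five-item block layout with the text at index 4 is an assumption based only on the traceback. It shows one possible way to tolerate a page where draw_cid never ran; it is not the fix that was shipped.

```python
class SketchDevice:
    """Hypothetical stand-in for the text device; not pdftitle's real code."""

    def __init__(self):
        # Stays None until the first character is drawn.
        self.current_block = None
        self.blocks = []

    def draw_cid(self, text):
        # Assumed layout: four placeholder fields, then the text at index 4.
        if self.current_block is None:
            self.current_block = [0, 0, 0, 0, ""]
        self.current_block[4] += text

    def recover_last_paragraph(self):
        # Guard: no character was ever drawn, so there is nothing to recover.
        if self.current_block is None:
            return
        if len(self.current_block[4]) > 0:
            self.blocks.append(self.current_block)
            self.current_block = None
```

With the guard in place, a page that produces no characters leaves blocks empty instead of raising the TypeError.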

metebalci commented 3 years ago

Thanks for the pdfs. There are different issues with each of these.

anti-reverse-engineering-linux contains a single XObject (a single Do operator). My understanding is that an XObject can be many things; it is like an embedded PDF inside this PDF. If the embedded XObject is a normal PDF like the others, it might be possible to extract the title from it, but it is not yet clear to me how to work with these. I will check, but it might take some time to find out whether this can be supported.
Taking_The.. contains 14 XObjects (so 14 Do operators) and also some other operators (probably unrelated to pdftitle). I first thought each page might be a separate XObject, but there are more than 14 pages in the document. So this is similar to the issue above, but a little different, I think.
SYSTEM V... looks like a regular PDF file which pdftitle should support. There is something strange (in the sense that I haven't seen it before) in the text transformation or state in this file, so none of the characters on the first page are taken into account, which causes the error you mention (no current_block). I am checking this, but I need to remember or understand the text transformation again, so I am not sure how easy it will be to fix.

metebalci commented 3 years ago

This is a note to myself: I will try to improve the error logging a bit in the next version, so that the messages are a little more human-friendly when errors happen.

seamustuohy commented 3 years ago

I have been processing hundreds of PDF files recently and have come across a large number of these. If you would like, I can provide more. There are also a range of unicode and other cid encoding errors I've come across that I've not reported, since they seem to come from the underlying PDFMiner library. But I'd be happy to share a range of PDFs that cause the library to fail in different ways if you would like.

metebalci commented 3 years ago

Thanks, I will get in touch if I need more examples.

Meanwhile I will create two different issues for the cases above, and also close this issue as it was first opened for empty pdfs which I could not reproduce.

metebalci commented 3 years ago

@seamustuohy update about the files you mentioned:

anti-reverse-engineering-linux: the first page is an image, so there is no text to extract. You can pass --page-number 2 with the new version, and it should work.
Taking_The..: again the first page is an image, but also the 3rd page has the section title (Introduction) in a bigger font than the caption at the top of the page. Using another algorithm might work for this, but it is a very special case.
SYSTEM V...: should work with the new version, but there is an issue with the spacing between the first line and the second line. I want to fix the spacing issue in general, but it will probably take some time.
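
For anyone scripting this, a small sketch of calling the command line with the --page-number flag mentioned above. The helper names are hypothetical, and the -p option for the input file is an assumption taken from the project README, so check pdftitle --help for your installed version.

```python
import subprocess


def build_pdftitle_command(pdf_path, page_number=None):
    """Build the argument list for the pdftitle CLI (helper name is hypothetical)."""
    cmd = ["pdftitle", "-p", pdf_path]  # -p is assumed to select the input PDF
    if page_number is not None:
        cmd += ["--page-number", str(page_number)]
    return cmd


def title_from_page(pdf_path, page_number):
    """Run pdftitle on one page and return whatever it prints to stdout."""
    result = subprocess.run(
        build_pdftitle_command(pdf_path, page_number),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftitle exits with an error
    )
    return result.stdout.strip()
```

For a file whose first page is an image, title_from_page("anti-reverse-engineering-linux.pdf", 2) would then read the title from the second page.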

metebalci / pdftitle

"TypeError: 'NoneType' object is not subscriptable" almost every time I make inference #21