jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType' #726

Closed loganathanspr closed 1 year ago

loganathanspr commented 2 years ago

Describe the bug

I am extracting annotations from a PDF file, and accessing the page's .annots property raises this TypeError. When I edited each annotation manually (just adding or deleting one character), the error went away. I suspect the original text encoding of the annotations differs from what pdfplumber expects. Does pdfplumber make any strict assumptions about text encoding?

Code to reproduce the problem

import pdfplumber


def get_pdf_annotations(pdf_path: str):
  """Get all annotations (by page) for a pdf file.

  Args:
    pdf_path (str): Path to pdf file.

  Returns:
    List of annotations: List index corresponds to page numbers (starting from 0)
    and each list item is a list of annotations found for that page.
  """
  annots_all_pages = []
  with pdfplumber.open(pdf_path) as pdf:
    pages = pdf.pages
    for p in pages:
      page_annots = []
      texts = []
      colors = []
      annotations = p.annots
      # ...
      # ....
  return annots_all_pages

Screenshots

Screenshot 2022-09-07 at 14 16 50

Environment

jsvine commented 2 years ago

Thanks for flagging @loganathanspr! Looking at the stack trace, my best guess is that the annotation has an undefined bounding box. (That would explain an error on line 167, where the stack trace is pointing.) But it's a bit difficult to know for sure, or to test a fix, without seeing the actual PDF. Are you able to share that?
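To illustrate the hypothesis, here is a minimal sketch of how an undefined coordinate would produce exactly this message. It is not pdfplumber's actual code (what line 167 does is not shown in this thread). The function names `flip_y` and `safe_flip_y` and the page height of 792.0 are hypothetical, chosen only for the example, which assumes the failing line subtracts a bounding-box coordinate from a float such as the page height.

```python
from typing import Optional


def flip_y(page_height: float, y: Optional[float]) -> float:
    """Convert a bottom-origin PDF y-coordinate to a top-origin one."""
    # Hypothetical stand-in for the failing line: if y is None
    # (an undefined bounding-box coordinate), this raises TypeError.
    return page_height - y


def safe_flip_y(page_height: float, y: Optional[float]) -> Optional[float]:
    """Same conversion, but pass an undefined coordinate through as None."""
    if y is None:
        return None
    return page_height - y


if __name__ == "__main__":
    try:
        flip_y(792.0, None)
    except TypeError as exc:
        # Prints the same message as the issue title.
        print(exc)
    print(safe_flip_y(792.0, 100.0))
    print(safe_flip_y(792.0, None))
```

Whether returning None is the right behavior for such annotations, rather than skipping them, is a design choice that would need the real PDF to evaluate.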

jsvine commented 1 year ago

Hi @loganathanspr, just checking back: Are you able to provide the original PDF?

loganathanspr commented 1 year ago

Hi @jsvine, sorry for the late reply. Unfortunately, I am not able to share the PDF. The error happens only with certain annotations, and the same file seems to work with PyMuPDF, so that is our current workaround. I will share a document if I see this happen elsewhere. Thank you!
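For anyone who hits the same error, the workaround can be applied per page rather than for the whole file. The sketch below is an assumption about how one might structure it, not the code used in this project: `extract_with_fallback`, `primary`, and `fallback` are hypothetical names, and both extractors are passed in as plain callables so that the helper itself depends on neither library.

```python
from typing import Any, Callable, List


def extract_with_fallback(
    page_count: int,
    primary: Callable[[int], List[Any]],
    fallback: Callable[[int], List[Any]],
) -> List[List[Any]]:
    """Return one list of annotations per page index.

    `primary` is tried first for each page; if it raises TypeError
    (the error reported in this issue), `fallback` is used for that
    page instead. Other exceptions are not caught.
    """
    results: List[List[Any]] = []
    for index in range(page_count):
        try:
            results.append(primary(index))
        except TypeError:
            results.append(fallback(index))
    return results
```

With an open pdfplumber document `pdf` and an open PyMuPDF document `doc` for the same file, `primary` could be `lambda i: pdf.pages[i].annots` and `fallback` could be `lambda i: [a.info for a in doc[i].annots()]`. Note that the two libraries return differently shaped records, so downstream code would have to handle both.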

jsvine commented 1 year ago

Thanks, @loganathanspr. Closing this for now, but feel free to reopen if you come across another, shareable PDF that raises similar errors.