jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

`extract_text(extra_attrs=["size"])` raises a parsing error #1030

Closed RitaMarques closed 8 months ago

RitaMarques commented 8 months ago

Describe the bug

When reading a simple PDF with the extract_text method and passing the list ["size", "fontname"] as extra_attrs, the following error is raised:

    404 def extract_text(self, **kwargs: Any) -> str:
--> 405     return self.get_textmap(**kwargs).as_string

TypeError: unhashable type: 'list'

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("Condioes_Gerais_Abertura_Conta.pdf") as pdf:
    page = pdf.pages[0]
    print(page.extract_text(layout=True, use_text_flow=True, extra_attrs=["size", "fontname"]))

PDF file

Condioes_Gerais_Abertura_Conta.pdf

jsvine commented 8 months ago

Hi @RitaMarques, and thanks for flagging this, which was indeed a bug.

Although there had been a test for Page.extract_words(extra_attrs=[...]), there wasn't yet one for Page.extract_text(extra_attrs=[...]). The addition of a caching layer caused this error to be thrown, since list kwargs can't be hashed for the cache.

This is now solved in 0bfffc2 by pre-processing the kwargs to convert lists into tuples.
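To illustrate the failure and the fix described above, here is a minimal sketch. It assumes the cache behaves like Python's `functools.lru_cache`; the names `cached_textmap`, `freeze_kwargs`, and `extract_text_sketch` are hypothetical stand-ins and are not pdfplumber's actual code.

```python
from functools import lru_cache
from typing import Any, Dict


@lru_cache(maxsize=None)
def cached_textmap(**kwargs: Any) -> str:
    # Hypothetical stand-in for an expensive, cached computation.
    # lru_cache builds its cache key by hashing the keyword arguments,
    # so every value passed in must be hashable.
    return ", ".join(f"{k}={v!r}" for k, v in sorted(kwargs.items()))


def freeze_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # Replace each top-level list value with an equivalent, hashable tuple.
    return {k: tuple(v) if isinstance(v, list) else v for k, v in kwargs.items()}


def extract_text_sketch(**kwargs: Any) -> str:
    # Pre-process the kwargs before they reach the cached function.
    return cached_textmap(**freeze_kwargs(kwargs))


if __name__ == "__main__":
    try:
        cached_textmap(extra_attrs=["size", "fontname"])
    except TypeError as exc:
        print(f"Without pre-processing: {exc}")

    print(extract_text_sketch(extra_attrs=["size", "fontname"]))
```

This sketch only converts top-level lists. A list nested inside another container would still be unhashable and would need recursive handling.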

For now (i.e., before the next release), you can solve your problem by defining the extra_attrs as a tuple instead of a list:

page.extract_text(
  layout=True,
  use_text_flow=True,
  extra_attrs=("size", "fontname")
)

Let us know if that doesn't work for you.

RitaMarques commented 8 months ago

Hi @jsvine, thanks for getting back to me! It's solved ;)