pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

AttributeError: 'PSKeyword' object has no attribute 'decode' #871

Open Beebruna opened 1 year ago

Beebruna commented 1 year ago

Bug report

I'm trying to extract text from the PDF downloaded by the script below, but extraction fails with an AttributeError:

import requests
from io import StringIO, BytesIO
from pdfminer.high_level import extract_text_to_fp

url = 'https://geoobras.tcm.pa.gov.br/Cidadao/Licitacao/Download?unidadegestora=71808&id=L2731_57534&extensao=pdf'
response = requests.get(url)

output_string = StringIO()
extract_text_to_fp(BytesIO(response.content), output_string)

Output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\high_level.py", line 132, in extract_text_to_fp
    interpreter.process_page(page)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 384, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 234, in get_font
    font = self.get_font(None, subspec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 225, in get_font
    font = PDFCIDFont(self, spec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdffont.py", line 1054, in __init__
    cid_registry = resolve1(self.cidsysteminfo.get("Registry", b"unknown")).decode(
AttributeError: 'PSKeyword' object has no attribute 'decode'
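
The last frame suggests that the "Registry" entry of this font's CIDSystemInfo resolved to a PSKeyword object rather than to bytes, so the unconditional .decode call fails. Below is a minimal sketch of a more defensive conversion. It is not the library's code or an accepted fix: `decode_registry_value` and `FakeKeyword` are hypothetical names introduced here for illustration, the `name` attribute on the keyword-like object is an assumption, and latin-1 is an assumed encoding.

```python
from typing import Any


class FakeKeyword:
    """Hypothetical stand-in for a keyword-like PDF object (assumed to expose .name)."""

    def __init__(self, name: bytes) -> None:
        self.name = name


def decode_registry_value(value: Any, default: str = "unknown") -> str:
    """Return a text form of value without assuming that it is bytes."""
    # The case the failing line expects: a plain byte string.
    if isinstance(value, bytes):
        return value.decode("latin-1")  # assumed encoding
    if isinstance(value, str):
        return value
    # Keyword-like objects: fall back to a name attribute if one exists.
    name = getattr(value, "name", None)
    if isinstance(name, bytes):
        return name.decode("latin-1")
    if isinstance(name, str):
        return name
    # Anything else: use the default instead of raising.
    return default
```

With such a helper, a byte string and a keyword-like object carrying the same name would both come back as the same str, and an unexpected type would yield the default instead of an AttributeError.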

Andy197527 commented 6 days ago

Have you solved this problem?

Beebruna commented 6 days ago

@Andy197527 No. I haven't used this library for a long time, so I never tried to solve this problem again.