Open Beebruna opened 1 year ago
Bug report
I'm trying to extract text from the PDF at the URL below, but I get the following error:
import requests
from io import StringIO, BytesIO
from pdfminer.high_level import extract_text_to_fp

url = 'https://geoobras.tcm.pa.gov.br/Cidadao/Licitacao/Download?unidadegestora=71808&id=L2731_57534&extensao=pdf'
response = requests.get(url)
output_string = StringIO()
extract_text_to_fp(BytesIO(response.content), output_string)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\high_level.py", line 132, in extract_text_to_fp
    interpreter.process_page(page)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 384, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 234, in get_font
    font = self.get_font(None, subspec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdfinterp.py", line 225, in get_font
    font = PDFCIDFont(self, spec)
  File "C:\Users\debora.morais\Documents\Projetos VSCode\extracao-ocr\env\lib\site-packages\pdfminer\pdffont.py", line 1054, in __init__
    cid_registry = resolve1(self.cidsysteminfo.get("Registry", b"unknown")).decode(
AttributeError: 'PSKeyword' object has no attribute 'decode'
Have you solved this problem?
@Andy197527 No. I haven't used this library for a long time, so I never tried to solve this problem again.
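For anyone landing here later: the traceback suggests that the font's "Registry" entry resolved to a PSKeyword object rather than bytes, so the call to .decode() fails. Below is a minimal sketch of a defensive conversion that tolerates both cases. It is not a patch to pdfminer and does not call pdfminer at all; to_text and FakeKeyword are hypothetical names used only for illustration, and the assumption that a keyword-like object exposes its value through a name attribute should be checked against the installed version.

```python
def to_text(value, default="unknown", encoding="latin-1"):
    """Best-effort conversion of a PDF dictionary value to str.

    Hypothetical helper, not part of pdfminer.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    # Assumption: keyword-like objects carry their value in a `name` attribute.
    name = getattr(value, "name", None)
    if isinstance(name, bytes):
        return name.decode(encoding)
    if isinstance(name, str):
        return name
    return default


class FakeKeyword:
    """Hypothetical stand-in for a keyword-like object holding a bytes name."""

    def __init__(self, name):
        self.name = name


print(to_text(b"Adobe"))               # plain bytes, the case the failing line expects
print(to_text(FakeKeyword(b"Adobe")))  # keyword-like object, the case that crashed
print(to_text(None))                   # anything else falls back to the default
```

Whether a conversion like this is the right fix depends on how the library is meant to treat malformed CIDSystemInfo entries, so treat it as a starting point for debugging rather than a confirmed solution.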