jesselau76 / ebook-GPT-translator

Enjoy reading with your favorite style.
https://jesselau.com
MIT License

Invalid cross-reference (XRef) table encountered when processing a PDF file #27

Open yoyicue opened 1 year ago

yoyicue commented 1 year ago

Parsing this optimized PDF fails with the error below, although DeepL handles the same file without problems. https://assets.ctfassets.net/95kuvdv8zn1v/44FqPJmYPZRwiZN2socdOK/14f5eb025d87a452100d80f513567f2a/Cruise_Impact_Report_-_2022-optimized.pdf

Converting PDF to text:   0% 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 282, in load
    raise PDFNoValidXRef("Invalid PDF stream spec.")
pdfminer.pdfdocument.PDFNoValidXRef: Invalid PDF stream spec.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 347, in <module>
    text = convert_pdf_to_text(filename,startpage,endpage)
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 221, in convert_pdf_to_text
    end_page = get_total_pages(pdf_filename)
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 217, in get_total_pages
    document = PDFDocument(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 727, in __init__
    newxref.load(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 241, in load
    (_, obj) = parser.nextobject()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 609, in nextobject
    (pos, token) = self.nexttoken()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 526, in nexttoken
    self.fillbuf()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 239, in fillbuf
    raise PSEOF("Unexpected EOF")
pdfminer.psparser.PSEOF: Unexpected EOF
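
The traceback shows pdfminer first rejecting the XRef table (PDFNoValidXRef) and then hitting PSEOF while its own fallback re-scans the file, so the exception escapes get_total_pages. One possible workaround, sketched below under assumptions, is to try several page counters in order and use the first one that succeeds. The function names here (count_pages_pdfminer, count_pages_pypdf, first_successful_page_count) are hypothetical and not part of text_translation.py. Using pypdf as the second parser is also an assumption, and it is not confirmed to recover this particular file.

```python
from typing import Callable, Iterable, Optional


def count_pages_pdfminer(path: str) -> int:
    """Count pages with pdfminer, the parser that appears in the traceback."""
    # Imported lazily so the helper below works even if pdfminer is absent.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fh:
        document = PDFDocument(PDFParser(fh))
        return sum(1 for _ in PDFPage.create_pages(document))


def count_pages_pypdf(path: str) -> int:
    """Assumed fallback: pypdf in non-strict mode, which tolerates some damaged files."""
    from pypdf import PdfReader

    reader = PdfReader(path, strict=False)
    return len(reader.pages)


def first_successful_page_count(
    path: str, counters: Iterable[Callable[[str], int]]
) -> Optional[int]:
    """Return the result of the first counter that does not raise, else None."""
    for counter in counters:
        try:
            return counter(path)
        except Exception as exc:  # PDFNoValidXRef, PSEOF, and similar parse errors
            name = getattr(counter, "__name__", repr(counter))
            print(f"{name} failed on {path}: {exc!r}")
    return None
```

A caller could then write `first_successful_page_count(pdf_filename, [count_pages_pdfminer, count_pages_pypdf])` and treat a `None` result as "page count unknown" instead of crashing before any text is converted.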