jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
5.99k stars 618 forks

page.to_image() PDFium: Data format error #1140

Closed Hucley closed 1 month ago

Hucley commented 1 month ago

Describe the bug

    try:
        page_image = page.to_image(resolution=self.resolution)
    except Exception:
        # Log the resolution, the page, and the full traceback before retrying.
        logger.error("---------extract_pdf_img----------")
        logger.error(f"resolution:{self.resolution},type:{type(self.resolution)}")
        logger.error(f"page:{page},page_number:{page.page_number}")
        error_info = traceback.format_exc()
        logger.error(error_info)
        # Retry at the default resolution; this call raises the same PdfiumError.
        page_image = page.to_image()
        self.res_scale_rate = 1.0

      File "/data2/doc2json/parse_utils/pdfplumber_postprocessing.py", line 115, in extract_pdf_img
        page_image = page.to_image(resolution=self.resolution)
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/page.py", line 535, in to_image
        return PageImage(
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/display.py", line 84, in __init__
        self.original = get_page_image(
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/display.py", line 56, in get_page_image
        pdfium_page = pypdfium2.PdfDocument(
      File "/usr/local/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
        self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
      File "/usr/local/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 679, in _open_pdf
    pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/data2/doc2json/parse_tools/pdfParse.py", line 772, in pdf2txt
        json_dict = p2j.PDF2json(pdf_path,file_buffer,begin_page_index,end_page_index,extract_page_index)
      File "/data2/doc2json/parse_utils/pdf2json.py", line 372, in PDF2json
        page_img_dict = self.extpdf.extract_pdf_img(page)
      File "/data2/doc2json/parse_utils/pdfplumber_postprocessing.py", line 120, in extract_pdf_img
        page_image = page.to_image()
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/page.py", line 535, in to_image
        return PageImage(
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/display.py", line 84, in __init__
        self.original = get_page_image(
      File "/usr/local/lib/python3.10/site-packages/pdfplumber/display.py", line 56, in get_page_image
        pdfium_page = pypdfium2.PdfDocument(
      File "/usr/local/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
        self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
      File "/usr/local/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 679, in _open_pdf
    pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

Environment

jsvine commented 1 month ago

Interesting, thanks. Can you provide the relevant PDF? It will be difficult to assess this issue without it.

Hucley commented 1 month ago

This problem occurs with any PDF file that contains illustrations, but so far only inside Docker. The same code does not hit the error on my local machine or when run directly on the server. There is too little information to determine whether it is related to the system environment.
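
Since the Environment section above is empty and the error only shows up inside Docker, one way to narrow things down might be to run the same small report in the container and on the host and compare the output. The sketch below is only an illustration: the function name `environment_report` and the package list are assumptions (pdfplumber and pypdfium2 come from the traceback; pdfminer.six and Pillow are included as likely dependencies), and it uses only the standard library.

```python
import platform
import sys
from importlib import metadata

# pdfplumber and pypdfium2 appear in the traceback; pdfminer.six and Pillow
# are assumed to be relevant dependencies and are listed for comparison only.
PACKAGES = ("pdfplumber", "pypdfium2", "pdfminer.six", "Pillow")


def environment_report(packages=PACKAGES):
    """Return a dict describing the interpreter, the OS, and package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A difference between the two outputs (for example, a different pypdfium2 version or CPU architecture in the image) would be a useful detail to add to the report.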

jsvine commented 1 month ago

Noted, thank you. Closing this issue given the lack of reproducibility and Docker-specificity. Feel free to continue the conversation here, though, particularly if you can provide more details or a script to fully reproduce the results.
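
As a starting point for such a script: PDFium generally reports "Data format error" when the bytes it receives are not a PDF or are corrupted, and the traceback shows a `file_buffer` being passed in. One possibility (an assumption, not something confirmed in this thread) is that the buffer looks different inside Docker. A minimal, standard-library-only sketch for checking that, with hypothetical helper names `looks_like_pdf` and `check_stream`:

```python
import io


def looks_like_pdf(data: bytes) -> dict:
    """Do a rough sanity check on raw bytes; this is not a full PDF validation."""
    head = data[:1024]   # the %PDF- header is expected near the start
    tail = data[-1024:]  # the %%EOF marker is expected near the end
    return {
        "size": len(data),
        "has_header": b"%PDF-" in head,
        "has_eof_marker": b"%%EOF" in tail,
    }


def check_stream(stream) -> dict:
    """Check a seekable binary stream from the start, then restore its position."""
    pos = stream.tell()
    stream.seek(0)
    try:
        return looks_like_pdf(stream.read())
    finally:
        stream.seek(pos)


if __name__ == "__main__":
    sample = io.BytesIO(b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n")
    print(check_stream(sample))
```

Running `check_stream` on the same buffer in the container and on the host, just before `page.to_image()` is called, would show whether the input bytes are the same in both environments.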