pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

When a page in the PDF is abnormal, the renderer hangs indefinitely once it reaches that page #324

Closed haike-1213 closed 3 months ago

haike-1213 commented 3 months ago

Checklist

Description

from io import BytesIO

import pypdfium2 as pdfium


def convert_pdf_to_images(file_path, scale=300 / 72):
    pdf_file = pdfium.PdfDocument(file_path)

    page_indices = list(range(len(pdf_file)))

    # Document-level renderer: yields one PIL image per requested page.
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )

    final_images = []

    # Per the issue title, iteration stalls when the renderer reaches the abnormal page.
    for i, image in zip(page_indices, renderer):
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        # Store the encoded JPEG bytes keyed by page index.
        final_images.append({i: image_byte_array.getvalue()})

    return final_images

Install Info

renderer = pdf_file.render(
    pdfium.PdfBitmap.to_pil,
    page_indices=page_indices,
    scale=scale,
)
# # Convert the iterator to a list so that all rendering results are available immediately
# rendered_pages = list(renderer)

mara004 commented 3 months ago

Machine-generated translations (Chinese -> English):

Title:

"当,pdf某一页异常的时候,renderer 到那一页后就会一直卡顿" → "When a page in the PDF is abnormal, the renderer will be stuck after reaching that page"

Comment in install info:

"将迭代器转换为列表,这样可以立即获取所有渲染结果" → "Convert the iterator to a list so that all rendering results can be obtained immediately"

mara004 commented 3 months ago

In general, if you want support, you should provide valid install info (at least the pypdfium2 version used), as well as a PDF that triggers the issue.

Note that the document-level renderer is deprecated; see https://pypdfium2.readthedocs.io/en/stable/changelog.html#rationale-for-pdfdocument-render-deprecation. If you are using a version prior to 4.25, it is unfortunately quite plausible that it gets stuck.

Also note that you should structure your code to process pages iteratively, one at a time. Keeping all renderings in memory is problematic, especially for long PDFs or high scales.
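
For illustration, the advice above could be followed with something like the sketch below. It is not taken from this thread: it assumes the page-level API of pypdfium2 v4 (`pdf[index]`, `page.render(scale=...)`, `bitmap.to_pil()`), and the names `iter_page_jpegs`, `save_pdf_pages` and the `page_NNNN.jpg` file naming are hypothetical. It renders one page at a time and hands each result over immediately instead of collecting everything in a list. Whether it avoids the hang on a particular broken PDF is untested.

```python
import os
from io import BytesIO


def iter_page_jpegs(pdf, scale=300 / 72):
    """Yield (page_index, jpeg_bytes) lazily, one page at a time."""
    for index in range(len(pdf)):
        page = pdf[index]                          # load a single page
        image = page.render(scale=scale).to_pil()  # page-level render, then PIL image
        buffer = BytesIO()
        image.save(buffer, format="jpeg", optimize=True)
        yield index, buffer.getvalue()


def save_pdf_pages(file_path, out_dir, scale=300 / 72):
    """Hypothetical driver: write each page to disk as soon as it is rendered."""
    # Imported here so that iter_page_jpegs itself has no hard dependency.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(file_path)
    written = 0
    for index, data in iter_page_jpegs(pdf, scale=scale):
        with open(os.path.join(out_dir, f"page_{index:04d}.jpg"), "wb") as fh:
            fh.write(data)
        written += 1
    return written
```

Because `iter_page_jpegs` is a generator, only one rendered page is held in memory at a time, and printing `index` inside the loop would show which page the process is on if it stalls.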