jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

BUG: ImageExtraction not extracting all the images in pdf #162

Open luojunhui1 opened 1 year ago

luojunhui1 commented 1 year ago

Describe the bug

ImageExtraction does not extract all the images in a PDF.

To Reproduce

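  1. Take a PDF file with 9 pages that has one image on each of pages 6, 7 and 8 (page numbers start at 0).
  2. ImageExtraction only detects the image on page 7 and ignores the images on pages 6 and 8.
  3. Run the following code:

    # Imports assume a recent borb 2.x layout; module paths may differ between versions.
    import typing

    from borb.pdf import Document, PDF
    from borb.toolkit import ImageExtraction, SimpleTextExtraction

    # read the Document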
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()
    
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])
    
    # check whether we have read a Document
    assert doc is not None
    
    images = []
    
    for page in range(0, 9):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))
    
    for page, content in image_l.get_images().items():
        images.extend(content)
        print(f"image page: {page}")
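
To make the mismatch explicit, the two printouts above can be compared programmatically. The following is a minimal sketch, not borb code: `pages_missing_images` and the sample dictionaries are hypothetical, and it assumes the XObject names per page and the result of `get_images()` have already been collected into plain dictionaries keyed by page number. Not every XObject is necessarily an image, so a reported page is only a candidate.

```python
import typing


def pages_missing_images(
    xobject_names_per_page: typing.Dict[int, typing.List[str]],
    extracted_images_per_page: typing.Dict[int, typing.List[typing.Any]],
) -> typing.List[int]:
    """Return pages that declare XObjects but yielded no extracted images."""
    missing = []
    for page_nr, names in sorted(xobject_names_per_page.items()):
        # A page is suspicious if it lists XObjects but the listener
        # returned nothing (or an empty list) for it.
        if names and not extracted_images_per_page.get(page_nr):
            missing.append(page_nr)
    return missing


# Hypothetical data mirroring the report: XObjects on pages 6, 7 and 8,
# but the listener only returned an image for page 7.
declared = {6: ["Im0"], 7: ["Im1"], 8: ["Im2"]}
extracted = {7: ["<image>"]}
print(pages_missing_images(declared, extracted))  # prints [6, 8]
```

In the real script, `declared` would be filled inside the first loop and `extracted` would be the dictionary returned by `image_l.get_images()`.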

Expected behaviour

The ImageExtraction listener should return all the images.

Screenshots image

jorisschellekens commented 1 year ago

Please attach the input PDF

luojunhui1 commented 1 year ago

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still incorrect. The complete test code is below:

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    # list the XObjects declared in each page's resources
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    # list the pages for which the listener actually returned images
    for page, content in image_l.get_images().items():
        images.extend(content)
        print(f"image page: {page}")

The test output is shown in this screenshot: image

input_doc2.pdf

jorisschellekens commented 1 year ago

I checked the images in your PDF. It turns out borb does not support them yet, which is why they are not extracted.

luojunhui1 commented 1 year ago

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

jorisschellekens commented 1 year ago

You would have to implement your own version of an ImageTransformer (package io and read).

Essentially you need to:

hdoer commented 1 year ago

I also ran into this problem. My PDF contains some PNG images, and I found that they could not be extracted. I worked around it with the following steps:

  1. Write a PngImageTransformer.
  2. Write a new loads function modelled on PDF.loads().
  3. Add code that inserts a PngImageTransformer instance into the ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
  4. Get the images with the get_images function.

I have to say, I am still learning the code, so this may not be the best solution.
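
The idea behind step 3 can be illustrated without borb. The sketch below uses stand-in classes that only mimic a chain of transformers. Apart from `get_children()` and the class name `PngImageTransformer`, which come from the steps above, every name and signature here (`Transformer`, `FallbackStreamTransformer`, `can_be_transformed`, `transform`) is an assumption made for illustration and may not match borb's real classes. It shows why inserting at index 0 matters: the first child that accepts an object wins.

```python
import typing


class Transformer:
    """Stand-in base class: one link in a chain of object transformers."""

    def __init__(self) -> None:
        self._children: typing.List["Transformer"] = []

    def get_children(self) -> typing.List["Transformer"]:
        return self._children

    def can_be_transformed(self, obj: typing.Any) -> bool:
        return False

    def transform(self, obj: typing.Any) -> typing.Any:
        # Delegate to the first child that accepts the object.
        for child in self._children:
            if child.can_be_transformed(obj):
                return child.transform(obj)
        return obj  # nobody handled it: pass the object through unchanged


class FallbackStreamTransformer(Transformer):
    """Stand-in for a generic handler that accepts any stream dictionary."""

    def can_be_transformed(self, obj: typing.Any) -> bool:
        return isinstance(obj, dict)

    def transform(self, obj: typing.Any) -> typing.Any:
        return ("raw-stream", obj)


class PngImageTransformer(Transformer):
    """Hypothetical handler for image streams the chain would otherwise miss."""

    def can_be_transformed(self, obj: typing.Any) -> bool:
        return isinstance(obj, dict) and obj.get("Subtype") == "Image"

    def transform(self, obj: typing.Any) -> typing.Any:
        # A real implementation would decode the stream bytes into an image here.
        return ("image", obj)


root = Transformer()
root.get_children().append(FallbackStreamTransformer())
image_stream = {"Subtype": "Image"}

before = root.transform(image_stream)
# Step 3: put the new transformer in front so it is asked first.
root.get_children().insert(0, PngImageTransformer())
after = root.transform(image_stream)

print(before[0], "->", after[0])  # prints raw-stream -> image
```

Appending the new transformer instead of inserting it at the front would leave the result unchanged in this sketch, because the generic handler would still claim the stream first.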