BUG: ImageExtraction not extracting all the images in pdf

luojunhui1 commented 1 year ago

Describe the bug: ImageExtraction does not extract all the images in a PDF.

To Reproduce

For a PDF file with 9 pages, there is one image on each of pages 6, 7, and 8 (page numbers start at 0).
ImageExtraction only detected the image on page 7 and ignored the images on pages 6 and 8.

# read the Document
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()

with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])

# check whether we have read a Document
assert doc is not None

images = []

for page in range(0, 9):
    resources = doc.get_page(page)["Resources"]
    if "XObject" in resources:
        # list the name of every XObject declared on this page
        for k in resources["XObject"].keys():
            print("%d\t%s" % (page, k))

# collect the images the listener actually extracted, page by page
for page, content in image_l.get_images().items():
    images.extend(content)
    print(f"image page: {page}")

Expected behaviour: the ImageExtraction listener should return all the images.

Desktop:

OS: Windows 10
borb version: 2.1.10

jorisschellekens commented 1 year ago

Please attach the input PDF

luojunhui1 commented 1 year ago

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still incorrect. The complete test code is below.

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))

    for page, content in image_l.get_images().items():
        images += (content)
        print(f"image page: {page}")

Test output: [screenshot]

Attached input file: input_doc2.pdf

jorisschellekens commented 1 year ago

I checked the images in your PDF. It turns out borb does not support them yet, which is why they are not extracted.
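
To see which images fall outside what a library handles, it can help to list each XObject's /Subtype and /Filter alongside its name. The following is a minimal sketch that uses plain dictionaries as a stand-in for a page's resources. The names describe_xobjects and sample_resources are hypothetical, and it is an assumption that the dictionaries returned by doc.get_page(n)["Resources"] can be read the same way.

```python
from typing import Any, Dict, List, Tuple


def describe_xobjects(resources: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (name, subtype, filter) for every XObject in a resources dictionary."""
    rows = []
    for name, xobject in resources.get("XObject", {}).items():
        subtype = str(xobject.get("Subtype", "?"))
        filters = xobject.get("Filter", "none")
        # /Filter may be a single name or an array of names
        if isinstance(filters, (list, tuple)):
            filters = "+".join(str(f) for f in filters)
        rows.append((str(name), subtype, str(filters)))
    return rows


# Hypothetical data standing in for doc.get_page(n)["Resources"]
sample_resources = {
    "XObject": {
        "Im0": {"Subtype": "Image", "Filter": "DCTDecode"},
        "Im1": {"Subtype": "Image", "Filter": ["FlateDecode"]},
        "Fm0": {"Subtype": "Form"},
    }
}

for row in describe_xobjects(sample_resources):
    print("%s\t%s\t%s" % row)
```

Comparing this listing with the pages reported by get_images() would show which filter or subtype the skipped images have in common.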

luojunhui1 commented 1 year ago

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

jorisschellekens commented 1 year ago

You would have to implement your own version of an ImageTransformer (packages io and read).

Essentially you need to:

- identify when this transformer needs to be triggered
- determine what this transformer needs to do to convert the raw bytes to a PIL Image
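
The two points above can be sketched roughly as follows. This is not borb's actual transformer interface: SketchImageTransformer, the method names, and the "Subtype" and "Bytes" dictionary keys are all assumptions made for illustration. The trigger check looks at the subtype and the leading magic bytes, and the conversion hands the bytes to Pillow's Image.open.

```python
import io
from typing import Any, Dict, Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(raw: bytes) -> Optional[str]:
    """Guess the image file format from the leading magic bytes."""
    if raw.startswith(PNG_SIGNATURE):
        return "png"
    if raw.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


class SketchImageTransformer:
    """Hypothetical stand-in for a custom image transformer (not a borb class)."""

    def can_be_transformed(self, stream: Dict[str, Any]) -> bool:
        # Step 1: decide when the transformer should be triggered
        if stream.get("Subtype") != "Image":
            return False
        return sniff_image_format(stream.get("Bytes", b"")) is not None

    def transform(self, stream: Dict[str, Any]):
        # Step 2: convert the raw bytes to a PIL Image
        from PIL import Image  # Pillow, imported lazily so the check above works without it

        return Image.open(io.BytesIO(stream["Bytes"]))


transformer = SketchImageTransformer()
png_stream = {"Subtype": "Image", "Bytes": PNG_SIGNATURE + b"rest of file"}
form_stream = {"Subtype": "Form", "Bytes": b"q 1 0 0 1 0 0 cm Q"}
print(transformer.can_be_transformed(png_stream))   # True
print(transformer.can_be_transformed(form_stream))  # False
```

Image.open only works when the stream holds a complete image file. If the stream holds bare pixel data instead, the conversion would more likely need Image.frombytes together with the width, height, and colour mode taken from the stream dictionary.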

hdoer commented 1 year ago

I also encountered this problem. My PDF contains some pictures in PNG format, and I found that they cannot be extracted. I worked around it with the following steps:

1. Write a PngImageTransformer.
2. Write a new loads function modeled on PDF.loads().
3. Add code to insert a PngImageTransformer instance into ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
4. Get the images with the get_images function.
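
Why the insert at index 0 matters can be illustrated with a small, self-contained sketch. It assumes the root transformer tries its children in order and uses the first one that accepts the object, which is an assumption about the dispatch rather than something confirmed in this thread. RootTransformer and GenericStreamTransformer are hypothetical stand-ins, and this PngImageTransformer is a toy version, not the one described in the steps above.

```python
from typing import Any, List


class RootTransformer:
    """Hypothetical root that delegates to the first child accepting the object."""

    def __init__(self) -> None:
        self._children: List[Any] = []

    def get_children(self) -> List[Any]:
        return self._children

    def transform(self, obj: Any) -> Any:
        for child in self._children:
            if child.can_be_transformed(obj):
                return child.transform(obj)
        return None


class GenericStreamTransformer:
    """Accepts any dictionary-like stream."""

    def can_be_transformed(self, obj: Any) -> bool:
        return isinstance(obj, dict)

    def transform(self, obj: Any) -> str:
        return "generic"


class PngImageTransformer:
    """Toy version: accepts only streams whose bytes start with the PNG signature."""

    def can_be_transformed(self, obj: Any) -> bool:
        return isinstance(obj, dict) and obj.get("Bytes", b"").startswith(b"\x89PNG\r\n\x1a\n")

    def transform(self, obj: Any) -> str:
        return "png-image"


root = RootTransformer()
root.get_children().append(GenericStreamTransformer())
png_stream = {"Bytes": b"\x89PNG\r\n\x1a\n" + b"rest of file"}

before = root.transform(png_stream)  # the generic child claims the stream
root.get_children().insert(0, PngImageTransformer())
after = root.transform(png_stream)   # the PNG child is now tried first
print(before, after)
```

Under that assumption, appending the new transformer at the end would leave it unused whenever an earlier, more general child already accepts the same stream.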

I should add that I am still learning the code, so this may not be the best solution.

jorisschellekens / borb #162