Open luojunhui1 opened 1 year ago
Please attach the input PDF
@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below:
def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()
    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])
    # check whether we have read a Document
    assert doc is not None
    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")
    # list the XObject names found in each page's resources
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))
    # collect the images that the ImageExtraction listener actually returned
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")
the test output screenshot is
I checked the images in your PDF.
It turns out borb does not support them yet.
That's why they are not extracted.
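To see which images are affected, it can help to list the image XObjects that are still raw streams, together with their filters. The sketch below is not part of borb: `list_undecoded_images` is a hypothetical helper, and it assumes that an image borb could not decode stays in the page resources as a dictionary-like stream with `Subtype` and `Filter` entries (the keys the PDF format defines for image XObjects).

```python
from collections.abc import Mapping


def list_undecoded_images(resources):
    """Return (name, [filter names]) for image XObjects that are still raw streams.

    Hypothetical helper, not borb API. Assumption: a decoded image is no
    longer a mapping, while an undecoded one is still a stream dictionary.
    """
    found = []
    if not isinstance(resources, Mapping):
        return found
    xobjects = resources.get("XObject", {})
    for name, obj in xobjects.items():
        # Skip anything that is not a dictionary-like image stream.
        if not isinstance(obj, Mapping) or str(obj.get("Subtype")) != "Image":
            continue
        filters = obj.get("Filter")
        if filters is None:
            filters = []
        elif not isinstance(filters, (list, tuple)):
            # A single filter name is allowed as well as an array of names.
            filters = [filters]
        found.append((str(name), [str(f) for f in filters]))
    return found
```

Called as `list_undecoded_images(doc.get_page(page)["Resources"])` inside the page loop of the test above, this should show which filter names the skipped images use, provided the assumption about undecoded streams holds.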
What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.
You would have to implement your own version of an ImageTransformer (package io and read).
Essentially you need to:
I also encountered this problem. My PDF contains some pictures in PNG format, and I found that they cannot be extracted. The steps are as follows:
I have to say, I am still learning the code, so this may not be the best solution.
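For reference, PNG-style images are usually stored in a PDF as `FlateDecode` streams, so the core of a custom transformer is often just inflating the stream and checking its size. A minimal sketch, assuming 8-bit samples and no predictor (no `DecodeParms` entry); `decode_flate_image` and its parameters are hypothetical names and not borb API:

```python
import zlib


def decode_flate_image(data, width, height, components=3, bits=8):
    """Inflate a FlateDecode image stream and split it into rows of raw samples.

    Assumptions (hypothetical sketch): no predictor is in use, and the
    caller passes the Width, Height and component count from the image
    dictionary (components=3 corresponds to DeviceRGB).
    """
    raw = zlib.decompress(data)
    # Each row is padded to a whole number of bytes.
    row_len = (width * components * bits + 7) // 8
    expected = row_len * height
    if len(raw) != expected:
        # With a PNG predictor every row carries one extra filter-type byte,
        # so a mismatch here usually means DecodeParms must be handled first.
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    return [raw[i * row_len:(i + 1) * row_len] for i in range(height)]
```

If Pillow is available, the joined rows can be turned into an image with `Image.frombytes("RGB", (width, height), b"".join(rows))` for the 8-bit RGB case.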
Describe the bug
Not all the images in the PDF are extracted.
To Reproduce
Expected behaviour
The ImageExtraction listener should return all the images.