pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Some image colors are changed after extraction #888

Open omvishwas opened 1 year ago

omvishwas commented 1 year ago

I am using the code below to extract images from a PDF. It works fine for some images, but for others the colors change. For example, I have an image that is orange in the PDF, but after extraction it is pink.

Can you please suggest how I can overcome this?

This is the code I am using for image extraction.

```python
import pdfminer
from pdfminer.image import ImageWriter
from pdfminer.high_level import extract_pages

pages = list(extract_pages('Socument.pdf'))
page = pages[0]


def get_image(layout_object):
    if isinstance(layout_object, pdfminer.layout.LTImage):
        return layout_object
    if isinstance(layout_object, pdfminer.layout.LTContainer):
        for child in layout_object:
            return get_image(child)
    else:
        return None


def save_images_from_page(page: pdfminer.layout.LTPage):
    images = list(filter(bool, map(get_image, page)))
    iw = ImageWriter('image')
    for image in images:
        iw.export_image(image)


save_images_from_page(page)
```

omvishwas commented 1 year ago

(Attached images: Im0, 1, Capture)

khuongdvan commented 8 months ago

I have the same problem. After extraction, some images are saved with a .bmp extension and either have changed colors or are unreadable. This is my code:


```python
import pdfminer
from pdfminer.image import ImageWriter
from pdfminer.high_level import extract_pages

def get_image(layout_object):
    # recursively locate Image objects in page_layout
    if isinstance(layout_object, pdfminer.layout.LTImage):
        return [layout_object]
    if isinstance(layout_object, pdfminer.layout.LTContainer):
        img_list = []
        for child in layout_object:
            img_list = img_list + get_image(child)
        return img_list
    else:
        return []

def extract_pdf_img(pdf_filepath):
    iw = ImageWriter('output_dir')
    for page_layout in extract_pages(pdf_filepath):
        image_list = get_image(page_layout)
        if len(image_list):
            for image in image_list:
                iw.export_image(image)

if __name__ == "__main__":
    pdf_filepath = "sample.pdf"
    extract_pdf_img(pdf_filepath)
```

pietermarsman commented 8 months ago

@omvishwas @khuongdvan Can either of you share the pdf? Otherwise we cannot debug.
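
When the file itself cannot be shared, the image metadata may still help narrow things down. The following is a minimal sketch, not part of pdfminer.six: `describe_image`, `iter_images` and `report` are hypothetical helper names, and it assumes that `LTImage` exposes `name`, `srcsize`, `bits`, `colorspace` and `stream.get_filters()` as in recent pdfminer.six releases. It prints the color space, bit depth and stream filters of each image, which appear to be the values `ImageWriter` looks at when choosing an output format.

```python
def describe_image(image):
    """Collect the color-related attributes of an LTImage-like object."""
    return {
        "name": image.name,
        "size": image.srcsize,  # (width, height) of the source image
        "bits": image.bits,     # bits per color component
        "colorspace": [str(cs) for cs in image.colorspace],
        "filters": [str(f) for f, _params in image.stream.get_filters()],
    }


def iter_images(layout_object):
    """Recursively yield every LTImage below a layout object."""
    # Imported lazily so describe_image can be used without pdfminer.six.
    from pdfminer.layout import LTContainer, LTImage

    if isinstance(layout_object, LTImage):
        yield layout_object
    elif isinstance(layout_object, LTContainer):
        for child in layout_object:
            yield from iter_images(child)


def report(pdf_filepath):
    """Print one line of metadata per image found in the PDF."""
    from pdfminer.high_level import extract_pages

    for page_number, page_layout in enumerate(extract_pages(pdf_filepath), start=1):
        for image in iter_images(page_layout):
            print(page_number, describe_image(image))
```

Calling `report("sample.pdf")` and comparing the output for an image that extracts correctly with one that does not may show which combination of color space and filter is affected.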

omvishwas commented 4 months ago

Hi @pietermarsman,

Please find the PDF attached, and let me know if you need any more information.

cirse_PIB_2021_arterial_angioplasty_and_stenting_IT.pdf