Open rerik opened 3 weeks ago
Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?
Oh, I'm sorry, my bad. I was absolutely sure I had given the link to the target file: https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf
Minimal Python script to reproduce:
import io
import requests
import pdfplumber as pp
SOURCE = 'https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf'
response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content))
page = doc.pages[1]
image = page.images[0]
page.crop((
    image['x0'],
    image['top'],
    image['x1'],
    image['bottom'],
)).to_image(resolution=300).save('img.jpg')
Thank you, this is very helpful. I can reproduce the issue, and will see if I can find a solution.
Describe the bug
This is a two-in-one problem.
First, the raw image data (for example, doc.pages[1].images[0]['stream'].rawdata) is broken. Calling PIL.Image.open(io.BytesIO(doc.pages[1].images[0]['stream'].rawdata)) raises UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7f62058f5df0>'). If I save the image bytes directly to a file, the file is broken and cannot be opened. I also tried getting the raw bytes of this image with the pypdf library: the result contains about twice as many bytes and can be saved easily, so the problem is not with the image itself.
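One way to narrow down the first problem is to look at the leading bytes of rawdata before handing them to PIL. The sketch below is only a diagnostic aid under stated assumptions: sniff_bytes is a hypothetical helper, not part of pdfplumber or pdfminer.six, and it has not been run against the PDF from this issue. It distinguishes bytes that are already a JPEG or PNG file from bytes that look like a zlib-compressed stream, which PIL.Image.open cannot identify as an image file.

```python
# Hypothetical helper for illustration; not part of pdfplumber or pdfminer.six.
def sniff_bytes(data: bytes) -> str:
    """Guess what a byte string holds from its leading "magic" bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # A zlib stream starts with a two-byte header (CMF, FLG): the low nibble
    # of CMF is 8 (deflate) and the 16-bit value CMF*256 + FLG is a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and ((data[0] << 8) | data[1]) % 31 == 0:
        return "zlib"
    return "unknown"
```

For example, sniff_bytes(doc.pages[1].images[0]['stream'].rawdata) could be called on the image from the script. If it reports "zlib", the stream is likely still encoded, and the get_data() method of the pdfminer.six stream object returns the decoded bytes. Those decoded bytes would presumably be bare pixel data rather than an image file, so this is a starting point for diagnosis and not a fix.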
Second, if I try to save a crop using this image's bbox, the crop misses the image.
This code saves [image: actual crop] instead of [image: expected crop].
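To check whether the reported image coordinates are plausible before cropping, a small comparison against the page's own bounding box may help. This is a minimal sketch: image_bbox and bbox_within are hypothetical helpers written for illustration, not pdfplumber functions, and the only assumption about pdfplumber is that image dicts carry the x0, top, x1 and bottom keys used in the script.

```python
# Hypothetical helpers for illustration; neither is part of pdfplumber.
def image_bbox(image):
    """Return (x0, top, x1, bottom) from a pdfplumber-style image dict."""
    return (image["x0"], image["top"], image["x1"], image["bottom"])


def bbox_within(inner, outer):
    """Return True if bbox `inner` lies entirely inside bbox `outer`."""
    ix0, itop, ix1, ibottom = inner
    ox0, otop, ox1, obottom = outer
    return ox0 <= ix0 <= ix1 <= ox1 and otop <= itop <= ibottom <= obottom
```

With the script from this thread, bbox_within(image_bbox(page.images[0]), page.bbox) would show whether the image's bbox falls inside the page. A False result would suggest the coordinates themselves are off, while True would point toward how the crop is rendered.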
Have you tried repairing the PDF?
Yes, I've tried. In that case it just crashes on opening:
Environment