Closed stefan6419846 closed 1 year ago
@stefan6419846 Inline image names (not including the extension) start and end with "~". You should be able to `continue` in your loops when the name starts with "~".
I am aware of that, but given that this seems to reduce performance considerably just to throw these items away afterwards, a dedicated option might still make sense here.
For performance, try first to extract the keys with images.keys() and then extract the images.
Could you please elaborate on this? Calling page.images.keys() is the same as calling page._get_ids_image(), which is the expensive method I mentioned above where PyPDF checks if self.inline_images is None.
This is what I mean:
for p in reader.pages:
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)
You could do this:
for p in reader.pages:
    p.inline_images = {}  # <-- that prevents `_get_inline_images` from being called in `_get_ids_image`
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)
Does that have a noticeable impact for you?
Regarding the first code snippet:
for p in reader.pages:
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)
This will still load all images during the list comprehension, as mentioned inside https://github.com/py-pdf/pypdf/issues/1987#issuecomment-1643930635
The second code snippet
for p in reader.pages:
    p.inline_images = {}  # <-- that prevents `_get_inline_images` from being called in `_get_ids_image`
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)
basically is the same as my initial code example and the workaround I am currently using, which relies on internal implementation details.
Does that have a noticeable impact for you?
Without doing any profiling again, the numbers in my initial report indicate that this is more than 10x faster. As I iterate over all images there directly, without the filtering in your list comprehension, there might be some smaller differences as well. The slow part is loading the inline images, which monkey-patching page.inline_images = {} prevents; I could verify that this has the most impact on performance in this case.
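To make the mechanism concrete, here is a toy model of the lazy check described above. It is only a sketch: `ToyPage`, `image_keys` and `_parse_inline_images` are made-up names for illustration and are not pypdf's implementation. The only thing taken from the thread is the idea that the inline images are parsed when `inline_images` is still None, and that pre-setting it to an empty dict skips the parse.

```python
class ToyPage:
    """Toy stand-in for a page; NOT pypdf's actual implementation."""

    def __init__(self, xobject_names, inline_count):
        self._xobject_names = list(xobject_names)
        self._inline_count = inline_count
        self.inline_images = None  # filled lazily on first key lookup
        self.parse_calls = 0  # counts how often the "expensive" step runs

    def _parse_inline_images(self):
        # Stands in for the expensive content-stream parse.
        self.parse_calls += 1
        return {f"~{i}~": b"" for i in range(self._inline_count)}

    def image_keys(self):
        # The check discussed above: parse only if nothing is cached yet.
        if self.inline_images is None:
            self.inline_images = self._parse_inline_images()
        return self._xobject_names + list(self.inline_images)


# Default behaviour: asking for keys triggers the parse.
page = ToyPage(["/Image185"], inline_count=3)
keys_default = page.image_keys()

# Workaround: pre-seed the cache so the parse never runs.
patched = ToyPage(["/Image185"], inline_count=3)
patched.inline_images = {}
keys_patched = patched.image_keys()

print(keys_default, page.parse_calls)
print(keys_patched, patched.parse_calls)
```

In this model the filter on "~" comes too late to save any time, because the cost is already paid when the keys are requested.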
@stefan6419846 Please provide a PDF so that we can compare the performance.
basically is the same as my initial code example
oops, sorry, didn't remember that
I agree with @pubpub-zz that I would like to see how significant the performance difference is. Instead of skipping some files, I'd rather try to improve the iteration code / defer decoding the images to the point where you actually want to read the data.
Can you tell me a bit about your use case? What kind of data do you have / what are the PDFs about and why can you skip inline-images but not other images?
@stefan6419846 What confuses me is that the images should not be extracted when asking for keys only. Also, Martin is right: the inline images should be so small that extracting them should be quick.
I have sent a reproducing document to @MartinThoma using the e-mail associated with his commits, as I could not find a publicly available version of the document.
Can you tell me a bit about your use case? What kind of data do you have / what are the PDFs about and why can you skip inline-images but not other images?
There are all sorts of PDFs generated from every kind of source which run through text extraction code to allow indexing/searching. Analyzing the dataset indicated that very small images tend not to provide any useful information (while just blocking resources for OCR), thus small images are being skipped. With inline images being 4 KB or less (https://www.verypdf.com/document/pdf-format-reference/pg_0352.htm), chances are high that they do not provide any useful information either and thus can be skipped in this case.
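As a rough sketch of the size-based skipping described above (the helper name `images_worth_ocr` and the exact cutoff are assumptions for illustration, not the code actually in use), the filter could look like this on plain (name, data) pairs:

```python
from typing import Iterable, List, Tuple

# Assumed cutoff: the 4 KB figure cited above. Tune it for your own dataset.
MIN_IMAGE_BYTES = 4 * 1024


def images_worth_ocr(
    images: Iterable[Tuple[str, bytes]],
    min_bytes: int = MIN_IMAGE_BYTES,
) -> List[Tuple[str, bytes]]:
    """Keep only images whose encoded size exceeds the cutoff."""
    return [(name, data) for name, data in images if len(data) > min_bytes]


# Dummy payloads standing in for real image bytes.
candidates = [
    ("Image185.png", b"\x00" * 525),
    ("scan.png", b"\x00" * 50_000),
]
kept_names = [name for name, _ in images_worth_ocr(candidates)]
print(kept_names)
```

With pypdf, the pairs would presumably come from the `name` and `data` attributes of each extracted image, which means the image has to be loaded before its size is known.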
I just managed to generate a version of the offending document/page which omits the personal data, so I can finally upload it here: table_redacted.pdf
@stefan6419846 I'm currently preparing a PR to speed up the generation of the key list, which should make my code proposal work well. The PR can already be tested; the only remaining problem is the mypy check.
@pubpub-zz I just tested your solution. While it works on the example file and I can see the desired speed improvements, the filtering fails when the key is a list (which is easy to mitigate on my side, but I just wanted to point this out):
[...]
/Image185
/Image185 ImageFile(name=Image185.png, data: 525 Byte)
['/Meta187', '/Image188']
Traceback (most recent call last):
  File "/home/stefan/temp/test.py", line 8, in <module>
    if key.startswith('~'):
AttributeError: 'list' object has no attribute 'startswith'
This might be similar to what mypy complains about with
pypdf/_page.py:505: error: Unsupported operand types for + ("List[Union[str, List[str]]]" and "List[str]") [operator]
as well.
@pubpub-zz I just tested your solution. While it works on the example file and I can see the desired speed improvements, the filtering fails when the key is a list (which is easy to mitigate on my side, but I just wanted to point this out):
I forgot the list output (when images are within XImage). Can you provide your solution for others?
This might be similar to what mypy complains about with
pypdf/_page.py:505: error: Unsupported operand types for + ("List[Union[str, List[str]]]" and "List[str]") [operator]
As said in the PR, the mypy issue is under analysis
This is my current code, given that inline image keys never are lists (source):
from pypdf import PdfReader

reader = PdfReader('file.pdf')
for page in reader.pages:
    for key in page.images.keys():
        print(key)
        if isinstance(key, str) and key.startswith('~'):
            continue
        image = page.images[key]
        print(key, image)
        image.image.save('out/' + image.name)
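For reuse, the same check can be pulled into a small helper that works on any sequence of keys, including the list-valued ones from the traceback earlier in the thread. This is a sketch only: `is_inline_image_key` and `non_inline_keys` are hypothetical names, and it relies on the same assumption that inline image keys are always plain strings starting with "~".

```python
from typing import Iterable, List, Union

# Keys are either a plain name or a list of names (nested images).
ImageKey = Union[str, List[str]]


def is_inline_image_key(key: ImageKey) -> bool:
    """True only for string keys that start with "~"."""
    return isinstance(key, str) and key.startswith("~")


def non_inline_keys(keys: Iterable[ImageKey]) -> List[ImageKey]:
    """Drop inline image keys; list-valued keys pass through untouched."""
    return [key for key in keys if not is_inline_image_key(key)]


# Sample keys shaped like the output shown in the thread.
sample_keys = ["/Image185", ["/Meta187", "/Image188"], "~0~"]
kept_keys = non_inline_keys(sample_keys)
print(kept_keys)
```

The isinstance check has to come first, otherwise a list key raises the AttributeError shown above.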
Explanation
I want to extract all images from a page, but omit inline images as they are not really useful in my case and just generate overhead (2 ms without and 29 s with inline images for one page with a dotted table which has 24643 inline images, but no "real" images).
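To put the quoted numbers into perspective, a quick back-of-the-envelope calculation follows. It assumes the whole difference between the two timings is caused by the inline images and is spread evenly across them, which is a simplification and not something that was profiled.

```python
# Figures quoted in the report above.
inline_image_count = 24643
seconds_with_inline = 29.0
seconds_without_inline = 0.002  # 2 ms

# Assumption: the extra time is due to inline images only, evenly distributed.
extra_seconds = seconds_with_inline - seconds_without_inline
per_inline_image_ms = extra_seconds / inline_image_count * 1000
slowdown_factor = seconds_with_inline / seconds_without_inline

print(f"{per_inline_image_ms:.2f} ms per inline image")
print(f"{slowdown_factor:.0f}x slower with inline images")
```

So each inline image is individually cheap, at a bit over a millisecond, and the total is dominated by the sheer number of them on this page.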
Code Example
For now, I am basically exploiting https://github.com/py-pdf/pypdf/blob/c2a741e968da1b4d7195a7924ba62a9445f825e4/pypdf/_page.py#L478 which does not seem to be a clean solution: