py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Provide public interface for skipping inline page images #1987

Closed stefan6419846 closed 1 year ago

stefan6419846 commented 1 year ago

Explanation

I want to extract all images from a page, but omit inline images as they are not really useful in my case and just generate overhead (2 ms without and 29 s with inline images for one page with a dotted table which has 24643 inline images, but no "real" images).
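A timing like the one quoted above could be reproduced roughly as follows. This is only a sketch: `time_call` and `count_images` are hypothetical helper names, not part of pypdf, and the measured numbers will depend on the document and machine.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_call(func: Callable[[], T]) -> Tuple[T, float]:
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def count_images(path: str) -> int:
    """Iterate over all images of all pages, which forces them to be loaded."""
    # Imported lazily so that time_call stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return sum(1 for page in reader.pages for _ in page.images)
```

Calling `time_call(lambda: count_images("file.pdf"))` once with and once without the workaround shown below would give the two numbers to compare; `"file.pdf"` is a placeholder path.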

Code Example

For now, I am basically exploiting https://github.com/py-pdf/pypdf/blob/c2a741e968da1b4d7195a7924ba62a9445f825e4/pypdf/_page.py#L478 which does not seem to be a clean solution:

from pypdf import PdfReader

path = "file.pdf"  # Placeholder: path to the PDF file to process.
reader = PdfReader(path)
for page in reader.pages:
    page.inline_images = dict()  # Avoid loading inline images.
    for image in page.images:
        print(image)

pubpub-zz commented 1 year ago

@stefan6419846 Inline image names (not including the extension) start and end with "~". You should be able to `continue` in your loop when the name starts with "~".

stefan6419846 commented 1 year ago

I am aware of that, but given that this reduces performance quite a lot just to throw these items away afterwards, a dedicated option might still make sense here.

pubpub-zz commented 1 year ago

For performance, first try to extract the keys with images.keys() and then extract the images.

stefan6419846 commented 1 year ago

Could you please elaborate on this? Calling page.images.keys() is the same as calling page._get_ids_image(), which is the expensive method I mentioned above, where pypdf checks if self.inline_images is None.

pubpub-zz commented 1 year ago

This is what I mean

for p in reader.pages:
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)

MartinThoma commented 1 year ago

You could do this:

for p in reader.pages:
    p.inline_images = {}  # <-- that prevents `_get_inline_images` from being called in `_get_ids_image`
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)

Does that have a noticeable impact for you?

stefan6419846 commented 1 year ago

Regarding the first code snippet:

for p in reader.pages:
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)

This will still load all images during the list comprehension, as mentioned inside https://github.com/py-pdf/pypdf/issues/1987#issuecomment-1643930635

The second code snippet

for p in reader.pages:
    p.inline_images = {}  # <-- that prevents `_get_inline_images` from being called in `_get_ids_image`
    lst = [k for k in p.images.keys() if not k.startswith("~")]
    for n in lst:
        print(p.images[n].name)

basically is the same as my initial code example and the workaround I am currently using, which relies on internal implementation details.

Does that have a noticeable impact for you?

Without profiling again, the numbers in my initial report indicate that this is more than 10x faster. As I iterate over all images directly there, without the filtering from your list comprehension, there might be some smaller differences as well. The slow part is loading the inline images, which monkey-patching page.inline_images = {} prevents; I could verify that this has the most impact on performance in this case.

pubpub-zz commented 1 year ago

@stefan6419846 Please provide a PDF so we can compare performance.

MartinThoma commented 1 year ago

basically is the same as my initial code example

oops, sorry, didn't remember that

I agree with @pubpub-zz that I would like to see how significant the performance difference is. Instead of skipping some files, I'd rather try to improve the iteration code / defer decoding the images to the point where you actually want to read the data.

Can you tell me a bit about your use case? What kind of data do you have / what are the PDFs about and why can you skip inline-images but not other images?

pubpub-zz commented 1 year ago

@stefan6419846 What confuses me is that the images should not be extracted when asking for the keys only. Also, Martin is right: the inline images should be so small that extracting them should be quick.

stefan6419846 commented 1 year ago

I have sent a reproducing document to @MartinThoma using the e-mail associated with his commits, as I could not find a publicly available version of the document.

Can you tell me a bit about your use case? What kind of data do you have / what are the PDFs about and why can you skip inline-images but not other images?

There are all sorts of PDFs generated from every kind of source which run through text extraction code to allow indexing/searching. Analyzing the dataset indicated that very small images tend to not provide any useful information (while just blocking resources for OCR), thus small images are being skipped. With inline images being 4 KB or less (https://www.verypdf.com/document/pdf-format-reference/pg_0352.htm), chances are high that they do not provide any useful information either and can thus be skipped in this case.
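The size-based skipping described above could look roughly like this. It is a sketch under assumptions: `SMALL_IMAGE_LIMIT` and `skip_small_images` are hypothetical names, the 4096-byte cut-off is simply borrowed from the inline image size limit mentioned above (the actual threshold used is not stated), and the only thing required of each image object is a `data` attribute holding its bytes.

```python
from typing import Any, Iterable, Iterator

# Assumed cut-off in bytes; borrowed from the 4 KB inline image limit.
SMALL_IMAGE_LIMIT = 4 * 1024


def skip_small_images(
    images: Iterable[Any], limit: int = SMALL_IMAGE_LIMIT
) -> Iterator[Any]:
    """Yield only images whose raw data is larger than `limit` bytes."""
    for image in images:
        # Each image is expected to expose its bytes as `image.data`.
        if len(image.data) > limit:
            yield image
```

Note that a filter like this still has to load each image before it can check the size, which is why it does not help with the performance problem reported here.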

stefan6419846 commented 1 year ago

I just managed to generate a version of the offending document/page which omits the personal data, so I can finally upload it here: table_redacted.pdf

pubpub-zz commented 1 year ago

@stefan6419846 I'm currently preparing a PR to accelerate the generation of the key list, which should make my code proposal work well. The PR can already be tested; the only remaining problem is getting the mypy check to pass.

stefan6419846 commented 1 year ago

@pubpub-zz I just tested your solution. While it works on the example file and I can see the desired speed improvements, the filtering fails when the key is a list (which is easy to mitigate on my side, but I just wanted to point this out):

[...]
/Image185
/Image185 ImageFile(name=Image185.png, data: 525 Byte)
['/Meta187', '/Image188']
Traceback (most recent call last):
  File "/home/stefan/temp/test.py", line 8, in <module>
    if key.startswith('~'):
AttributeError: 'list' object has no attribute 'startswith'

This might be similar to what mypy complains about with

pypdf/_page.py:505: error: Unsupported operand types for + ("List[Union[str, List[str]]]" and "List[str]") [operator]

as well.

pubpub-zz commented 1 year ago

@pubpub-zz I just tested your solution. While it works on the example file and I can see the desired speed improvements, the filtering fails when the key is a list (which is easy to mitigate on my side, but I just wanted to point this out):

I forgot the list output (when images are within XImage). Can you provide your solution for others?

This might be similar to what mypy complains about with

pypdf/_page.py:505: error: Unsupported operand types for + ("List[Union[str, List[str]]]" and "List[str]") [operator]

As said in the PR, the mypy issue is under analysis

stefan6419846 commented 1 year ago

This is my current code, given that inline image keys are never lists (source):

from pypdf import PdfReader

reader = PdfReader('file.pdf')
for page in reader.pages:
    for key in page.images.keys():
        print(key)
        if isinstance(key, str) and key.startswith('~'):
            continue
        image = page.images[key]
        print(key, image)
        image.image.save('out/' + image.name)
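The same filtering could also be factored into a small reusable helper. This is only a sketch: `is_inline_image_key` and `non_inline_keys` are hypothetical names, not part of pypdf, and it relies on the two observations from this thread that inline image keys start with "~" and are never lists.

```python
from typing import Iterable, Iterator, List, Union

# Keys are either plain names or lists of names (as seen in the traceback above).
ImageKey = Union[str, List[str]]


def is_inline_image_key(key: ImageKey) -> bool:
    """Return True for keys that look like inline image keys."""
    # List keys are never inline images, so only strings are checked.
    return isinstance(key, str) and key.startswith("~")


def non_inline_keys(keys: Iterable[ImageKey]) -> Iterator[ImageKey]:
    """Lazily yield all keys that do not belong to inline images."""
    return (key for key in keys if not is_inline_image_key(key))
```

With this, the loop above would become `for key in non_inline_keys(page.images.keys()):` followed by `image = page.images[key]`.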