pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Option to filter out SVG images #685

Open Galdanwing opened 3 years ago

Galdanwing commented 3 years ago

Hi, I've been working with some PDFs in pdfminer that cause very high memory usage when processed. This seems to be due to the number of objects created when analyzing pages with SVG images.

Steps to reproduce:

>>> from pdfminer.high_level import extract_pages
>>> for page in extract_pages("/home/antoine.local/Downloads/pdf_with_svg_image.pdf"):
...   print(len(page._objs))
... 
100029

To me, 100k objects from a single figure seems obscene, and it causes memory issues on my machine, especially when I try to scan multiple pages with SVGs.
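A rough way to check which object types make up that count is to tally the layout tree by class name. This is only a sketch: count_layout_types and report are hypothetical helper names, and it assumes containers such as LTPage and LTFigure can be iterated over their children while leaf objects such as LTCurve cannot.

```python
from collections import Counter


def count_layout_types(obj, counts=None):
    """Recursively tally layout objects by class name (hypothetical helper)."""
    if counts is None:
        counts = Counter()
    counts[type(obj).__name__] += 1
    # Containers (e.g. LTPage, LTFigure) are iterable over their children;
    # leaf objects (e.g. LTCurve, LTChar) are not, so recursion stops there.
    try:
        children = iter(obj)
    except TypeError:
        return counts
    for child in children:
        count_layout_types(child, counts)
    return counts


def report(pdf_path):
    """Print the five most common layout object types on each page."""
    # Imported here so the counting helper can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page in extract_pages(pdf_path):
        print(page.pageid, count_layout_types(page).most_common(5))
```

Calling report("pdf_with_svg_image.pdf") would show whether the bulk of the objects are curves, lines, or something else, which narrows down where a limit or filter would have to be applied.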

My ideal solution would be to limit the number of objects that can be gathered from a single figure, but I'm not sure how feasible this is. For example, with a hypothetical max_figure_object_amount parameter:

>>> from pdfminer.high_level import extract_pages
>>> for page in extract_pages("/home/antoine.local/Downloads/pdf_with_svg_image.pdf", max_figure_object_amount=1000):
...   print(len(page._objs))
... 
1000

Any other options? Or maybe a pointer to the code that causes this? I could try to add support myself and create a PR or fork if necessary.

pietermarsman commented 2 years ago

I'm not sure if the issue is caused by the svg images.

Can you share the PDF so that we can investigate?

Galdanwing commented 2 years ago

Yup, sure. I just retried it with pdfminer.six-20220319 and the number of objects is the same. Attached: pdf_with_svg_image.pdf

pietermarsman commented 2 years ago

Great example!

I ran python tools/pdf2txt.py pdf_with_svg_image.pdf and it runs for a long time, although there is no text to extract.

I think the preferred solution here is similar to #455, being able to ignore the images altogether.
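Until such an option exists, one possible workaround is a custom aggregator that discards vector paths before they become layout objects. The sketch below assumes that the objects in question come from path painting (LTCurve, LTLine, LTRect), which this thread has not confirmed. SkipPathsMixin and extract_pages_without_paths are hypothetical names, not part of pdfminer.six. The path operators are still parsed, so this would reduce the object count and memory use but not necessarily the full run time.

```python
class SkipPathsMixin:
    """Ignore painted paths so no LTCurve/LTLine/LTRect objects are created."""

    def paint_path(self, gstate, stroke, fill, evenodd, path):
        # Intentionally do nothing: the path is dropped instead of being
        # converted into a layout object and added to the current container.
        return None


def extract_pages_without_paths(pdf_path):
    """Yield one LTPage per page, with vector paths left out (hypothetical)."""
    # Imported here so the mixin can be defined without pdfminer.six installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    class PathlessAggregator(SkipPathsMixin, PDFPageAggregator):
        pass

    rsrcmgr = PDFResourceManager()
    device = PathlessAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()
```

It can be used like extract_pages, for example: for page in extract_pages_without_paths("pdf_with_svg_image.pdf"): print(len(page)).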

Galdanwing commented 2 years ago

Hmm yes, so this looks like a duplicate then, if I understand that other ticket correctly? In which case this can be closed? Or do you envision something else being required here?

pietermarsman commented 2 years ago

I think these are two distinct issues which coincidentally have the same solution. So I prefer to keep both issues open.