Open Galdanwing opened 3 years ago
I'm not sure if the issue is caused by the SVG images.
Can you share the PDF so that we can investigate?
Yup, sure. I just retried it with pdfminer.six-20220319 and the number of objects is the same. pdf_with_svg_image.pdf
Great example!
I used python tools/pdf2txt.py pdf_with_svg_image.pdf
and it runs for a long time, although there is no text.
I think the preferred solution here is similar to #455, being able to ignore the images altogether.
Hmm yes, so this looks like a duplicate then, if I understand that other ticket correctly? In which case this can be closed? Or do you envision something else being required here?
I think these are two distinct issues which coincidentally have the same solution. So I prefer to keep both issues open.
Hi, I've been working with some PDFs in PDFminer that cause very high memory usage when processed. This seems to be due to the number of objects created when analyzing pages with SVG images.
Steps to reproduce:
To me, 100k objects from a single figure seems obscene, and this runs into memory issues on my machine, especially when I try to scan multiple pages with SVGs.
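For anyone who wants to check the object count on their own files, here is a minimal sketch of one way to tally layout objects per page. It assumes the objects in question are the layout items yielded by pdfminer.six's extract_pages, and that container items (pages, figures, text boxes) are iterable over their children. The helper names count_layout_objects and count_objects_per_page are made up for this example.

```python
from collections import Counter
from collections.abc import Iterable


def count_layout_objects(node, counts=None):
    """Recursively tally a layout tree by class name (hypothetical helper)."""
    if counts is None:
        counts = Counter()
    counts[type(node).__name__] += 1
    # Containers are iterable over their children; leaf items are not.
    if isinstance(node, Iterable) and not isinstance(node, (str, bytes)):
        for child in node:
            count_layout_objects(child, counts)
    return counts


def count_objects_per_page(pdf_path):
    """Yield (page_number, Counter) for each page of the PDF."""
    # Imported lazily so the counter above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        yield page_number, count_layout_objects(page)
```

Running count_objects_per_page("pdf_with_svg_image.pdf") and printing each Counter's most_common() should show which object types dominate on a page, which helps confirm whether vector paths from the figure are the cause.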
My ideal solution would include limiting the number of objects that can be gathered from a single figure, but I'm not sure how feasible this is.
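As a rough illustration of what such a limit could look like, the sketch below caps the number of vector paths accepted per figure. This is not an existing pdfminer.six option. It assumes the excess objects are vector paths that reach the layout analyzer through its paint_path method, and that figures are bracketed by begin_figure and end_figure calls, as in pdfminer.six's PDFPageAggregator. PathBudgetMixin, max_paths_per_figure, and make_capped_aggregator are hypothetical names.

```python
class PathBudgetMixin:
    """Drop vector paths once a per-figure budget is used up (hypothetical)."""

    max_paths_per_figure = 1000  # assumed default, tune as needed

    def begin_figure(self, *args, **kwargs):
        # One counter per open figure, so nested figures get their own budget.
        self._path_counts = getattr(self, "_path_counts", [])
        self._path_counts.append(0)
        super().begin_figure(*args, **kwargs)

    def end_figure(self, *args, **kwargs):
        super().end_figure(*args, **kwargs)
        if getattr(self, "_path_counts", None):
            self._path_counts.pop()

    def paint_path(self, *args, **kwargs):
        counts = getattr(self, "_path_counts", None)
        if counts:  # only enforce the budget while inside a figure
            if counts[-1] >= self.max_paths_per_figure:
                return  # budget exhausted: silently skip this path
            counts[-1] += 1
        super().paint_path(*args, **kwargs)


def make_capped_aggregator(rsrcmgr, laparams=None, max_paths=1000):
    """Build a PDFPageAggregator that applies the budget (hypothetical)."""
    # Imported lazily so the mixin can be tested without pdfminer.six.
    from pdfminer.converter import PDFPageAggregator

    class CappedAggregator(PathBudgetMixin, PDFPageAggregator):
        max_paths_per_figure = max_paths

    return CappedAggregator(rsrcmgr, laparams=laparams)
```

The returned device would be passed to a PDFPageInterpreter in place of a plain PDFPageAggregator. The count is approximate, since the analyzer may call paint_path again for each subpath. Setting max_paths to 0 drops every path inside figures.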
Any other options? Or maybe a pointer to what code causes this? I could try and see if I could add support myself and create a PR or fork if necessary.