Hi there,
I was having an issue with pymupdf4llm: some of the images were not being extracted from my PDF. At first I thought it was a problem with the margins settings, and I tried fiddling with those to no avail. Then I stepped into the code with the debugger and finally found the cause.
The inner function "save_image" in function "to_markdown" (at line 332) checks the size of each image and discards every image smaller than 5% of the page size. While this is probably the desired behavior in most cases, it is not in mine: I was missing important information contained in those small images.
If I may, I'd like to suggest (yet another) parameter for to_markdown that allows overriding this threshold. Something like:
to_markdown(......., ignore_images_smaller_than=0.05)
so that the caller can pass 0.0 to retain everything.
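To illustrate the behavior I have in mind, here is a minimal standalone sketch. It is not the library's actual code: the helper name keep_image is hypothetical, and I am assuming the check compares the image's width and height against the same fraction of the page's width and height (the real check in "save_image" may be implemented differently).

```python
def keep_image(img_width, img_height, page_width, page_height,
               ignore_images_smaller_than=0.05):
    """Return True if the image should be kept.

    Hypothetical helper, not part of pymupdf4llm. Assumes the threshold
    is a fraction of the page width and of the page height.
    """
    if not 0.0 <= ignore_images_smaller_than < 1.0:
        raise ValueError("ignore_images_smaller_than must be in [0.0, 1.0)")

    # Minimum dimensions an image needs in order to be kept.
    min_width = page_width * ignore_images_smaller_than
    min_height = page_height * ignore_images_smaller_than

    # With a threshold of 0.0 both minimums are 0, so nothing is discarded.
    return img_width >= min_width and img_height >= min_height


# A 10 x 10 image on a 600 x 800 page: dropped by default, kept with 0.0.
small_default = keep_image(10, 10, 600, 800)
small_override = keep_image(10, 10, 600, 800, ignore_images_smaller_than=0.0)
```

Keeping 0.05 as the default would leave the current behavior unchanged for existing callers, while passing 0.0 would opt out of the filter entirely.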
For the time being I've modified the code myself, because I really do need to keep those images.
Thank you for your great package!