useblocks / libpdf

Extract structured data from PDFs
MIT License

fix dependencies from pillow branch and fix processing of pdf files #24

Closed kreuzberger closed 8 months ago

kreuzberger commented 9 months ago

I tried to take over the fixes from the mh-update-pillow branch and ran into issues when using it with simplepdf/pillow/poetry.

I tried to fix the issue and integrated libpdf into a test framework to check a sphinx-simplepdf file.

REVIEW required for the bugfix in catalog.py (replaced the failing resolve() call on a dict with a resolve1 call). Maybe this breaks things I currently do not know about.
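For reviewers, here is a rough sketch of why a bare resolve() call can fail on a dict while a resolve1-style helper does not. It uses stand-in names only: `FakeRef` and `resolve_any` are hypothetical and are not taken from libpdf or pdfminer. The assumption is that the catalog value is sometimes an indirect reference and sometimes already a plain dict.

```python
class FakeRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        # Return whatever this reference points at (may be another reference).
        return self.target


def resolve_any(value):
    """Follow references until a plain value is reached.

    Plain values (e.g. a dict) are passed through unchanged, which is the
    behaviour assumed here for a resolve1-style helper.
    """
    while isinstance(value, FakeRef):
        value = value.resolve()
    return value


direct = {"Title": "Intro"}                      # already a plain dict
indirect = FakeRef(FakeRef({"Title": "Intro"}))  # reference to a reference

# Calling direct.resolve() would raise AttributeError: dict has no resolve().
print(resolve_any(direct))
print(resolve_any(indirect))
```

If that assumption holds, the helper behaves the same for both input shapes, so the change should only matter where the old call failed. This still needs confirmation against the real catalog data.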

juiwenchen commented 8 months ago

Hi @kreuzberger, I guess you only want to introduce the rect model into libpdf. We would actually like to remove the fix from mh-update-pillow from this PR in order to keep the change as atomic as possible. Without this fix, rect extraction should still work fine.

Meanwhile, we are working on our CI, so please try not to commit changes that only reformat the code. Thanks for your contribution.

kreuzberger commented 8 months ago

@juiwenchen I am not sure I understand you correctly. I will leave this PR as it is, without further commits. I promise :smile: It is also OK if you do NOT merge or integrate this PR because things got mixed up, and instead provide a solution through several other PRs. The main intention of my work is:

Please be aware: as mentioned in the issue, the tests for figures don't work (all of them). This has to be investigated (I do not know much about the PDFs and the tests) and maybe adapted to the above-mentioned strategy.

See also https://github.com/useblocks/sphinx-simplepdf/issues/83