openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
20 stars 18 forks source link

Add indexdata + automatic indexing of PDF items #182

Closed benoit74 closed 3 months ago

benoit74 commented 4 months ago

Fix #167 Fix #168

Edited description

Changes:

Former description and points to discuss

Changes:

Open points to discuss:

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (1eddabc) to head (c91646f).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #182 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 32 33 +1 Lines 1452 1531 +79 Branches 251 273 +22 ========================================= + Hits 1452 1531 +79 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

benoit74 commented 4 months ago

Some files like https://irp.fas.org/doddir/milmed/milderm.pdf are raising "MuPDF error: format error: cmsOpenProfileFromMem failed" error. Looks like it could be fixed since it is an ICC profile issue (for which we do not care): https://github.com/pymupdf/PyMuPDF/discussions/3572. I will fix this.

benoit74 commented 4 months ago

Fix is different than expected, but at least it is working, PR is again ready for review

benoit74 commented 3 months ago

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions, which is quite important (item title is not used for suggestions when index data is passed)

And I also modified add_item_for since this is quite heavily used in scrapers.

Other than that, I think the change will please you.

rgaudin commented 3 months ago

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

There are a couple of unresolved discussions…

benoit74 commented 3 months ago

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

Then I get what you meant, and I agree the extra import is not very lean

benoit74 commented 3 months ago

I finally decided to keep using index_data in add_item_for and StaticItem because it is a convenient way to force user to pass both title and content should he decide to customize index_data and to detect when this is not done with pyright. Otherwise one might be tempted to pass only an index_title or only an index_content and this is not what we want.