Add indexdata + automatic indexing of PDF items

benoit74 commented 4 months ago

Fix #167 Fix #168

Edited description

Changes:

- create helper class IndexData to hold indexing data (title, content, keywords) before passing it to libzim
- add support for automated indexing of PDFs in python-scraperlib, building an IndexData object automatically
  - it relies on PyMuPDF for now to extract PDF metadata and content
  - the IndexData title is built from the PDF title + author + subject, separated by dashes
  - the IndexData content is simply the text of all pages, concatenated with a newline between pages
- add support for index_data: IndexData | None and auto_index: bool | None to customize indexing in StaticItem and add_item_for:
  - pass a custom index_data from the caller for customized indexing
  - set auto_index to False to disable indexing (both in python-scraperlib and libzim)
  - otherwise, indexing is automated in python-scraperlib for PDF documents (for now) or in libzim (for others, text or html for now)
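
To make the three cases above concrete, here is a minimal sketch of how the choice could be resolved. It is not the code from this PR: the IndexData dataclass is a stand-in, resolve_indexing is a hypothetical helper, and letting a custom index_data win over auto_index=False is an assumption the description does not settle.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexData:
    """Stand-in for the PR's helper class (not the library's actual class)."""

    title: str
    content: str
    keywords: str = ""


def resolve_indexing(
    mimetype: str,
    index_data: IndexData | None = None,
    auto_index: bool | None = None,
) -> str:
    """Return who is responsible for indexing an item (hypothetical helper)."""
    if index_data is not None:
        # Caller passed custom data. Assumption: this wins over auto_index.
        return "custom"
    if auto_index is False:
        # Indexing disabled in both python-scraperlib and libzim.
        return "disabled"
    if mimetype == "application/pdf":
        # python-scraperlib builds the IndexData itself for PDFs.
        return "scraperlib"
    # Everything else is left to libzim (which handles text or html for now).
    return "libzim"


print(resolve_indexing("application/pdf"))
print(resolve_indexing("text/html", auto_index=False))
```

Returning a plain label keeps the sketch testable without libzim; the real item classes would act on the decision instead of reporting it.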

Former description and points to discuss

Changes:

- add new IndexingItem class capable of customizing index data, either from data passed by the scraper or automatically from PDF content
  - this uses a new IndexData class holding the index data
  - for PDF, it relies on PyMuPDF for now to extract PDF metadata and content
  - the item title is built from the PDF title + author + subject, separated by dashes
  - the item content is simply the text of all pages, concatenated with a newline between pages
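
The title and content rules for PDFs can be sketched as below. This is an illustration under assumptions, not the PR's implementation: IndexData is a stand-in dataclass, pdf_index_data is a hypothetical helper, the " - " separator and the skipping of empty metadata fields are guesses. With PyMuPDF, the inputs would presumably come from the document's metadata dict and each page's get_text(); here they are passed in directly so the sketch runs without the dependency.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexData:
    """Stand-in for the PR's helper class (not the library's actual class)."""

    title: str
    content: str
    keywords: str = ""


def pdf_index_data(
    metadata: dict[str, str | None], page_texts: list[str]
) -> IndexData:
    """Build index data from PDF metadata and per-page text (hypothetical)."""
    # Title: title + author + subject, joined by dashes.
    # Assumption: missing or empty fields are skipped.
    parts = [metadata.get(key) for key in ("title", "author", "subject")]
    title = " - ".join(part for part in parts if part)
    # Content: all pages concatenated, one newline between pages.
    content = "\n".join(page_texts)
    return IndexData(title=title, content=content)


data = pdf_index_data(
    {"title": "Field Guide", "author": "J. Doe", "subject": None},
    ["Text of page one", "Text of page two"],
)
print(data.title)
print(data.content)
```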

Open points to discuss:

- do we really need this new IndexingItem class or should we simply embed all this logic in StaticItem?
- if we keep the separate class, do we need a new add_indexing_item_for, similar to add_item_for? Or just enrich add_item_for with new arguments?

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (1eddabc) to head (c91646f).

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main      #182   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           32        33    +1
  Lines         1452      1531   +79
  Branches       251       273   +22
=========================================
+ Hits          1452      1531   +79
```


benoit74 commented 4 months ago

Some files, such as https://irp.fas.org/doddir/milmed/milderm.pdf, raise a "MuPDF error: format error: cmsOpenProfileFromMem failed" error. It looks fixable, since it is an ICC profile issue (which we do not care about): https://github.com/pymupdf/PyMuPDF/discussions/3572. I will fix this.

benoit74 commented 4 months ago

The fix is different from what I expected, but at least it works. The PR is ready for review again.

benoit74 commented 3 months ago

I did not pass index_content: str | None = None but index_data: IndexData | None, since the latter also allows setting the title used for suggestions, which is quite important (the item title is not used for suggestions when index data is passed).

And I also modified add_item_for since this is quite heavily used in scrapers.

Other than that, I think the change will please you.

rgaudin commented 3 months ago

> I did not pass index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

There are a couple of unresolved discussions…

benoit74 commented 3 months ago

> I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

Then I get what you meant, and I agree the extra import is not very lean.

benoit74 commented 3 months ago

I finally decided to keep using index_data in add_item_for and StaticItem. It is a convenient way to force users to pass both title and content when they decide to customize indexing, and it lets pyright detect when they do not. Otherwise one might be tempted to pass only an index_title or only an index_content, and this is not what we want.
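
A small sketch of the trade-off, for the record. Names here are illustrative only: IndexData is a stand-in dataclass, and add_item_loose is a hypothetical version of the two-argument alternative, not anything in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexData:
    """Stand-in for the PR's helper class; title and content are both required."""

    title: str
    content: str
    keywords: str = ""


def add_item_loose(
    path: str,
    index_title: str | None = None,
    index_content: str | None = None,
) -> tuple[str | None, str | None]:
    """Hypothetical alternative with two independent optional arguments."""
    # Nothing stops a caller from customizing only one of the two values.
    return (index_title, index_content)


def half_customized_rejected() -> bool:
    """Check that building IndexData with only a title fails."""
    try:
        # A type checker such as pyright reports the missing "content"
        # argument on this line; at runtime it raises TypeError.
        IndexData(title="Only a title")  # type: ignore[call-arg]
    except TypeError:
        return True
    return False


# The loose signature silently accepts a title without content.
print(add_item_loose("doc.pdf", index_title="Only a title"))
# The single-object approach rejects the same mistake.
print(half_customized_rejected())
```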

openzim / python-scraperlib