openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Use libzim `IndexData::getContent` to provide curated content to index. #1810

Closed mgautierfr closed 1 year ago

mgautierfr commented 1 year ago

libzim provides a way for scrapers to supply content for indexing that differs from the content actually stored.

This allows better indexing when much of the stored content is not relevant to the subject of the entry itself.

mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, examples, ...).
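
To make the extraction step concrete, here is a minimal sketch of what it could look like. It is not mwoffliner code: `extractIndexableText` and `SKIPPED_TAGS` are hypothetical names, the list of tags to drop is only an assumption, and the regex approach assumes the dropped elements are never nested inside themselves. A real implementation would more likely walk the parsed DOM.

```typescript
// Hypothetical helper, not part of mwoffliner: derive the text to index from
// an article's HTML by dropping elements that say little about its subject.
// Assumption: the listed tags are never nested inside themselves, so a
// non-greedy regex is enough for this sketch. Real code should use a DOM parser.
const SKIPPED_TAGS = ['script', 'style', 'nav', 'footer'];

function extractIndexableText(html: string): string {
  let cleaned = html;
  for (const tag of SKIPPED_TAGS) {
    // Remove the whole element, content included.
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, 'gi');
    cleaned = cleaned.replace(element, ' ');
  }
  return cleaned
    .replace(/<[^>]+>/g, ' ') // strip the remaining tags, keep their text
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim();
}

const sample =
  '<nav><a href="/">Home</a></nav>' +
  '<p>Gravity is a fundamental interaction.</p>' +
  '<footer>Retrieved from example</footer>';

console.log(extractIndexableText(sample));
```

The sketch does not decode HTML entities, and it selects by tag name only; dropping things like reference lists would need class or id based selection on the actual MediaWiki markup.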

See comments in https://github.com/openzim/libzim/issues/653

kelson42 commented 1 year ago

@mgautierfr Can you please explain:

Jaifroid commented 1 year ago

See also #1725, which looks similar. Note comments there.

mgautierfr commented 1 year ago

The idea is that we want to index content that differs from what we store. Some content does not have to be indexed. Other content cannot be indexed as is (a video, for example), and we want to provide a textual description (from subtitles?) so that it is indexed anyway.
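
As a rough illustration of keeping stored content and indexed content separate, the sketch below models an entry that can carry its own text for the indexer. The `IndexData` interface here only mirrors the idea behind libzim's `IndexData::getContent`; it is not the libzim or node-libzim API, and `Entry`, `indexText` and `buildIndexData` are hypothetical names.

```typescript
// Hypothetical types for illustration only; not the libzim or node-libzim API.
interface IndexData {
  getTitle(): string;
  getContent(): string; // text handed to the indexer instead of the stored content
}

interface Entry {
  path: string;
  title: string;
  mimeType: string;
  // Curated text to index, e.g. cleaned article text or video subtitles.
  // Left undefined when the scraper has nothing better than the stored content.
  indexText?: string;
}

// Returns custom index data when the entry provides curated text, or null to
// signal that the caller should fall back to its default indexing behaviour.
function buildIndexData(entry: Entry): IndexData | null {
  if (entry.indexText === undefined) {
    return null;
  }
  const content = entry.indexText;
  return {
    getTitle: () => entry.title,
    getContent: () => content,
  };
}

// A video cannot be indexed from its bytes, so subtitles stand in for it.
const video: Entry = {
  path: 'videos/lecture.webm',
  title: 'Lecture on gravity',
  mimeType: 'video/webm',
  indexText: 'Today we look at how gravity bends light.',
};

console.log(buildIndexData(video)?.getContent());
```

Returning `null` for entries without curated text is just one possible convention; the point is only that what gets indexed is decided independently of what gets stored.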

The problem is less visible than I expected in mwoffliner, as we use the mobile version, which doesn't include all the menus and sidebars.

But I have found this one: https://library.kiwix.org/viewer#search?books.name=wikipedia_en_physics_maxi_2023-02&pattern=gazette

The results are not about gazettes. But because their references come from gazettes, the articles seem relevant to Xapian.

kelson42 commented 1 year ago

@mgautierfr I fully agree; I just see no direct relation to https://github.com/openzim/libzim/issues/653. Depends on #1576

kelson42 commented 1 year ago

@Jaifroid Thank you for the reminder about #1725; it had actually dropped off my radar. It is indeed a duplicate of this one. We agree on the potential for improvement and on the approach.