openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0
216 stars, 25 forks

Use libzim IndexData::getContent to provide curated content to index. #282

Open mgautierfr opened 1 year ago

mgautierfr commented 1 year ago

libzim provides a way for scrapers to supply content for indexing that differs from the content stored in the ZIM.

This allows better indexing when much of a page's content is not relevant to the actual subject of the page.

mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, user information, links to other questions...)

See comments in https://github.com/openzim/libzim/issues/653
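
To make the idea concrete, here is a minimal sketch of what "extract only the relevant information" could look like on the scraper side, using only Python's standard library. Everything in it is an assumption for illustration: the `SKIPPED_TAGS` list, the `RelevantTextExtractor` and `CuratedIndexData` names, and the sample page are hypothetical, and `CuratedIndexData` is only a stand-in that mirrors the idea of an index-data object with a `get_content` method; it is not the real libzim class and does not reflect sotoki's actual templates.

```python
from html.parser import HTMLParser

# Hypothetical list of elements whose text is assumed irrelevant for search.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class RelevantTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # how many skipped elements we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


class CuratedIndexData:
    """Stand-in for an index-data object (hypothetical, NOT the libzim class)."""

    def __init__(self, title, html):
        self.title = title
        self.html = html

    def get_title(self):
        return self.title

    def get_content(self):
        # Return the curated text to index instead of the full stored HTML.
        parser = RelevantTextExtractor()
        parser.feed(self.html)
        parser.close()
        return parser.text()


# Made-up sample page, loosely shaped like a Q&A page.
PAGE = """<html><body>
<nav><a href="/">Home</a> <a href="/tags">Tags</a></nav>
<h1>How do I exit Vim?</h1>
<div class="answer"><p>Type <code>:q!</code> and press Enter.</p></div>
<aside>Related: <a href="/q/2">How to save in Vim?</a></aside>
<footer>user123 asked 3 years ago</footer>
</body></html>"""

index_data = CuratedIndexData("How do I exit Vim?", PAGE)
content = index_data.get_content()
print(content)
```

In a real scraper the same extraction would be wired into whatever index-data hook the libzim binding exposes. Filtering by tag name is a deliberate simplification; class- or id-based selection would probably be needed for real page templates.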

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

@rgaudin @benoit74 Would this approach bring a real improvement here? If yes, which one?

rgaudin commented 12 months ago

I think the improvement would be marginal, because we don't include much non-content text in the HTML. A side effect would be that we'd parse all our output with an in-scraper HTML parser instead of letting libzim do it.