Open mgautierfr opened 1 year ago
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@rgaudin @benoit74 Would this approach bring a real improvement here? If yes, which one?
Improvement would be marginal I think because we don't include much non-content text in the HTML. A side effect would be parsing all our output using an in-scraper HTML parser versus letting libzim do it.
libzim provides a way for scrapers to supply content for indexing that differs from the content actually stored.
This allows better indexing when much of the stored content is not relevant to the subject of the entry itself.
mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, user information, links to other questions...)
See comments in https://github.com/openzim/libzim/issues/653
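To make the proposal concrete, here is a minimal sketch of the extraction step only, written in Python with the standard library for brevity (mwoffliner itself is not written in Python). The set of skipped tags and the names `IndexTextExtractor` and `extract_index_text` are assumptions for illustration, not part of mwoffliner or libzim, and handing the result to libzim as the indexing content is not shown.

```python
from html.parser import HTMLParser

# Hypothetical list of elements treated as non-content; a real scraper
# would tune this (and probably match on classes/ids) per site.
SKIPPED_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class IndexTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_index_text(html):
    """Return the plain text to index, with menus, footers, etc. removed."""
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This assumes reasonably well-formed HTML (skipped elements are properly closed). It also illustrates the cost raised above: every page has to go through an extra parse in the scraper.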