Open kelson42 opened 7 years ago
I agree with the principle of adding meta-tag information to opt parts of the HTML text out of indexing. I'm not sure this is the right thing to do for all the examples you have given, but it definitely sounds like a good approach. Give it a try!
From @automactic on June 19, 2016 20:48
What code should I modify to add comments to html strings? Also, do you think adding comments to html string will increase the size of zim files?
I've implemented a few things in commits https://github.com/openzim/zimwriterfs/commit/79921c8efa09eae9bb7f6a69edfcfb588ad63589 and https://github.com/openzim/zimwriterfs/commit/7302f4730a01af298f6cc9eba97b7a2df9d93895 (not merged). At indexing time, it tries to remove spans containing a reference or backlink.
However, it seems that the HTML span class names change depending on the language of the article. For example, in English it is mw-cite-backlink and in French it is reference-text.
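To illustrate the class-name problem, here is a minimal Python sketch (not the code from the commits above, which live in zimwriterfs) of dropping the text inside elements whose class is in a configurable skip set. The names `SKIP_CLASSES`, `IndexTextExtractor` and `text_for_index` are hypothetical, the skip set contains only the two classes mentioned in this thread, and the sketch assumes reasonably well-formed HTML where every non-void element is closed.

```python
from html.parser import HTMLParser

# Assumption: only the two class names mentioned in this thread.
# A real list would need an entry for every wiki language/skin.
SKIP_CLASSES = {"mw-cite-backlink", "reference-text"}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexTextExtractor(HTMLParser):
    """Collects text, skipping everything nested inside a skipped element."""

    def __init__(self, skip_classes):
        super().__init__(convert_charrefs=True)
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a skipped one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def text_for_index(html, skip_classes=SKIP_CLASSES):
    """Return whitespace-normalised text with skipped elements removed."""
    parser = IndexTextExtractor(skip_classes)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

The weak point is exactly the one described above: the skip set has to be maintained per language, which is why tagging the content on the scraper side looks more robust.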
It seems pretty complicated to implement this correctly in libzim or zimwriterfs.
Maybe mwoffliner should be the one to parse the HTML and tag which content should be indexed and which should not.
Not sure if this is still relevant, but having stumbled across this issue while looking for something else, I'd just like to point out the value of indexing references (footnotes/endnotes) from an academic perspective. If I am searching for information about some obscure historical figure, it would be very valuable to be able to find quickly a bibliographical resource on that person, say, in an article that might be about some other event I would never have thought to look under.
libzim now provides an IndexData interface that allows the scraper to supply the data it wants indexed. Either we can close this issue or move it to the scraper side (mwoffliner?).
@mgautierfr Mostly Wikipedia/Mediawiki stuff. Moving to openzim/mwoffliner.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
From @automactic on June 14, 2016 21:36
Problem:
In the current xapian indexing process, the article content extracted by omega contains a lot of useless info, such as the reference section, the legal footnote and the inline references.
Desired Output:
A clean string of article content, without
Example:
The "apple juice" article in wikipedia_en_simple_all_2016-05.zim. Here is the info extracted by omega html parser and passed to xapian for indexing:
Possible Solution:
Add UdmComment markup to comment out parts of the HTML, so the omega HTML parser can ignore them. (source)
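A rough sketch of what this could look like, in Python for illustration only. The marker strings are the `<!--UdmComment-->` / `<!--/UdmComment-->` pair that the omega HTML parser recognises; the helper names `wrap_noindex` and `strip_noindex` are hypothetical, and `strip_noindex` is only a crude regex stand-in for what the parser does, not omega's actual implementation.

```python
import re

# Marker comments understood by omega's HTML parser.
UDM_OPEN = "<!--UdmComment-->"
UDM_CLOSE = "<!--/UdmComment-->"

# Non-greedy match so that several marked regions are handled independently.
_UDM_BLOCK = re.compile(
    re.escape(UDM_OPEN) + r".*?" + re.escape(UDM_CLOSE), re.DOTALL
)


def wrap_noindex(fragment: str) -> str:
    """Mark an HTML fragment as 'do not index' (hypothetical helper)."""
    return f"{UDM_OPEN}{fragment}{UDM_CLOSE}"


def strip_noindex(html: str) -> str:
    """Crude emulation: drop every marked region before indexing."""
    return _UDM_BLOCK.sub("", html)


article = (
    "<p>Apple juice is a drink.</p>"
    + wrap_noindex('<ol class="references"><li>Some reference</li></ol>')
)
indexable = strip_noindex(article)
```

Since the markers are ordinary HTML comments, they do not change how the article renders. Each wrapped region adds 35 bytes before compression, so the effect on ZIM size is probably small, though that has not been measured here.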
Copied from original issue: kiwix/kiwix#244