openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Improve the quality of article content indexed by xapian #1725

Open kelson42 opened 7 years ago

kelson42 commented 7 years ago

From @automactic on June 14, 2016 21:36

Problem:

In the current xapian indexing process, the article content extracted by omega contains a lot of useless information, such as the reference section, the legal footnote, and the inline references.

Desired Output:

A clean string of article content, without the reference section, the legal footnote, or the inline references.

Example: the "Apple juice" article in wikipedia_en_simple_all_2016-05.zim. Here is the info extracted by the omega HTML parser and passed to xapian for indexing:

Title:Apple juice
Keywords:
Snippet:Apple juice Apple juice Not to be confused with cider. Apple juice is the juice from apples. It does not have alcohol, and it tastes sweet from the natural fruit sugars. Many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar
Content:apple juice apple juice not to be confused with cider. apple juice is the juice from apples. it does not have alcohol, and it tastes sweet from the natural fruit sugars. many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar. origin the apple tree came from the same era as elizabethan in the late 1500's and early 1600's (pyrus malus), and is native to britain. even in the old saxon papers, apples and cider are mentioned a lot.[1] the fruit is thought to have come in the caucasus, a place with many mountains between the black and caspian seas.[1] the lady apple, a kind of apple still grown today, is believed to be one of the oldest apple trees on record. healthiness it is remarkable how closely the history of the apple tree is connected with that of man. —henry david thoreau in both facts and stories, the apple appears to be very healthy. there are two types of apple juice. one is the clear apple juice, and the other is the cloudy apple juice. pectin and starch are taken out during the production process to produce clear apple juice. cloudy apple juice is cloudy because of evenly-distributed small pulp suspensions in the juice concentrate.[1] also, in apple juice, the vitamin c, and other vitamins are contained inside, as well as mineral nutrients such as boron which helps build strong bones. research from the university of massachusetts lowell shows that apple juice also increases acetylcholine in the brain, which gets you increased memory. apples can also be a main source of fiber, and is a powerful cleanser and an important necessity for the health of your body.[2] the compounds in apple juice called phytonutrients delay the break down of ldl or cholesterol. in history, the phrase from benjamin franklin "an apple a day keeps the doctor away" is very famous. new research is proving this phrase to be a fact. 
researchers at uc davis school of medicine have recently found out that drinking apple juice seems to slow down the process that may lead to heart disease. researchers at the university of groningen in the netherlands had studied and found that smokers who ate many fruits and vegetables, especially apples, had reduced their risk of getting the common diseases smokers would get. the risk was reduced by 50%.[2] for older people, drinking fruit juices should begin with apples, especially if they are suffering from arthritis and rheumatism. this is because apples carry a substantial amount of potassium. because of this, eating apples or apple juice has been known to help. drinking apple juice also removes some toxins from the liver and kidneys and is low in calories. over time, this can reduce the chances of having liver or kidney disease.[2] use apple juice can be used to make cider and calvados. some types of cider and all types of calvados contain alcohol. production addressed as one of the most popular fruits in the world, the apple is cultivated in around 7,500 different kinds in shape, color, texture, firmness, crispness, acidity, juiciness, sweetness, nutrition, and harvesting time.[1] references 1 2 3 4 "apple juice". agriculturalproductsindia.com. http://www.agriculturalproductsindia.com/beverages-juices/beverages-juices-apple-juice.html. retrieved 28 april 2010. 1 2 3 "apple juice". soymilkquick.com. http://www.soymilkquick.com/applejuice.php. retrieved 28 april 2010. this article is issued from wikipedia - version of the tuesday, april 26, 2016. the text is available under the creative commons attribution/share alike but additional terms may apply for the media files.

Possible Solution:

Add UdmComment markup to comment out parts of the HTML, so that the omega HTML parser can ignore them. (source)
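To make the proposal concrete, here is a minimal sketch of what the tagging step could look like. It assumes the marker pair is `<!--UdmComment-->` ... `<!--/UdmComment-->`, which is how the Omega documentation describes the sections its HTML parser skips, and that the input HTML is reasonably well-formed. The function name `wrap_unindexed` and the class names passed to it are hypothetical; this is not code from zimwriterfs or mwoffliner.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UdmCommentWrapper(HTMLParser):
    """Re-emit HTML, wrapping elements with given classes in UdmComment markers."""

    def __init__(self, skip_classes):
        # Keep character references untouched so the output stays close to the input.
        super().__init__(convert_charrefs=False)
        self.skip_classes = set(skip_classes)
        self.out = []
        self.depth = 0  # nesting depth inside the element currently being wrapped

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag not in VOID_TAGS:
            if self.depth == 0 and classes & self.skip_classes:
                self.out.append("<!--UdmComment-->")
                self.depth = 1
            elif self.depth:
                self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.out.append("<!--/UdmComment-->")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def wrap_unindexed(html, skip_classes):
    """Return html with every element carrying one of skip_classes wrapped in markers."""
    parser = UdmCommentWrapper(skip_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Each wrapped element costs two short comments (about 35 bytes before compression), so the effect on file size depends on how many references an article has.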

Copied from original issue: kiwix/kiwix#244

kelson42 commented 7 years ago

I agree with the principle of adding metatag information to opt parts of the HTML text out of indexing. I'm not sure this is the right thing to do for all the examples you have given, but it definitely sounds like a good approach. Give it a try!

kelson42 commented 7 years ago

From @automactic on June 19, 2016 20:48

What code should I modify to add comments to the HTML strings? Also, do you think adding comments to the HTML will increase the size of ZIM files?

mgautierfr commented 7 years ago

I've implemented a few things in commits https://github.com/openzim/zimwriterfs/commit/79921c8efa09eae9bb7f6a69edfcfb588ad63589 and https://github.com/openzim/zimwriterfs/commit/7302f4730a01af298f6cc9eba97b7a2df9d93895 (not merged). At indexing time, they try to remove spans containing a reference or backlink.

However, it seems that the HTML span class names change depending on the language of the article. For example, in English it is mw-cite-backlink and in French it is reference-text. It seems pretty complicated to implement this correctly in libzim or zimwriterfs. Maybe mwoffliner should be the one to parse the HTML and tag which content should be indexed.
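As a rough illustration of the scraper-side approach, the sketch below keeps a per-language table of class names and extracts only the text outside those elements. The two table entries are the ones mentioned in this comment; a real table would need many more and would have to be maintained per wiki. The names `SKIP_CLASSES_BY_LANG`, `IndexTextExtractor`, and `index_text` are hypothetical and do not come from mwoffliner, libzim, or zimwriterfs.

```python
from html.parser import HTMLParser

# Class names to exclude from indexing, keyed by wiki language.
# Only the two names quoted in the thread are listed; the table is illustrative.
SKIP_CLASSES_BY_LANG = {
    "en": {"mw-cite-backlink"},
    "fr": {"reference-text"},
}

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexTextExtractor(HTMLParser):
    """Collect text, ignoring everything inside elements with a skipped class."""

    def __init__(self, skip_classes):
        super().__init__()
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & self.skip_classes:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags carry no text and no nesting.
        pass

    def handle_endtag(self, tag):
        if self.skip_depth and tag not in VOID_TAGS:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def index_text(html, lang):
    """Return whitespace-normalised text for indexing, per the language's skip list."""
    parser = IndexTextExtractor(SKIP_CLASSES_BY_LANG.get(lang, set()))
    parser.feed(html)
    parser.close()
    # Crude joining: text chunks are separated by a single space.
    return " ".join(" ".join(parser.chunks).split())
```

Because the skip list is looked up by language, the same article HTML gives different results for `"en"` and `"fr"`, which is exactly the maintenance burden described above.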

Jaifroid commented 1 year ago

Not sure if this is still relevant, but having stumbled across this issue while looking for something else, I'd just like to point out the value of indexing references (footnotes/endnotes) from an academic perspective. If I am searching for information about some obscure historical figure, it would be very valuable to be able to find quickly a bibliographical resource on that person, say, in an article that might be about some other event I would never have thought to look under.

mgautierfr commented 1 year ago

libzim now provides an IndexData interface that lets the scraper supply the data it wants indexed. Either we close this issue or we move it to the scraper side (mwoffliner?).

kelson42 commented 1 year ago

@mgautierfr Mostly Wikipedia/Mediawiki stuff. Moving to openzim/mwoffliner.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.