kiwix / kiwix-xulrunner

[ARCHIVED] Legacy Kiwix desktop solution for Windows/macOS/Linux
https://download.kiwix.org/release/kiwix-xulrunner/
GNU General Public License v3.0

Improve the quality of article content indexed by xapian #244

Closed automactic closed 7 years ago

automactic commented 8 years ago

Problem:

In the current xapian indexing process, the article content extracted by omega contains a lot of useless information, such as the reference section, the legal footnote, and the inline references.

Desired Output:

A clean string of article content, without the reference section, the legal footnote, or the inline references.

Example: the "Apple juice" article in wikipedia_en_simple_all_2016-05.zim. Here is the info extracted by the omega HTML parser and passed to xapian for indexing:

Title:Apple juice
Keywords:
Snippet:Apple juice Apple juice Not to be confused with cider. Apple juice is the juice from apples. It does not have alcohol, and it tastes sweet from the natural fruit sugars. Many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar
Content:apple juice apple juice not to be confused with cider. apple juice is the juice from apples. it does not have alcohol, and it tastes sweet from the natural fruit sugars. many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar. origin the apple tree came from the same era as elizabethan in the late 1500's and early 1600's (pyrus malus), and is native to britain. even in the old saxon papers, apples and cider are mentioned a lot.[1] the fruit is thought to have come in the caucasus, a place with many mountains between the black and caspian seas.[1] the lady apple, a kind of apple still grown today, is believed to be one of the oldest apple trees on record. healthiness it is remarkable how closely the history of the apple tree is connected with that of man. —henry david thoreau in both facts and stories, the apple appears to be very healthy. there are two types of apple juice. one is the clear apple juice, and the other is the cloudy apple juice. pectin and starch are taken out during the production process to produce clear apple juice. cloudy apple juice is cloudy because of evenly-distributed small pulp suspensions in the juice concentrate.[1] also, in apple juice, the vitamin c, and other vitamins are contained inside, as well as mineral nutrients such as boron which helps build strong bones. research from the university of massachusetts lowell shows that apple juice also increases acetylcholine in the brain, which gets you increased memory. apples can also be a main source of fiber, and is a powerful cleanser and an important necessity for the health of your body.[2] the compounds in apple juice called phytonutrients delay the break down of ldl or cholesterol. in history, the phrase from benjamin franklin "an apple a day keeps the doctor away" is very famous. new research is proving this phrase to be a fact. 
researchers at uc davis school of medicine have recently found out that drinking apple juice seems to slow down the process that may lead to heart disease. researchers at the university of groningen in the netherlands had studied and found that smokers who ate many fruits and vegetables, especially apples, had reduced their risk of getting the common diseases smokers would get. the risk was reduced by 50%.[2] for older people, drinking fruit juices should begin with apples, especially if they are suffering from arthritis and rheumatism. this is because apples carry a substantial amount of potassium. because of this, eating apples or apple juice has been known to help. drinking apple juice also removes some toxins from the liver and kidneys and is low in calories. over time, this can reduce the chances of having liver or kidney disease.[2] use apple juice can be used to make cider and calvados. some types of cider and all types of calvados contain alcohol. production addressed as one of the most popular fruits in the world, the apple is cultivated in around 7,500 different kinds in shape, color, texture, firmness, crispness, acidity, juiciness, sweetness, nutrition, and harvesting time.[1] references 1 2 3 4 "apple juice". agriculturalproductsindia.com. http://www.agriculturalproductsindia.com/beverages-juices/beverages-juices-apple-juice.html. retrieved 28 april 2010. 1 2 3 "apple juice". soymilkquick.com. http://www.soymilkquick.com/applejuice.php. retrieved 28 april 2010. this article is issued from wikipedia - version of the tuesday, april 26, 2016. the text is available under the creative commons attribution/share alike but additional terms may apply for the media files.

Possible Solution:

Add UdmComment markup to comment out parts of the HTML, so that the omega HTML parser ignores them. (source)
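
A minimal sketch of what that could look like, in Python for illustration only. Omega's HTML parser skips text between `<!--UdmComment-->` and `<!--/UdmComment-->`. The function names, the sample HTML, and the assumption that inline references are rendered as `<sup class="reference">` elements are all hypothetical, not taken from the actual code that generates the ZIM files.

```python
import re

# Marker comments that omega's HTML parser treats as "do not index" delimiters.
UDM_OPEN = "<!--UdmComment-->"
UDM_CLOSE = "<!--/UdmComment-->"

# Assumption: inline references are rendered as <sup class="reference">...</sup>.
# This pattern is illustrative and does not handle nested <sup> elements.
INLINE_REF = re.compile(r'<sup[^>]*class="reference"[^>]*>.*?</sup>', re.DOTALL)


def exclude_from_index(fragment: str) -> str:
    """Wrap an HTML fragment in UdmComment markers."""
    return f"{UDM_OPEN}{fragment}{UDM_CLOSE}"


def mark_inline_references(html: str) -> str:
    """Wrap every inline reference matched by INLINE_REF in the markers."""
    return INLINE_REF.sub(lambda m: exclude_from_index(m.group(0)), html)


# Hypothetical sample fragment, loosely based on the article above.
sample = 'native to Britain.<sup class="reference"><a href="#cite_note-1">[1]</a></sup> The fruit'
marked = mark_inline_references(sample)
print(marked)
```

The same `exclude_from_index` helper could wrap the reference section and the legal footnote if those can be located in the HTML. Each wrapped fragment adds 35 bytes of markup before compression.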

kelson42 commented 8 years ago

I agree with the principle of adding metatag information to opt parts of the HTML text out of indexing. I'm not sure this is the right thing to do for all the examples you have given, but it definitely sounds like a good approach. Give it a try!

automactic commented 8 years ago

Which code should I modify to add comments to the HTML strings? Also, do you think adding comments to the HTML will increase the size of the ZIM files?

kelson42 commented 7 years ago

This issue was moved to openzim/mwoffliner#1725