iiab / internet-in-a-box

Wikipedia Search #42

Closed · braddockcg closed 11 years ago

braddockcg commented 11 years ago

Kiwix indexes Wikipedia ZIM files using the search engine Xapian. These indices are in knowledge/modules/wikipedia-kiwix.

We are moving away from using Kiwix at run-time, and instead are extracting articles from the zim files more directly in Python.

By eliminating Kiwix at run-time we have lost full-text Wikipedia search. We need Python code to perform searches on the Kiwix indices. The Kiwix C++ source code can be used as a guide to how to search (although it doesn't produce the greatest results). Xapian bindings are available for Python.
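For reference, searching one of these indices from Python looks roughly like the Xapian simplesearch example; a minimal sketch (the stemmer language and what Kiwix stores in each document's data are assumptions here):

import xapian

def search(index_path, query_string, limit=10):
    db = xapian.Database(index_path)
    enquire = xapian.Enquire(db)

    qp = xapian.QueryParser()
    qp.set_database(db)
    qp.set_stemmer(xapian.Stem("english"))  # assumption: English-language index
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire.set_query(qp.parse_query(query_string))

    for match in enquire.get_mset(0, limit):
        # Kiwix is assumed to store the article URL/title in the document data
        print("%d. %d%% %s" % (match.rank + 1, match.percent,
                               match.document.get_data()))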

braddockcg commented 11 years ago

The Wikipedia indices work out of the box with the Xapian example Python code at: http://xapian.org/docs/bindings/python/examples/simpleexpand.py

cd /knowledge/modules/wikipedia-kiwix
simpleexpand.py wikipedia_en_all_nopic_01_2012 '*olpc*'
braddockcg commented 11 years ago

James, I've made it easy to iterate through all Wikipedia articles, with titles, in a ZIM file. You could use this to build a title search index; I'd prefer that be done in Whoosh. I'd like Xapian to remain a dependency only for full-text search on the Kiwix indices (although if you think it is easier for us to index everything ourselves with Whoosh, that is fine by me).

Here is example usage of the iiab/zimpy.py interface to retrieve titles and other information. ZimFile.articles() returns a generator. You probably only care about the 'title' and 'fullUrl' fields.

In [7]: import iiab.zimpy
In [8]: z = iiab.zimpy.ZimFile("/bulk/knowledge2/modules/wikipedia-zim/wikipedia_gn_all_01_2013.zim")
In [10]: list(z.articles())[1000]
Out[10]: 
{'blobNumber': 24,
 'clusterNumber': 5,
 'fullUrl': u'A/Esteio.html\n',
 'mimetype': 0,
 'namespace': 'A',
 'parameter': '',
 'parameterLen': 0,
 'revision': 0,
 'title': u'Esteio',
 'url': u'Esteio.html'}
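A title index along those lines could be built in Whoosh with something like the sketch below. The schema, the filter to the 'A' namespace, and the index directory handling are my assumptions; note that fullUrl carries a trailing newline (visible in the output above), hence the strip().

import os
import iiab.zimpy
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

def build_title_index(zim_path, index_dir):
    # Store title and URL so search results can link straight to the article
    schema = Schema(title=TEXT(stored=True), url=ID(stored=True, unique=True))
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)
    ix = create_in(index_dir, schema)
    writer = ix.writer()
    z = iiab.zimpy.ZimFile(zim_path)
    for article in z.articles():
        if article['namespace'] != 'A':  # assumption: 'A' holds the articles
            continue
        writer.add_document(title=article['title'],
                            url=article['fullUrl'].strip())  # drop trailing '\n'
    writer.commit()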
omwah commented 11 years ago

Thanks. I have been reading up on Whoosh. I think I will build an index that includes the full text, stripping out the HTML tags so that only the article contents are indexed. That seems like the best option, rather than having two different search mechanisms.
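A rough sketch of the tag stripping with the standard library's HTMLParser, which also drops script and style bodies that naive tag removal would leave in:

from HTMLParser import HTMLParser  # Python 3: from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def strip_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    # collapse runs of whitespace left behind by the removed markup
    return ' '.join(''.join(extractor.chunks).split())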

omwah commented 11 years ago

It also seems to me that we need to do the full-text indexing ourselves anyway, because of language-specific issues like stop words. In lieu of a stop-word list for every supported language, we will need to look at the inverted index and trim out terms that occur in an outsized share of documents.
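A sketch of that kind of inspection, assuming a Whoosh index with a 'content' field; most_frequent_terms() and doc_frequency() are part of Whoosh's reader API:

from whoosh.index import open_dir

ix = open_dir("wikipedia-index")  # hypothetical index directory
reader = ix.reader()
try:
    total = reader.doc_count()
    for freq, term in reader.most_frequent_terms("content", number=50):
        df = reader.doc_frequency("content", term)
        # a term present in nearly every document behaves like a stop word
        print("%-20s occurrences=%d in %d/%d docs" % (term, freq, df, total))
finally:
    reader.close()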

braddockcg commented 11 years ago

You can download an english language zim from: http://www.kiwix.org/wiki/Wikipedia_in_all_languages

I would suggest the one labeled "6,000 for schools", which is "Wikipedia Selection for Schools" and is a manageable size.

You can also download a torrent with the Xapian index pre-generated from that page.

You can use my zimpy.py to iterate through the articles if you want to build your own full-text search indices. It would certainly be a cleaner way to go, and it removes the Xapian dependency.
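Querying such a Whoosh index would then look roughly like this (the directory and field names follow the assumptions in the sketches above):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("wikipedia-index")  # hypothetical directory from the sketches above
with ix.searcher() as searcher:
    # parse against 'content' for full text; use 'title' for a title-only index
    query = QueryParser("content", ix.schema).parse(u"olpc")
    for hit in searcher.search(query, limit=10):
        print("%s -> %s" % (hit['title'], hit['url']))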

My understanding is that Whoosh is about half the speed of Xapian, but I haven't done any performance testing.