ipwndev / dswiki

Automatically exported from code.google.com/p/dswiki

Add switch to indexer to filter out low word count articles #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1. There is currently no mechanism to reduce the size of a processed wiki so that it fits onto smaller cards.
2. The wiki dumps that do fit onto smaller cards are out of date.
3. The only documents filtered out are those that are not supported (e.g. templates, metadata).

The indexer should offer a simple and elegant way to trim the size of large dumps. One easy way would be to filter out articles that have a low word count or low character count, as these are often "stubs" and do not contain much useful information. There could be an extra option to include or exclude redirects, since those also tend to have a low character count and would otherwise be caught by the same filter.
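
A minimal sketch of what such a filter could look like, written in Python for illustration only. It is not taken from the dswiki indexer: the function names, the `min_words` and `keep_redirects` parameters, and the `(title, text)` input shape are all hypothetical. It assumes redirects can be recognized by the MediaWiki `#REDIRECT` prefix and that splitting on whitespace is a good enough word count.

```python
import re
from typing import Iterable, Iterator, Tuple

# MediaWiki redirect pages start with "#REDIRECT" (case-insensitive).
REDIRECT_RE = re.compile(r"^\s*#redirect\b", re.IGNORECASE)


def is_redirect(text: str) -> bool:
    """Return True if the article body is a redirect page."""
    return REDIRECT_RE.match(text) is not None


def word_count(text: str) -> int:
    """Rough word count: number of whitespace-separated tokens."""
    return len(text.split())


def filter_articles(
    articles: Iterable[Tuple[str, str]],
    min_words: int = 0,
    keep_redirects: bool = True,
) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs that pass the size filter.

    Hypothetical options:
      min_words      -- drop non-redirect articles with fewer words than this
      keep_redirects -- redirects bypass the word-count check when True,
                        and are dropped entirely when False
    """
    for title, text in articles:
        if is_redirect(text):
            if keep_redirects:
                yield title, text
            continue
        if word_count(text) >= min_words:
            yield title, text


if __name__ == "__main__":
    sample = [
        ("Long article", "one two three four five"),
        ("Stub", "stub"),
        ("Alias", "#REDIRECT [[Long article]]"),
    ]
    for title, _ in filter_articles(sample, min_words=3, keep_redirects=False):
        print(title)
```

Redirects are handled separately from the word-count check so that a low threshold does not silently drop them; a character-count threshold could be added the same way by comparing `len(text)`.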

Original issue reported on code.google.com by charles....@gmail.com on 10 Aug 2010 at 2:21