openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0

Fix stopwords #324

Open ghost opened 4 years ago

ghost commented 4 years ago

@kelson42 commented on Mar 12, 2017, 8:21 PM UTC:

Stopwords are words which should not be indexed (during the full-text (FT) indexing process) and should also be ignored during FT search. These stopwords are language-specific. Lists are provided by Xapian and are used at least in zimwriterfs. This task is about checking that everything works correctly on the indexing side and also on the reader side.
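To make the expected behaviour concrete, here is a minimal sketch in plain Python. It is a toy model, not Xapian's or libzim's API: the stopword list, the sample documents, and the helper names (`tokenize`, `build_index`, `search`) are all made up for illustration. It only shows the two places where the same stopword list has to be applied: once when indexing and once when searching.

```python
# Toy model of stopword handling. Everything here is hypothetical:
# the word list, the documents and the helper names are NOT Xapian's API.
FRENCH_STOPWORDS = {"il", "le", "la", "les", "de", "et"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(docs, stopwords):
    """Build an inverted index (term -> set of titles), skipping stopwords."""
    index = {}
    for title, text in docs.items():
        for term in tokenize(text):
            if term in stopwords:
                continue  # indexing side: stopwords are never stored
            index.setdefault(term, set()).add(title)
    return index


def search(index, query, stopwords):
    """Return titles matching all non-stopword query terms (AND semantics)."""
    terms = [t for t in tokenize(query) if t not in stopwords]
    if not terms:
        return set()  # reader side: a query made only of stopwords matches nothing
    results = None
    for term in terms:
        postings = index.get(term, set())
        results = set(postings) if results is None else results & postings
    return results


docs = {
    "Paris": "il visite le musée",
    "Londres": "la ville de Londres",
}
index = build_index(docs, FRENCH_STOPWORDS)
```

With this model, `search(index, "il ", FRENCH_STOPWORDS)` returns an empty set even though "il" occurs in a document, which is the behaviour this task is meant to verify on both sides.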

This issue was moved by kelson42 from kiwix/kiwix-lib#26.

ghost commented 4 years ago

@kelson42 commented on Apr 16, 2017, 7:29 PM UTC:

@mgautierfr

I have tested with the following file (created recently with the latest master versions of libzim, libkiwix, kiwix-source and zimwriterfs): http://tmp.kiwix.org/wikipedia_fr_articles_2017-04.zim

If I search (using kiwix-serve) for "il " I get no results, even though the word "il" appears in many articles... so it works.

But if I search for "le ", I get only "Londres" as a result, which is strange: this word appears in all articles, and because it is an obvious stopword it should not appear anywhere in the index.

Something looks wrong here.

ghost commented 4 years ago

@kelson42 commented on Apr 16, 2017, 7:33 PM UTC:

Another question: openzim/libzim#12

ghost commented 4 years ago

@mgautierfr commented on Jun 12, 2017, 10:16 AM UTC:

Seems to be a Xapian bug while indexing (https://trac.xapian.org/ticket/750)

The "lea" of "lea valley" in the "Londres" article is stemmed to "le" but not stopped correctly.
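A toy illustration of how such a term could slip through, in plain Python. This is not Xapian code, and the ordering shown (stopword check on the raw word, then stemming) is an assumption made for the sake of the example. The one-entry stem table `TOY_STEMS` and the function names are hypothetical; the "lea" to "le" mapping is taken from the observation above.

```python
# Toy model only: hypothetical names, not Xapian's TermGenerator.
STOPWORDS = {"il", "le", "la"}
TOY_STEMS = {"lea": "le"}  # mimics the reported stemming of "lea" to "le"


def toy_stem(term):
    """Return the toy stem of a term, or the term itself if none is known."""
    return TOY_STEMS.get(term, term)


def index_terms_stop_before_stem(text):
    """Assumed problematic order: check the raw word, then index its stem."""
    terms = set()
    for raw in text.lower().split():
        if raw in STOPWORDS:
            continue  # "le" is dropped here, but "lea" is not a stopword
        terms.add(toy_stem(raw))  # "lea" is stored as "le"
    return terms


def index_terms_stop_after_stem(text):
    """Alternative order: also check the stemmed form against the list."""
    terms = set()
    for raw in text.lower().split():
        stem = toy_stem(raw)
        if raw in STOPWORDS or stem in STOPWORDS:
            continue
        terms.add(stem)
    return terms
```

In this model, `index_terms_stop_before_stem("le lea valley")` contains "le" while `index_terms_stop_after_stem("le lea valley")` does not. Note that the second variant drops "lea" from the index entirely, so it is not obviously the right fix; it is only shown to make the ordering problem visible.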

ghost commented 4 years ago

stale[bot] commented on Nov 21, 2019, 12:08 AM UTC:

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 3 years ago

@maneeshpm Maybe you would be interested in refreshing (and maybe fixing) this ticket, as it deals with two other aspects of Xapian search: stopwords and stemming. IMO we first need to confirm that the reported problems are still there.

maneeshpm commented 3 years ago

@kelson42 I can confirm that we still have this issue. But, just as the Xapian team suggested on the upstream ticket, its impact is fairly limited. We have this issue in FT search when a non-stopword term is stemmable to a stopword. For example,

The index-time upstream bug still exists for the first case. I tried the suggestions mentioned in the Xapian ticket shared by Matthieu, but they did not help. I have pinged the Xapian team to inform them of this, but have yet to receive a response.

kelson42 commented 3 years ago

@maneeshpm Thank you for the update. Where did you ping the Xapian team? Has the upstream ticket not been updated?

maneeshpm commented 3 years ago

@kelson42 I pinged them on the Xapian IRC channel. Should I update that ticket as well?

kelson42 commented 3 years ago

@maneeshpm No, thanks.

kelson42 commented 9 months ago

@mgautierfr I am reactivating this ticket as it seems the upstream Xapian developer is working on the topic. Could you please give him some feedback?

kelson42 commented 6 months ago

@mgautierfr Any feedback here?

mgautierfr commented 5 months ago

Nothing to do on our side, at least for this specific issue; any update has to come from the Xapian side.