Open ghost opened 4 years ago
@kelson42 commented on Apr 16, 2017, 7:29 PM UTC:
@mgautierfr
I have tested with the following file (created recently with the latest master versions of libzim, libkiwix, kiwix-source and zimwriterfs): http://tmp.kiwix.org/wikipedia_fr_articles_2017-04.zim
If I search (using kiwix-serve) for "il ", I get no results (even though the word "il" appears in many articles), so it works.
But if I search for "le ", I get only "Londres" as a result, which is strange: this word appears in all articles, and because it is an obvious stop word it should appear nowhere.
Something looks to be wrong here.
@kelson42 commented on Apr 16, 2017, 7:33 PM UTC:
Another question: openzim/libzim#12
@mgautierfr commented on Jun 12, 2017, 10:16 AM UTC:
Seems to be a Xapian bug while indexing (https://trac.xapian.org/ticket/750)
The "lea" of "lea valley" in the "Londres" article is stemmed to "le" but not stopped correctly.
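To make the diagnosis above concrete, here is a minimal toy model in Python. It is not Xapian's code and uses no Xapian API: the stop list excerpt, the `TOY_STEMS` lookup table and both function names are hypothetical, and the "Z" prefix merely mimics how stemmed terms are displayed. It assumes the stop list is checked against the original word and stemming happens afterwards, which is one way a stopword-shaped stem could end up in the index.

```python
# Toy model only: NOT Xapian's implementation, just an illustration.
STOPWORDS = {"le", "la", "les", "il"}    # hypothetical excerpt of a French stop list
TOY_STEMS = {"lea": "le", "les": "le"}   # hypothetical stemmer output for this example


def toy_stem(word):
    """Return the toy stem of a word (identity when the word is not in the table)."""
    return TOY_STEMS.get(word, word)


def index_terms_stop_then_stem(text):
    """Check the stop list on the original word, then stem what is left."""
    terms = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue  # "le" itself is dropped here
        terms.append("Z" + toy_stem(word))  # "lea" is kept and becomes "Zle"
    return terms


def index_terms_stop_on_stem_too(text):
    """Contrast case: also drop words whose stem is a stopword."""
    terms = []
    for word in text.lower().split():
        stem = toy_stem(word)
        if word in STOPWORDS or stem in STOPWORDS:
            continue
        terms.append("Z" + stem)
    return terms


if __name__ == "__main__":
    print(index_terms_stop_then_stem("le Lea Valley"))
    print(index_terms_stop_on_stem_too("le Lea Valley"))
```

In this sketch the first function leaves a stemmed "le" term in the index for a document that contains "Lea", while the second does not. The second function is only a contrast to show where the ordering matters, not a proposed fix: it would also make "Lea" unsearchable.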
stale[bot] commented on Nov 21, 2019, 12:08 AM UTC:
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@maneeshpm Maybe you would be interested in refreshing (and maybe fixing) this ticket, as it deals with two other aspects of Xapian search: stop words and stemming. IMO we first need to confirm that the reported problems are still there.
@kelson42 I can confirm that we still have this issue. But just as the Xapian team suggested on the upstream ticket, its impact is fairly limited. We have this issue in FT search when a "nonstopword term" is stemmable to a stopword. For example, "Le", "Les" or "Lea" (Lea is stemmed to Le):
We can see that it matches all the stemmed versions in our document, which include the stopwords "le" and "les" (which it shouldn't). The reason is evident from the parsed query Query(Zle@1), which is the same for the three terms. We can also see this via the snippets: ...<b>le</b> fish ...-frites), <b>le</b> haggis... farcie), <b>les</b> pies ... encore <b>le</b> Sunday...
But other stopwords which do not have this property are stopped properly.
"Le sunday": Things work as expected. The parsed query, Query(Zsunday@2), doesn't include the stopword, and stopwords are not picked up in the snippets either: ...ou encore le <b>Sunday</b> roast...
"Le la": Again, this works as expected, since "La" doesn't have the first property, and we get no matches.
The index-time upstream bug still exists for the first case. I tried the suggestions mentioned in the Xapian ticket shared by Matthieu, but without success. I have pinged the Xapian team to let them know, but have yet to receive a response.
@maneeshpm Thank you for the update. Where have you pinged the Xapian team? Has the upstream ticket not been updated?
@kelson42 I pinged them on the Xapian IRC. Should I update that ticket as well?
@maneeshpm No, thanks.
@mgautierfr I am reactivating this ticket as it seems the upstream Xapian dev is working on the topic. Could you please give him some feedback?
@mgautierfr Any feedback here?
Nothing to do on our side, at least regarding this specific issue and the update on the Xapian side.
@kelson42 commented on Mar 12, 2017, 8:21 PM UTC:
Stopwords are words which should not be indexed (during the FT index process) and should also be ignored during FT search. These stopwords are language-specific. Lists are provided by Xapian and are at least used in zimwriterfs. This task is about checking that everything works fine on the indexing side but also on the reader side.
This issue was moved by kelson42 from kiwix/kiwix-lib#26.