slusarz / dovecot-fts-flatcurve

Dovecot FTS Flatcurve plugin (Xapian)
https://slusarz.github.io/dovecot-fts-flatcurve/
GNU Lesser General Public License v2.1

Searching for phrases with IMAP SEARCH vs. doveadm search, v0.2.0 vs. v0.3.0 #32

Closed edieterich closed 2 years ago

edieterich commented 2 years ago

When I search for a phrase, I get the same results with IMAP SEARCH with v0.2.0 and v0.3.0:

2 uid search body "search phrase"
* SEARCH 1003
2 OK Search completed (0.002 + 0.000 + 0.001 secs).

When I do this search with doveadm search, I get the same result with v0.2.0:

# v0.2.0
doveadm search -u ewald body "search phrase" mailbox inbox
1e52731e4b0660614b7201006a82f8f2 1003

With v0.3.0, I don't get any matches with doveadm search:

# v0.3.0
doveadm search -u ewald body "search phrase" mailbox inbox | wc -l
0

Can you confirm that this is a bug in v0.3.0? I would expect this to return the same result as with v0.2.0 and IMAP SEARCH.

slusarz commented 2 years ago

This message:

From: user-from@domain.org
To: user-to@domain.org
Subject: Foo

search phrase bar

IMAP (w/additional debug output):

a UID SEARCH body "search phrase"
Jul 14 18:58:07 imap(user)<956><MFBoe8jjmuR/AAAB>: Debug: fts-flatcurve(imaptest): Opened DB (RO) messages=1 version=1 shards=1
Jul 14 18:58:07 imap(user)<956><MFBoe8jjmuR/AAAB>: Debug: fts-flatcurve(imaptest): Query (body:search AND body:phrase*) maybe_matches=1 uids=1
Jul 14 18:58:07 imap(user)<956><MFBoe8jjmuR/AAAB>: Debug: fts-flatcurve(imaptest): Query (body:search*) matches=1 uids=1
Jul 14 18:58:07 imap(user)<956><MFBoe8jjmuR/AAAB>: Debug: fts-flatcurve(imaptest): Query (body:phras* OR body:phrase*) matches=1 uids=1
* SEARCH 1
a OK Search completed (0.036 + 0.000 + 0.035 secs).

doveadm (w/additional debug output):

root@30919c6bb87a:/dovecot# doveadm -D search -u user body "search phrase" mailbox imaptest
Jul 14 18:58:30 doveadm(user): Debug: fts-flatcurve(imaptest): Opened DB (RO) messages=1 version=1 shards=1
Jul 14 18:58:30 doveadm(user): Debug: fts-flatcurve(imaptest): Query (body:phras* OR body:phrase* AND body:search AND body:phrase*) maybe_matches=1 uids=1
Jul 14 18:58:30 doveadm(user): Debug: fts-flatcurve(imaptest): Query (body:search* AND body:search AND body:phrase*) maybe_matches=1 uids=1
Jul 14 18:58:30 doveadm(user): Debug: Mailbox imaptest: UID 1: Opened mail because: search
d8f0c0020560d062a60200005188a55f 1

So, works here.

I will note that the 2 searches ARE tokenizing the query in different ways. For the IMAP search, it is sending 3 different queries to flatcurve:

  1. "search phrase"
  2. "search"
  3. "phras" OR "phrase"

doveadm is sending only 2 queries:

  1. "search phrase" AND ("phras" OR "phrase")
  2. "search phrase" AND "search"

Frankly, the doveadm query looks a bit broken (although it works in flatcurve because we manually tokenize any phrases passed in as a query). But neither is ideal. Ideally, the query would be:

  1. exact match for "search phrase", OR, if the driver does not support phrase searching, a maybe match for "search" AND "phras*"

However, phrase searching is not currently supported in Dovecot FTS drivers due to Dovecot core limitations. See https://github.com/slusarz/dovecot-fts-flatcurve/issues/27#issuecomment-1130561953 . Example:

root@30919c6bb87a:/dovecot/fts-flatcurve# doveadm -D search -u user body "phrase search" mailbox imaptest
b02a9a3a3e67d062bc0300005188a55f 1

This search matches even though the string "phrase search" does not appear in the message.
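
To illustrate why the reversed query still matches, here is a minimal Python sketch contrasting an AND-of-terms match with a true phrase match. This is not flatcurve's or Dovecot's actual code: the `tokenize`, `and_of_terms_match`, and `phrase_match` helpers are hypothetical, and the tokenizer is a simplification (no stemming, no language handling).

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (hypothetical, simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def and_of_terms_match(body, query):
    """True if every query term is a prefix of some body token, in any order."""
    body_tokens = tokenize(body)
    return all(
        any(tok.startswith(term) for tok in body_tokens)
        for term in tokenize(query)
    )

def phrase_match(body, query):
    """True only if the query tokens appear consecutively and in order."""
    body_tokens, terms = tokenize(body), tokenize(query)
    n = len(terms)
    return n > 0 and any(
        body_tokens[i:i + n] == terms
        for i in range(len(body_tokens) - n + 1)
    )

body = "search phrase bar"  # body of the test message above
print(and_of_terms_match(body, "phrase search"))  # True: both terms are present
print(phrase_match(body, "phrase search"))        # False: wrong order
print(phrase_match(body, "search phrase"))        # True: exact phrase
```

An AND-of-terms query only checks that each term occurs somewhere in the message, so word order is ignored. A real phrase search would need position information, which is what the linked comment says is not currently available to FTS drivers.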

slusarz commented 2 years ago

Doing the above testing did make me realize that flatcurve doesn't have to handle phrases at all internally - it's just extra work that doesn't change the results of the query (the testing confirmed that both IMAP searches and doveadm searches will correctly pass all component terms into the query as well). So I've gone ahead and optimized by removing that code: https://github.com/slusarz/dovecot-fts-flatcurve/commit/6afe5f10a853a59fe98c3a94583781acdb579229.

edieterich commented 2 years ago

This turned out to be a problem with the order of fts_languages: with fts_languages = de en, doveadm returned no result, while with fts_languages = en de, doveadm and IMAP SEARCH returned the same result.

Anyway, I get the expected result with your latest change, no matter how I order fts_languages. Thanks.