slusarz / dovecot-fts-flatcurve

Dovecot FTS Flatcurve plugin (Xapian)
https://slusarz.github.io/dovecot-fts-flatcurve/
GNU Lesser General Public License v2.1

Search by some email addresses is not working #35

Closed amelentjev closed 2 years ago

amelentjev commented 2 years ago

I've discovered that email addresses like j.ohn@doe.com or joh.n@doe.com cannot be found. An additional problem is that I can't force Dovecot to use its index files for header searches once an FTS index has been built.

Plugin configuration:

plugin {
    fts = flatcurve

    fts_enforced = no

    fts_autoindex_exclude  = \Junk
    fts_autoindex_exclude2 = \Trash

    fts_filters = normalizer-icu snowball stopwords
    fts_filters_en = lowercase snowball english-possessive stopwords
    fts_languages = ru en
    fts_tokenizers = generic email-address
    fts_tokenizer_generic = algorithm=simple
}

slusarz commented 2 years ago

> I've discovered that email addresses like j.ohn@doe.com or joh.n@doe.com cannot be found.

Can't reproduce.

Given this message:

From user@domain  Fri Feb 22 17:06:23 2008
From: user-from@domain.org
To: user-to@domain.org, test.dot@example.com

body

A search for "test.dot@example.com" returns the message:

root@13ba97ddc196:/# doveadm search -u user mailbox test text test.dot@example.com
7c9e31270f66376385010000c9769878 1

I also verified that the address is contained in the flatcurve index:

root@13ba97ddc196:/# doveadm fts-flatcurve dump -u user test | fgrep test.dot@example.com
test.dot@example.com count=1

> I can't force Dovecot to use its index files for header searches once an FTS index has been built.

Not quite sure what you mean, but regardless flatcurve has nothing to do with choosing how a search is performed. That is handled by core Dovecot code. flatcurve returns results for any query passed to it; it does not decide what that query is or how it is executed.

amelentjev commented 2 years ago

The problem occurs when exactly one character of the email address is separated from the rest of the address by a dot. Examples: a.melentjev@gmail.com, j.ohn@doe.com, and so on. Such email addresses are quite common, especially for corporate mail.

I completely understand that flatcurve has nothing to do with choosing how a search is performed. That was a side note: the problem is a little wider than it seems, because searching for such email addresses (like j.ohn@doe.com) is completely useless. Emails containing them in any part (header or body) cannot be found once an FTS index has been built for the mailbox.

slusarz commented 2 years ago

OK, it turned out the key to reproducing this was that the first part of the mailbox name (i.e. the part before the ".") has to be shorter than the minimum term size. In short, the issue came from the way Xapian::QueryParser handles search strings that contain anything other than "letter" characters. In certain cases it expands the query into a phrase search, which completely ignores the full e-mail addresses stored in the index and instead searches on the smaller address tokens. That leads to strange and incorrect results in some situations.

This has been fixed by removing the use of Xapian::QueryParser and instead building the Xapian search query manually from the Query primitives. That fixes the issue in this ticket and also makes the code simpler and easier to read. So win-win.
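
For anyone following along, the failure mode described above can be illustrated with a toy model. This is only a sketch in plain Python, not flatcurve or Xapian code: the function names, the splitting rule, and the minimum term length of 2 are all assumptions made for illustration. It assumes the index stores the full address plus any sub-tokens that meet the minimum length, and compares a parser-style lookup (split on non-letter characters, require every piece) with a direct single-term lookup.

```python
import re

# Hypothetical minimum term length, assumed only for this illustration.
MIN_TERM_LEN = 2


def split_pieces(text):
    """Split on anything that is not a letter or digit (toy rule)."""
    return [p for p in re.split(r"[^a-z0-9]+", text.lower()) if p]


def index_terms(address, min_len=MIN_TERM_LEN):
    """Terms a toy indexer would store for one e-mail address.

    The full address is always kept; sub-tokens shorter than
    min_len are dropped.
    """
    pieces = {p for p in split_pieces(address) if len(p) >= min_len}
    return {address.lower()} | pieces


def phrase_style_match(query, terms):
    """Parser-like lookup: every piece of the query must be indexed."""
    return all(p in terms for p in split_pieces(query))


def single_term_match(query, terms):
    """Direct lookup of the whole query string as one term."""
    return query.lower() in terms


if __name__ == "__main__":
    for addr in ("test.dot@example.com", "j.ohn@doe.com"):
        terms = index_terms(addr)
        print(addr,
              "phrase-style:", phrase_style_match(addr, terms),
              "single-term:", single_term_match(addr, terms))
```

In this model, "test.dot@example.com" is found either way, which is consistent with the first reproduction attempt succeeding. For "j.ohn@doe.com" the piece "j" is below the assumed minimum length and never indexed, so the phrase-style lookup fails while the single-term lookup still finds the address.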