slusarz / dovecot-fts-flatcurve

Dovecot FTS Flatcurve plugin (Xapian)
https://slusarz.github.io/dovecot-fts-flatcurve/
GNU Lesser General Public License v2.1

Issue with `email-address` tokenizer limit 245 characters #67

Closed h3ssan closed 1 month ago

h3ssan commented 1 month ago

Hey all,

The email-address tokenizer has a length limit of 245 characters, which makes sense. However, many newsletter mail services such as Mailgun and SendGrid use really long email addresses, and this causes a problem when indexing with fts-flatcurve.

First, here's the configuration I used.

plugin {
    fts_autoindex = yes
    fts_autoindex_exclude = \Junk
    fts_autoindex_exclude2 = \Trash
    fts = flatcurve

    # These are not flatcurve settings, but required for Dovecot FTS. See
    # Dovecot FTS Configuration link above for further information.
    fts_languages = en es de
    fts_tokenizer_generic = algorithm=simple
    fts_tokenizers = generic email-address  # <---- here's the `email-address` tokenizer
    # ---------------------------------^

    # OPTIONAL: Recommended default FTS core configuration
    fts_filters = normalizer-icu snowball stopwords
    fts_filters_en = lowercase snowball english-possessive stopwords
}

Once everything was working and healthy, I ran the following command to rebuild the index:

docker compose exec dovecot-mailcow doveadm fts rescan -A && docker compose exec dovecot-mailcow doveadm index -A '*'

However, this warning (or maybe error?) showed up:

doveadm(info@example.com): Warning: fts-flatcurve(Trash): Could not write message data: uid=1; InvalidArgumentError: Term too long (> 245): Aau+mq6tanjqhftdijtjhu4wmojxgm3dizbygnrgkyzqmzsgmy3bhe4dqodfgvqtimdggnrgcjjugbzxo2lgoqxgozlomvzgc5dfmqtgqpldmzrwcn3ggbstsytfme3dizbsgbrgizjxgrrdqyzxmvrtqobrgytg2yljnrpwszb5gm2donbygizcm4r5nfxhizlsnzsxillqn5zxijjugbwgs5tffzqxijtuhusteqiudjlg@locald.res
doveadm(info@example.com): Warning: fts-flatcurve(Sent): Could not write message data: uid=2; InvalidArgumentError: Term too long (> 245): Aau+mq6tanjqhftdijtjhu4wmojxgm3dizbygnrgkyzqmzsgmy3bhe4dqodfgvqtimdggnrgcjjugbzxo2lgoqxgozlomvzgc5dfmqtgqpldmzrwcn3ggbstsytfme3dizbsgbrgizjxgrrdqyzxmvrtqobrgytg2yljnrpwszb5gm2donbygizcm4r5nfxhizlsnzsxillqn5zxijjugbwgs5tffzqxijtuhusteqi@survey.pledgebo
doveadm(info@example.com): Warning: fts-flatcurve(INBOX): Could not write message data: uid=1; InvalidArgumentError: Term too long (> 245): Aau+mq6tanjqhftdijtjhu4wmojxgm3dizbygnrgkyzqmzsgmy3bhe4dqodfgvqtimdggnrgcjjugbzxo2lgoqxgozlomvzgc5dfmqtgqpldmzrwcn3ggbstsytfme3dizbsgbrgizjxgrrdqyzxmvrtqobrgytg2yljnrpwszb5gm2donbygizcm4r5nfxhizlsnzsxillqn5zxijjugbwgs5tffzqxijtuhusteqi@survey.pledgebo

An example of such a sender/recipient address:

Aau+mq6tanjqhftdijtjhu4wmojxgm3dizbygnrgkyzqmzsgmy3bhe4dqodfgvqtimdggnrgcjjugbzxo2lgoqxgozlomvzgc5dfmqtgqpldmzrwcn3ggbstsytfme3dizbsgbrgizjxgrrdqyzxmvrtqobrgytg2yljnrpwszb5gm2donbygizcm4r5nfxhizlsnzsxillqn5zxijjugbwgs5tffzqxijtuhusteqiudjlg@locald.res

Thank you all!

slusarz commented 1 month ago

This is a duplicate of #62.

Edit: This is slightly different, because the email-address tokenizer doesn't allow the 'maxlen' setting, so there is theoretically no limit on token length. This is therefore less a bug in Dovecot core than a missing feature. However, it leads to essentially the same problem in flatcurve: an attempt is made to index a token longer than Xapian can handle. So the fix proposed in #62 to work around this would be the same.
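
For illustration only, here is a minimal Python sketch of the general idea behind such a workaround: check each token's length before it is handed to Xapian and skip the ones that are too long. This is not flatcurve code and not necessarily what #62 implements. The function name `filter_terms` and the constant `XAPIAN_MAX_TERM_BYTES` are hypothetical; the value 245 is taken from the warning text above, and measuring the length in UTF-8 bytes is an assumption.

```python
# Hypothetical constant: the limit reported in the "Term too long (> 245)" warning.
XAPIAN_MAX_TERM_BYTES = 245


def filter_terms(terms, max_bytes=XAPIAN_MAX_TERM_BYTES):
    """Split terms into (kept, skipped) based on their UTF-8 encoded length.

    Assumption: the limit applies to the encoded byte length, not to the
    number of characters.
    """
    kept, skipped = [], []
    for term in terms:
        if len(term.encode("utf-8")) > max_bytes:
            skipped.append(term)  # too long to index, so leave it out
        else:
            kept.append(term)
    return kept, skipped


# Synthetic example: one normal address and one with a 300-character local part.
short_addr = "info@example.com"
long_addr = "a" * 300 + "@example.com"
kept, skipped = filter_terms([short_addr, long_addr])
print(kept)          # only the short address remains
print(len(skipped))  # one term was skipped
```

Skipping an over-long term means the message can still be indexed, at the cost of that one address not being searchable as an exact term. Truncating the term to the limit instead would be another option, with the trade-off that different long addresses sharing a prefix would collide.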