slusarz / dovecot-fts-flatcurve

Dovecot FTS Flatcurve plugin (Xapian)
https://slusarz.github.io/dovecot-fts-flatcurve/
GNU Lesser General Public License v2.1

Segfault with Dovecot 1:2.3.21+dfsg1-2 from Debian #62

Open sergiodj opened 4 months ago

sergiodj commented 4 months ago

Hi,

For a few days now I haven't been able to use flatcurve because of the errors below:

dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=51735; InvalidArgumentError: Term too long (> 245): _znst10_hashtablei7qstringst4pairiks0_iesais3_enst8__detail10_select1stest8equal_tois0_est4hashis0_ens5_18_mod_range_hashingens5_20_default_ranged_hashens5_20_prime_rehash_policyens5_17_hashtable_traitsilb0elb0elb1eeee9_m_rehashe{size_t}rk{size_t}@ba
dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=51737; InvalidArgumentError: Term too long (> 245): 9qtprivate18qfunctorslotobjectist5_bindifmn5qcoro6detail17waitoperationbasei10qtcpservereefvnst7__n486116coroutine_handleiveeepns3_14qcorotcpserver29waitfornewconnectionoperationes9_eeli0ens_4listijeeeve4impleipns_15qslotobjectbaseep7qobjectppvpb@
dovecot[3384677]: indexer-worker: Error: terminate called after throwing an instance of 'std::bad_alloc'
dovecot[3384677]: indexer-worker: Error:   what():  std::bad_alloc
Mar 08 17:07:06 paluero dovecot[3384677]: imap(user)<3384697><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB>: Error: Mailbox INBOX: indexer failed to index mailbox
Mar 08 17:07:06 paluero dovecot[3384677]: indexer-worker(user)<3384699><UehlaS0T9MQAAAAAAAAAAAAAAAAAAAAB:wMLDAQeM62V7pTMAquyQZQ>: Fatal: master: service(indexer-worker): child 3384699 killed with signal 6 (core dumped)

I haven't had the chance to investigate what's going on yet, but maybe this is a known issue?

allexmail commented 3 months ago

You may have interrupted indexing while indexing on this user. I would delete the entire fts-flatcurve folder in the user's mailbox and start indexing again. I had something similar when I forcibly interrupted indexing.

slusarz commented 3 months ago

Duplicate of #44

Somebody needs to provide a testcase to reproduce, since I can't.

This is the whole point of https://slusarz.github.io/dovecot-fts-flatcurve/configuration.html#fts_flatcurve_max_term_size so I'm not sure how this could happen. The size of a term defaults to 30 characters max, and is hardcoded to never be more than 200 characters. So no idea how something larger can be indexed.

sergiodj commented 3 months ago

You may have interrupted indexing while indexing on this user. I would delete the entire fts-flatcurve folder in the user's mailbox and start indexing again. I had something similar when I forcibly interrupted indexing.

Thanks. I tried deleting all fts-flatcurve directories from my ~/Mail dir, and then reissued a search, but I still see the problem.

sergiodj commented 3 months ago

Duplicate of #44

Somebody needs to provide a testcase to reproduce, since I can't.

This is the whole point of https://slusarz.github.io/dovecot-fts-flatcurve/configuration.html#fts_flatcurve_max_term_size so I'm not sure how this could happen. The size of a term defaults to 30 characters max, and is hardcoded to never be more than 200 characters. So no idea how something larger can be indexed.

Right. Here's my 90-fts.conf:

mail_plugins = $mail_plugins fts fts_flatcurve

plugin {
        fts = flatcurve

        fts_enforced = yes
        fts_autoindex = yes
        fts_languages = en pt
        fts_tokenizers = generic email-address
        fts_filters = lowercase normalizer-icu

        fts_flatcurve_max_term_size = 30
        fts_flatcurve_substring_search = yes
}

As I said above, I can reproduce the problem pretty easily on my mail directory, but unfortunately I don't know if there's another way to do it without having to provide my personal messages :-/.

slusarz commented 3 months ago

Does removing 'fts_flatcurve_max_term_size = 30' from your config help?

sergiodj commented 3 months ago

Unfortunately not. I still see the segmentation fault happening.

sergiodj commented 3 months ago

This is a partial backtrace:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fd67c8781cf in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fd67c82a472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fd67c8144b2 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007fd67baa0a2d in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#5  0x00007fd67bab1f5a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#6  0x00007fd67baa05d9 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#7  0x00007fd67bab21d8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7fd67bc54bc0 <typeinfo for std::bad_alloc>, dest=0x7fd67bab0510 <std::bad_alloc::~bad_alloc()>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:98
#8  0x00007fd67baa0649 in operator new (sz=sz@entry=96) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:54
#9  0x00007fd679882575 in std::__new_allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::allocate (
    this=<optimized out>, __n=1) at /usr/include/c++/13/bits/new_allocator.h:151
#10 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > > >::allocate (__n=1, __a=...)
    at /usr/include/c++/13/bits/alloc_traits.h:482
#11 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_get_node (this=0x56355cbdd140)
    at /usr/include/c++/13/bits/stl_tree.h:563
#12 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_create_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_tree.h:613
#13 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_Auto_node::_Auto_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__t=..., this=<synthetic pointer>) at /usr/include/c++/13/bits/stl_tree.h:1637
#14 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::_M_emplace_hint_unique<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (this=this@entry=0x56355cbdd140, __pos=__pos@entry={...}) at /usr/include/c++/13/bits/stl_tree.h:2462
#15 0x00007fd679881a46 in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::emplace_hint<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__pos=..., this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_map.h:638
#16 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, OmDocumentTerm> > >::insert<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, OmDocumentTerm> > (__x=..., this=0x56355cbdd140) at /usr/include/c++/13/bits/stl_map.h:860
#17 Xapian::Document::Internal::add_term (this=0x56355cbdd0d0, tname=..., wdfinc=<optimized out>) at ../api/omdocument.cc:502
#18 0x00007fd679881b0b in Xapian::Document::add_term (this=<optimized out>, tname=..., wdfinc=<optimized out>) at ../api/omdocument.cc:146
#19 0x00007fd67c9ee261 in fts_flatcurve_xapian_index_body (ctx=ctx@entry=0x563559ba0458, data=<optimized out>, size=<optimized out>) at fts-backend-flatcurve-xapian.cpp:1308
#20 0x00007fd67c9e84fa in fts_backend_flatcurve_update_build_more (_ctx=0x563559ba0458, data=<optimized out>, size=<optimized out>) at fts-backend-flatcurve.c:328
#21 0x00007fd67c741709 in fts_build_add_tokens_with_filter (ctx=ctx@entry=0x7ffc684e1740, 
    data=data@entry=0x56355a5e5400 "\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty17setGeometryColumnERK7QString@Base 3.10.2\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty20setPrimaryKeyColumnsERK11QStringList"..., size=size@entry=8138) at /build/reproducible-path/dovecot-2.3.21+dfsg1/src/plugins/fts/fts-build-mail.c:273
#22 0x00007fd67c741958 in fts_build_tokenized (last=false, size=8138, 
    data=0x56355a5e5400 "\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty17setGeometryColumnERK7QString@Base 3.10.2\n- _ZN37QgsAbstractDatabaseProviderConnection13TableProperty20setPrimaryKeyColumnsERK11QStringList"..., ctx=0x7ffc684e1740) at /build/reproducible-path/dovecot-2.3.21+dfsg1/src/plugins/fts/fts-build-mail.c:349
#23 fts_build_data (ctx=ctx@entry=0x7ffc684e1740, 

slusarz commented 3 months ago

This crash indicates that a memory allocation is failing with an out-of-memory error. Have you increased the memory limit for the indexer from the default?

Otherwise, the backtrace is not very useful, as all the function data has been optimized out. You can try compiling again with optimization disabled (e.g. -O0 -g) to see whether that provides more info in the stack trace.

I would really like to know whether a particular message is causing the crash, or whether you are simply hitting memory errors because your messages are so large. (There is nothing you can do there except increase memory or reduce shard sizes. In the out-of-memory case, a segfault is expected and is the correct behavior.)
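
For readers wondering why the log reports signal 6 rather than signal 11: the backtrace shows __cxa_throw going straight into std::terminate, which by default calls abort(), and abort() raises SIGABRT. The following is only a minimal sketch of that mechanism and of one way a caller could turn such an exception into an ordinary error return. The name run_guarded is hypothetical and is not part of flatcurve, Dovecot, or Xapian; flatcurve's actual error handling may look quite different.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <new>

// Hypothetical wrapper (illustration only, not flatcurve code).
// Runs `step` and reports failure instead of letting a C++ exception
// escape; an escaping exception with no handler ends in std::terminate.
bool run_guarded(const std::function<void()> &step) noexcept {
    assert(static_cast<bool>(step)); // caller must pass a callable
    try {
        step();
        return true;
    } catch (const std::bad_alloc &) {
        return false; // an allocation failed
    } catch (const std::exception &) {
        return false; // any other standard exception
    }
}
```

Whether catching std::bad_alloc is desirable at all is a design choice: once the process has hit its memory limit, aborting may well be the safer outcome, as noted above.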

sergiodj commented 3 months ago

If by memory limit you mean the fts_flatcurve_commit_limit option, then yes, I'm using a value of 5000.

I've been meaning to recompile and continue debugging the problem further but unfortunately I don't have the time right now. Maybe on the weekend. I'll provide more info when I have it.

slusarz commented 3 months ago

Try something lower, like 1000.

If you have large messages (with lots of indexing data), Xapian can use more than 256MB (default vsz_limit) of memory, which will cause out-of-memory issues. A lower number will ensure that less memory is used before the data is swapped to disk, at the expense of additional I/O.

slusarz commented 3 months ago

Actually, commit_limit might be an even better setting to lower.

https://slusarz.github.io/dovecot-fts-flatcurve/configuration.html#fts_flatcurve_commit_limit

GustavoSatig commented 3 months ago

I've had a similar problem.

These are the errors:

Apr 01 23:04:41 indexer-worker(user@domain.com.br)<31534><8JOINDQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Lixeira): Could not write message data: uid=58763; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/cancelaraviso/1853889ed111cdeac6a85a4a58191e0d3f9ed30553562d76a8fd0e4536e9e105641dd4543d3dcf835d04e911bfc64949b5350821ce593f792cde1c2d1b2b1da2/njjjnmqwztmtnjayyy00mjvkltk0otatmte2ytg0mwe3m2u5.html?email=user@domain.com.br

Apr 01 23:05:24 indexer-worker(usera@domainc.com.br)<31534><CFzBJWQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Enviadas): Could not write message data: uid=4446; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/cancelaraviso/0bec6827d1b4984841266bf56c7ecd6b6f554679213f8826abeedd6ba9e3b6556d088a98515f715b0e899112bd75a4080959fb6b779e89d30337130871fbbe39/yjvlntqzngytmjnjmi00ndi4lwi5ztitogizyzhmntzjmwu0.html?email=user@domainc.com.br

Apr 01 23:05:25 indexer-worker(usera@domainc.com.br)<31534><CFzBJWQdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(Enviadas): Could not write message data: uid=4560; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/viewblob/959e8ee84f7e63458bb0cb91167ce58e865b21b14a0c0c03fbdb8d28f3322f7fbcc0ec8716c8a3b8d3083ef7bba2fcd4b8764c42472ecd30100b74223ab15ffa/zdgwzthhy2etmduzmy00ngfhlwjiogqtzwvjzmq0mwezmjyz.html?email=user@domaind.com.br

Apr 01 23:05:59 indexer-worker(user@domainc.com.br)<31534><IDKFN4YdMmQuewAAyMdWHQ>: Warning: fts-flatcurve(INBOX/CONTRATOS): Could not write message data: uid=17; InvalidArgumentError: Term too long (> 245): //secure.domain.com.br/email/viewblob/b5fbfdb70178e658c46cfa532031624305195d3c2780c6033d6c65f8a16f960557918aa5e9f679deb27c0b531df0107b6d6c59f5f86405e6b397efe9d0481dec/ntg0ngexnmitmje4yi00zguxlthintqtztdlodnlzta0mwjm.html?email=user@domainf.com.br

Here is my 90-fts.conf

plugin {
  fts = flatcurve
  fts_autoindex = yes
  fts_enforced = yes
  fts_languages = pt en es
  fts_tokenizer_generic = algorithm=simple
  fts_tokenizers = generic email-address
  fts_filters = normalizer-icu lowercase stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords
  fts_flatcurve_commit_limit = 50
  fts_flatcurve_max_term_size = 30
  fts_flatcurve_min_term_size = 2
  fts_flatcurve_substring_search = no
  fts_index_timeout = 60s
  fts_header_excludes = *
  fts_header_includes = Date From To Cc Bcc Subject Content-Type
  fts_autoindex_max_recent_msgs = 100
}

edieterich commented 2 months ago

I think that Xapian::Utf8Iterator::raw() doesn't work as expected. The following is with Xapian 1.4.22 on Debian 12.

I have fts_flatcurve_max_term_size set to 20:

doveconf | grep fts_flatcurve_max_term_size
  fts_flatcurve_max_term_size = 20

Test mail (no headers, just a body):


gesangsvereinsbuchausleihgesellschaft
1234567890123456789012345678901234567890

I added some debug logging in fts_flatcurve_xapian_index_body():

diff --git a/src/fts-backend-flatcurve-xapian.cpp b/src/fts-backend-flatcurve-xapian.cpp
index 77d8aaa..bb62041 100644
--- a/src/fts-backend-flatcurve-xapian.cpp
+++ b/src/fts-backend-flatcurve-xapian.cpp
@@ -1299,6 +1299,9 @@ fts_flatcurve_xapian_index_body(struct flatcurve_fts_backend_update_context *ctx
        do {
                std::string t (ustr.raw());

+               i_debug("data=%s size=%lu t=%s ustr.raw()=%s ustr.left()=%lu",
+                       (const char *)data, (unsigned long)size, t.c_str(), ustr.raw(), (unsigned long)ustr.left());
+
                /* Capital ASCII letters at the beginning of a Xapian term are
                 * treated as a "term prefix". Check for a leading ASCII
                 * capital, and lowercase if necessary, to ensure the term

Delivering the mail:

doveadm save -u user < mail

Dumping the index:

doveadm fts-flatcurve dump -u user INBOX
123456789012345678901234567890 count=1
gesangsvereinsbuchausleihgesel count=1

It saved 30 characters for both words, not 20.

Debug logging:

Debug: data=gesangsvereinsbuchausleihgesel size=20 t=gesangsvereinsbuchausleihgesel ustr.raw()=gesangsvereinsbuchausleihgesel ustr.left()=20
Debug: data=123456789012345678901234567890 size=20 t=123456789012345678901234567890 ustr.raw()=123456789012345678901234567890 ustr.left()=20

So Dovecot core sent 30 characters (data).

ustr = Xapian::Utf8Iterator((const char *)data, size);

This limits the iterator to 20 characters (ustr.left()), but not the raw data of the iterator (ustr.raw()).
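
The effect can be reproduced without Xapian at all, because it comes down to how std::string is built from a raw pointer. This is a minimal sketch, assuming the buffer is NUL-terminated as it appears to be in the debug output above; term_unbounded and term_bounded are hypothetical names, not flatcurve functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers (illustration only, not flatcurve code).

// Mirrors `std::string t(ustr.raw())`: copies up to the first NUL byte
// and ignores the length the caller passed in.
std::string term_unbounded(const char *data, std::size_t /*size*/) {
    assert(data != nullptr);
    return std::string(data);
}

// Copies exactly `size` bytes, so the caller's limit is honoured.
// The buffer must hold at least `size` bytes.
std::string term_bounded(const char *data, std::size_t size) {
    assert(data != nullptr);
    return std::string(data, size);
}
```

With the 30-character test word and size=20, term_unbounded returns all 30 characters while term_bounded returns only the first 20.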

I don't know where the limit of 30 characters from Dovecot core comes from. Maybe there is a scenario where Dovecot core sends more than 30 characters, more than 245 even?

slusarz commented 2 months ago

Thank you for the debugging help @edieterich ... you are correct that the Utf8Iterator usage does not appear to be correct, and that should be looked at.

...but with that being said, this is still irrelevant for purposes of debugging this ticket. As you noted, it's not just flatcurve that does max term limitation. It's also tokenization. By default, generic tokenizer limits to 30-byte long tokens (see maxlen):

https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin_setting-fts-fts_tokenizers

So by the time flatcurve processes the string, there have been 2 limitation gates that prevent it from being more than 30 bytes long.

If I set fts_tokenizer_generic = algorithm=simple maxlen=500, I can trigger a too long error:

May 03 17:41:00 indexer-worker(user)<357><GKgOOZAXSqR/AAAB:uDz6EKwhNWZlAQAAxGKukQ>: Warning: fts-flatcurve(INBOX): Could not write message data: uid=4; InvalidArgumentError: Term too long (> 245): aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

But this is expected behavior, and no crash/segfault. (This is on the test Debian image, which is currently bullseye-slim).

So I see two things that can be improved, but don't have anything to do with segfaults:

The only additional thing I can think of to investigate is whether the Dovecot fts tokenizer might be splitting a UTF-8 character when it enforces the max length, and whether that causes problems. But that would be a core issue, not a flatcurve one.

slusarz commented 2 months ago

First, answering my own question, but the generic tokenizer IS UTF-8 aware and will correctly handle a split UTF-8 character at the split point.

Also, it turns out the Dovecot team already removed use of Utf8Iterator in the new 2.4 code similarly to how I changed things here.

Anyway, maybe somebody can test this new code. Since I can't reproduce the original issue, I have no idea whether this helps or not.

slusarz commented 1 month ago

This code was committed, and a new release was pushed almost a month ago. Haven't heard any response in this ticket, so the assumption is that these changes fixed the issue(s). Closing ticket.

sergiodj commented 1 month ago

I just tried compiling the master branch, and unfortunately I'm seeing errors when indexing my INBOX:

May 29 20:44:43 dovecot[2965329]: indexer-worker: Error: terminate called after throwing an instance of 'std::length_error'
May 29 20:44:43 dovecot[2965329]: indexer-worker: Error:   what():  basic_string::_M_create
May 29 20:44:43 dovecot[2965329]: imap(user)<2965343><17URJqEZ3IoAAAAAAAAAAAAAAAAAAAAB>: Error: Mailbox INBOX: indexer failed to index mailbox
May 29 20:44:43 dovecot[2965329]: indexer-worker(user)<2965345><17URJqEZ3IoAAAAAAAAAAAAAAAAAAAAB:fI6PGJPLV2ZhPy0AquyQZQ>: Fatal: master: service(indexer-worker): child 2965345 killed with signal 6 (core dumped)

The message is different than the one I was seeing before, but the outcome is still a segfault.

edieterich commented 2 weeks ago

The generic tokenizer doesn't always respect maxlen and UTF-8 character boundaries. Take the attached crash_data.txt file. I'm not sure what this is, but it's part of a mail that crashed Flatcurve with the same error as sergiodj's.

maxlen is the default, 30 characters:

doveadm fts tokenize 1234567890123456789012345678901234567890
123456789012345678901234567890

With crash_data.txt I get a token longer than 30 characters:

doveadm fts tokenize "$(cat crash_data.txt)"
������6���
�����
����������
������6�����������������������������?�����@������������������������������

To crash Flatcurve, configure substring search, otherwise it doesn't crash:

plugin {
  fts_flatcurve_substring_search = yes
}

Now save crash_data.txt as a header or a body to crash Flatcurve in fts_flatcurve_xapian_index_header or fts_flatcurve_xapian_index_body:

echo "Subject: $(cat crash_data.txt)" | sudo doveadm save -u user
echo -e "\r\n$(cat crash_data.txt)" | sudo doveadm save -u user

I added some debug logging in fts_flatcurve_xapian_index_body just before and after size -= csize;:

...
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 200
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 197
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 197
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 194
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
...
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header csize: 3
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size before: 2
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Error: fts_flatcurve_xapian_index_header size after: 4294967295
Jun 19 16:45:17 indexer-worker: Error: terminate called after throwing an instance of 'std::length_error'
Jun 19 16:45:17 indexer-worker: Error:   what():  basic_string::_M_create
Jun 19 16:45:17 indexer-worker(user)<345624><QmRkJf3ucmYURgUAaoL48g:AdCBJv3ucmYYRgUAaoL48g>: Fatal: master: service(indexer-worker): child 345624 killed with signal 6 (core dumped)

So I get a 200-byte token (instead of the expected maxlen of 30) that is cut off 2 bytes into its last 3-byte UTF-8 character. Subtracting csize (3) from the remaining size (2) then underflows the unsigned integer, which leads to the crash.

I don't think there's much you can do except for preventing the unsigned integer underflow. The real problem is in the tokenizer.
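
As a rough illustration of the underflow and of a guard against it, here is a minimal sketch. The function names are hypothetical, and the 32-bit width is an assumption inferred from the logged value 4294967295; the real variable in flatcurve may have a different type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (illustration only, not flatcurve code).

// Unchecked version of `size -= csize`: unsigned arithmetic wraps
// modulo 2^32, so 2 - 3 becomes 4294967295 instead of -1.
std::uint32_t consume_unchecked(std::uint32_t size, std::uint32_t csize) {
    return static_cast<std::uint32_t>(size - csize);
}

// Guarded version: if the character claims more bytes than remain,
// treat the input as exhausted instead of wrapping around.
std::uint32_t consume_checked(std::uint32_t size, std::uint32_t csize) {
    std::uint32_t remaining = (csize > size) ? 0 : size - csize;
    assert(remaining <= size); // the remainder can never grow
    return remaining;
}
```

With the values from the log, consume_unchecked(2, 3) wraps to 4294967295, while consume_checked(2, 3) returns 0 and lets the loop stop.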

edieterich commented 2 weeks ago

Splitting the last multibyte UTF-8 character is caused by fts_backend_flatcurve_update_build_more:

size = I_MIN(size, FTS_FLATCURVE_MAX_TERM_SIZE);
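
One possible alternative to a plain byte-count clamp is to back the cut up to the nearest UTF-8 character boundary. The following is only a sketch of that idea, assuming the input is valid UTF-8; utf8_truncate_len is a hypothetical name and is not an existing Dovecot or flatcurve function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (illustration only, not flatcurve code).
// Returns the largest length <= max_len that does not cut a UTF-8
// sequence in half. Assumes `data` holds valid UTF-8.
std::size_t utf8_truncate_len(const char *data, std::size_t size,
                              std::size_t max_len) {
    assert(data != nullptr || size == 0);
    if (size <= max_len)
        return size; // nothing to cut
    const unsigned char *p = reinterpret_cast<const unsigned char *>(data);
    std::size_t len = max_len;
    // Continuation bytes have the form 10xxxxxx. If the first byte past
    // the cut is one, the cut lies inside a character, so step back
    // until the cut sits just before that character's lead byte.
    while (len > 0 && (p[len] & 0xC0) == 0x80)
        --len;
    return len;
}
```

For the 5-byte string "ab€" (0x61 0x62 0xE2 0x82 0xAC), a limit of 3 or 4 yields 2, so the 3-byte euro sign is dropped whole rather than split.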