grosjo / fts-xapian

Dovecot FTS plugin based on Xapian
GNU Lesser General Public License v2.1

1.4.9: accent conversion fails with ICU 69 (ICU 69 bug?) #84

Closed nickalcock closed 3 years ago

nickalcock commented 3 years ago

This is with the tip of the Xapian RELEASE/1.4 branch, with what seems to be the recommended configuration, or close to it, upon doing an initial index of my INBOX:

plugin {
  fts = xapian
  fts_xapian = partial=3 full=20 verbose=0
  fts_enforced = body
  fts_languages = en
  fts_language_config = /usr/share/libexttextcat/dovecot.conf
  fts_decoder = decode2text
}

service decode2text {
  executable = script /usr/libexec/dovecot/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}

Backtrace:

#0  XNGram::add (this=this@entry=0x558e1a1b9c20, d=d@entry=0x7fff4f7cfce0)
    at /usr/src/dovecot-fts-xapian/x86_64-loom/src/fts-backend-xapian-functions.cpp:377
#1  0x00007f8a9f48c495 in XNGram::add (s=0x558d9e4ba614 "insanely", this=0x558e1a1b9c20)
    at /usr/src/dovecot-fts-xapian/x86_64-loom/src/fts-backend-xapian-functions.cpp:356
#2  fts_backend_xapian_index_text (backend=, uid=, field=, data=)
    at /usr/src/dovecot-fts-xapian/x86_64-loom/src/fts-backend-xapian-functions.cpp:1157
#3  0x00007f8a9f48d875 in fts_backend_xapian_update_build_more (_ctx=0x558d33540570, data=, size=)
    at /usr/src/dovecot-fts-xapian/x86_64-loom/src/fts-backend-xapian.cpp:528
#4  0x00007f8aa2831d70 in fts_build_full_words (last=false, size=,
    data=0x558dd2666120 ">> Or maybe added. The whole system was getting on for worthy of WTF\n>> anyway. For some reason they decided to repla)
    at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-build-mail.c:402
#5  fts_build_data (ctx=0x7fff4f7d0020, data=, size=, last=)
    at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-build-mail.c:423
#6  0x00007f8aa2832577 in fts_build_body_block (last=false, block=0x7fff4f7cffb0, ctx=0x7fff4f7d0020)
    at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-build-mail.c:434
#7  fts_build_mail_real (may_need_retry_r=0x7fff4f7cff53, retriable_err_msg_r=0x7fff4f7cff60, mail=0x7fff4f7cff90, update_ctx=0x558d33540570)
    at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-build-mail.c:579
#8  fts_build_mail (update_ctx=0x558d33540570, mail=mail@entry=0x558d3353fb78)
    at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-build-mail.c:618
#9  0x00007f8aa2837f3e in fts_mail_index (_mail=0x558d3353fb78) at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-storage.c:550
#10 fts_mail_precache (_mail=0x558d3353fb78) at /usr/src/dovecot/x86_64-loom/src/plugins/fts/fts-storage.c:571
#11 0x00007f8aa33213ce in mail_precache (mail=0x558d3353fb78) at /usr/src/dovecot/x86_64-loom/src/lib-storage/mail.c:453
#12 0x0000558d32e9ba1a in master_connection_input ()
#13 0x00007f8aa315c0b8 in io_loop_call_io (io=0x558d334aa2f0) at /usr/src/dovecot/x86_64-loom/src/lib/ioloop.c:714
#14 0x00007f8aa315d722 in io_loop_handler_run_internal (ioloop=ioloop@entry=0x558d3349e250)
    at /usr/src/dovecot/x86_64-loom/src/lib/ioloop-epoll.c:222
#15 0x00007f8aa315c161 in io_loop_handler_run (ioloop=0x558d3349e250) at /usr/src/dovecot/x86_64-loom/src/lib/ioloop.c:766
#16 0x00007f8aa315c320 in io_loop_run (ioloop=0x558d3349e250) at /usr/src/dovecot/x86_64-loom/src/lib/ioloop.c:739
#17 0x00007f8aa30d1623 in master_service_run (service=0x558d3349e0b0, callback=)
    at /usr/src/dovecot/x86_64-loom/src/lib-master/master-service.c:853
#18 0x0000558d32e9b3cd in main ()

(gdb) print accentsConverter
$1 = (icu_69::Transliterator *) 0x0
(gdb) print status
$3 = U_INVALID_ID

At the very least you should be checking to see whether the transliterator could be created.

This might very well be a bug in ICU, but I do wonder why it would have succeeded probably thousands of times and only now failed. Perhaps calling createInstance this often (rather than createInstance once and getInstance in future) causes some sort of resource exhaustion?
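To make the suggestion concrete, here is a rough sketch of the "create once, check the result, reuse afterwards" pattern. It is not the plugin's actual code: FakeTransliterator, create_transliterator, get_accents_converter and normalise are hypothetical names, and FakeTransliterator only stands in for icu::Transliterator so that the example builds without ICU. In real code the factory would presumably be icu::Transliterator::createInstance(id, UTRANS_FORWARD, status), followed by a check of U_FAILURE(status) and of the returned pointer.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for icu::Transliterator (assumption: the real class
// is replaced here so the sketch compiles without ICU installed).
struct FakeTransliterator {
    std::string id;
    // A real converter would strip accents; this one returns its input as-is.
    std::string transliterate(const std::string &s) const { return s; }
};

// Counts how often the (supposedly expensive) factory is invoked.
int g_creations = 0;

// Hypothetical fallible factory: returns nullptr for an ID it rejects,
// mimicking a createInstance() call that reports U_INVALID_ID.
std::unique_ptr<FakeTransliterator> create_transliterator(const std::string &id) {
    ++g_creations;
    if (id.empty()) return nullptr;
    auto t = std::make_unique<FakeTransliterator>();
    t->id = id;
    return t;
}

// Returns the cached converter, creating it only while none exists yet.
// Returns nullptr if creation failed; a later call will then try again.
// Simplification: a single converter is cached, whatever ID created it.
const FakeTransliterator *get_accents_converter(const std::string &id) {
    static std::unique_ptr<FakeTransliterator> cached;
    if (!cached) cached = create_transliterator(id);
    return cached.get();
}

// Converts a word if a converter is available; otherwise returns the word
// unchanged instead of dereferencing a null pointer.
std::string normalise(const std::string &word, const std::string &id) {
    const FakeTransliterator *t = get_accents_converter(id);
    if (t == nullptr) return word;
    return t->transliterate(word);
}
```

With something like this, a failed creation degrades to indexing the word without accent conversion rather than crashing, and the factory stops being called once a converter exists.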

grosjo commented 3 years ago

Noted. Please try again with the latest git.

nickalcock commented 3 years ago

Yep, caching the transliterator works much better and the crash is gone! It's about 50% faster at bulk indexes of mostly-smallish emails now, too, even with the decode2text snail script slowing everything down, and even given that spinning rust is involved. I suspect a profile would have shown about 90% of the thing's time was being spent creating transliterators, or that creating transliterators slows as O(n^2) as more are created, or something like that :)

Also, holy crap is this thing a lot faster than lucene. Like, instantaneous. Well worth paying the 50% index size increase...