grosjo / fts-xapian

Dovecot FTS plugin based on Xapian
GNU Lesser General Public License v2.1

Crash while indexing - terminate called after throwing an instance of 'std::bad_alloc' #59

Closed: richardshanasy closed this issue 3 years ago

richardshanasy commented 3 years ago

Hi guys, fts-xapian is awesome, so much easier to set up. It's working with all the accounts except one: mine. I've got 50,000+ emails in the "All Mail" folder, and indexing it causes a crash. Below is an extract from mail.log. I tried to find the core dump (it isn't in /var/crash). Any help would be much appreciated; even if you can just help me find the problem email, I'm happy to delete it.

Cheers,

Rich

Oct 17 19:42:35 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: fts_backend_xapian_update_build_more
Oct 17 19:42:35 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: Query= uid:"192"
Oct 17 19:42:35 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: NGRAM(body,XBDY) -> 652 items, max length=30, (total 6 KB)
Oct 17 19:42:36 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: fts_backend_xapian_update_build_more
Oct 17 19:42:36 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: Query= uid:"192"
Oct 17 19:42:45 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: NGRAM(body,XBDY) -> 65794 items, max length=62, (total 711 KB)
Oct 17 19:42:45 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: fts_backend_xapian_update_build_more
Oct 17 19:42:45 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: Query= uid:"192"
Oct 17 19:42:45 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: NGRAM(body,XBDY) -> 10 items, max length=6, (total 0 KB)
Oct 17 19:42:46 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: fts_backend_xapian_update_build_more
Oct 17 19:42:46 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: Query= uid:"192"
Oct 17 19:42:55 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: FTS Xapian: NGRAM(body,XBDY) -> 66798 items, max length=63, (total 722 KB)
Oct 17 19:42:55 mail dovecot: indexer-worker: Error: terminate called after throwing an instance of 'std::bad_alloc'
Oct 17 19:42:55 mail dovecot: indexer-worker: Error:   what():  std::bad_alloc
Oct 17 19:42:56 mail dovecot: indexer: Error: Indexer worker disconnected, discarding 1 requests for r
Oct 17 19:42:56 mail dovecot: indexer-worker(r)<45725><b8X/2dmx9K1/AAAB:UlT8Ilquil+dsgAAZU03Dg>: Fatal: master: service(indexer-worker): child 45725 killed with signal 6 (core dumped)
Oct 17 19:42:56 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: Error: indexer failed to index mailbox All Mail
Oct 17 19:43:15 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: Logged out in=78 out=1244 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
Oct 17 19:43:15 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: FTS Xapian: fts_backend_xapian_deinit
Oct 17 19:43:15 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: Deinit /home/r/mail/.imap/xapian-indexes
Oct 17 19:43:15 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: FTS Xapian: Unset box 'All Mail' (e8ee520f6aaa795f17590000654d370e)
Oct 17 19:43:15 mail dovecot: imap(r)<45720><b8X/2dmx9K1/AAAB>: FTS Xapian: Committed 'unset_box' in 0 ms
ams001 commented 3 years ago

@grosjo Yes, I am interested - I agree it is fast at searching and at adding to the index. But the amount of memory and storage for the indexes is pretty high. FYI, one user's account is 5.7GB but the xapian index was 11.2GB - under solr their search index is 626MB. Plus, as mentioned, the time for the initial index is very high.

At the site I am currently working with, the entire mail usage is 31.7GB and the xapian index totalled 30.8GB. The same site's solr index totals 2.7GB. Yes, disk space is cheap, but...

I am suspicious of the disk usage of fts-xapian and suspect it is not cleaning up after the errors, which may be why the size is so high. I would be interested in other comparisons.

I am also loath to set vsz_limit to 0. As mentioned previously, I killed indexer-worker when it was consuming 12.7GB of the 16GB of physical memory. That's not acceptable on a small production server. I know a lot of other VM hosts where we co-locate provide 4GB per host.

grosjo commented 3 years ago

OK, please try the latest git.

To test it, you should set

service indexer-worker {
    vsz_limit = 0
}

In addition, I added a "maxmem=" parameter to the config. For instance, maxmem=3.2 means "no more than 3.2 GB".
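
For illustration, the combined setup might look like this in dovecot.conf. This is only a sketch: it assumes maxmem goes into the fts_xapian option string alongside the partial/full/verbose settings seen in the logs, which is my reading of this comment rather than documented syntax.

plugin {
  fts = xapian
  fts_xapian = partial=3 full=20 verbose=1 maxmem=3.2
}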

ams001 commented 3 years ago

Pulled, rebuilt, added maxmem=4.0, set vsz_limit = 0, and restarted an index.

Oddly, it does not report the maxmem parameter:

doveadm(aaaaa@example.com): Info: FTS Xapian: Starting with partial=3 full=20 attachments=0 verbose=1
doveadm(aaaaa@example.com): Fatal: write(indexer) failed: Resource temporarily unavailable

I went back and changed partial from 4 to 3 and restarted just to be sure that I really had made the change.

Note: the "Fatal: write(indexer) failed" message disappears when I switch to solr, so it's definitely an fts-xapian thing.

grosjo commented 3 years ago

@ams001 I fixed the display

for the other issue, let's talk about it in https://github.com/grosjo/fts-xapian/issues/62

ams001 commented 3 years ago

I've just updated git, rebuilt, and doveadm now cores:

Core was generated by `doveadm index -u aaaaa@example.com *'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f55b0c69f5b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f55b0c69f5b in fgets () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f55ae6f1ae6 in fts_backend_xapian_memory_used () at fts-backend-xapian-functions.cpp:480
#2  0x00007f55ae6f8207 in fts_backend_xapian_test_memory (m=4194304) at fts-backend-xapian-functions.cpp:510
#3  fts_backend_xapian_update_build_more (_ctx=0x55e492a828e0, data=<optimized out>, size=<optimized out>) at fts-backend-xapian.cpp:519
#4  0x00007f55b0b9bc4b in ?? () from /usr/lib/dovecot/modules/lib20_fts_plugin.so
#5  0x00007f55b0b9c4bd in ?? () from /usr/lib/dovecot/modules/lib20_fts_plugin.so
#6  0x00007f55b0b9cbbd in fts_build_mail () from /usr/lib/dovecot/modules/lib20_fts_plugin.so
#7  0x00007f55b0ba28b0 in ?? () from /usr/lib/dovecot/modules/lib20_fts_plugin.so
#8  0x00007f55b1173344 in mail_precache () from /usr/lib/dovecot/libdovecot-storage.so.0
#9  0x000055e491bf054f in ?? ()
#10 0x000055e491bea93f in ?? ()
#11 0x000055e491beb58d in ?? ()
#12 0x000055e491bec389 in doveadm_cmd_ver2_to_mail_cmd_wrapper ()
#13 0x000055e491bfcb1d in doveadm_cmd_run_ver2 ()
#14 0x000055e491bfcb77 in doveadm_cmd_try_run_ver2 ()
#15 0x000055e491bdbb21 in main ()
(gdb) 

The log:

doveadm(aaaaa@example.com): Info: FTS Xapian: Query= uid:"334"
doveadm(aaaaa@example.com): Info: FTS Xapian: Ngram(A) -> 70 items (total 0 KB)
doveadm(aaaaa@example.com): Info: FTS Xapian: Memory stats : Used = 93 MB (2%), Limit = 2730 MB (66% of 4096 MB)
doveadm(aaaaa@example.com): Info: FTS Xapian: Query= uid:"334"
doveadm(aaaaa@example.com): Info: FTS Xapian: Ngram(XTO) -> 97 items (total 0 KB)
doveadm(aaaaa@example.com): Info: FTS Xapian: Memory stats : Used = 93 MB (2%), Limit = 2730 MB (66% of 4096 MB)
doveadm(aaaaa@example.com): Info: FTS Xapian: Query= uid:"334"
doveadm(aaaaa@example.com): Info: FTS Xapian: Ngram(XMID) -> 6 items (total 0 KB)
Segmentation fault (core dumped)

I also noted thousands of file handles being leaked - you need an fclose(file) at src/fts-backend-xapian-functions.cpp:488.
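
For illustration, a minimal sketch of the pattern being described, with the fopen() check and the fclose() in place. The helper name and the VmRSS field are assumptions for the example; fts_backend_xapian_memory_used may parse a different field of /proc/self/status.

#include <cstdio>
#include <cstring>
#include <cstdlib>

/* Returns resident memory in kB, or -1 if /proc/self/status is unreadable. */
static long memory_used_kb()
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)      /* without this check, fgets() on NULL crashes */
        return -1;

    long kb = -1;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
        {
            kb = strtol(line + 6, NULL, 10);
            break;
        }
    }
    fclose(f);          /* the missing fclose() leaked one handle per call */
    return kb;
}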

ams001 commented 3 years ago

@grosjo I've just finished reviewing the logs prior to my update (and core dump report) this morning, and something is still not right.

I've been looking at the sizes of the xapian-indexes directories for each user, and they vary by several hundred MB (occasionally into the GB) between runs each time I clear and reindex. I'm guessing here, but I think what's happening is that because auto-indexing is on, when an email is delivered to a mailbox that is currently being indexed, the indexing is restarted but there is no cleanup, and some files are left behind. This guess is based on the fact that overnight runs, when email is quieter, tended to use less disk space for the index. During busier periods, the index sizes grew unreasonably large. For example, last night's run on one user resulted in an index size of 837MB, whereas previously their index size was 11.2GB. Under solr the index size is 626MB, so comparable.

grosjo commented 3 years ago

Can you run "cat /proc/self/status" on the command line on your server?

ams001 commented 3 years ago
# cat /proc/self/status
Name:   cat
Umask:  0022
State:  R (running)
Tgid:   6886
Ngid:   0
Pid:    6886
PPid:   10897
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 256
Groups: 0 
NStgid: 6886
NSpid:  6886
NSpgid: 6886
NSsid:  10897
VmPeak:     5400 kB
VmSize:     5400 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       688 kB
VmRSS:       688 kB
RssAnon:          64 kB
RssFile:         624 kB
RssShmem:          0 kB
VmData:      312 kB
VmStk:       132 kB
VmExe:        28 kB
VmLib:      1432 kB
VmPTE:        48 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
Threads:    1
SigQ:   3/63819
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    0
Speculation_Store_Bypass:   thread vulnerable
Cpus_allowed:   ff
Cpus_allowed_list:  0-7
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    1
nonvoluntary_ctxt_switches: 0
grosjo commented 3 years ago

So, how come you get a core dump?

grosjo commented 3 years ago

I added a check on the fopen, and added the missing fclose. Would that be the cause of the core dump? (Too many file handles?)

ams001 commented 3 years ago

I added a check on the fopen, and added the missing fclose. Would that be the cause of the core dump? (Too many file handles?)

Yes, almost certainly. If fopen() fails because there are no file handles left, it returns NULL, and fgets() being called with that NULL FILE pointer is the culprit.

grosjo commented 3 years ago

OK.

So is low memory properly handled on your server now?

(Set maxmem=1, for instance, and relaunch indexing of a large mailbox.)

grosjo commented 3 years ago

@grosjo I've just finished reviewing the logs prior to my update (and core dump report) this morning, and something is still not right.

I've been looking at the sizes of the xapian-indexes directories for each user, and they vary by several hundred MB (occasionally into the GB) between runs each time I clear and reindex. I'm guessing here, but I think what's happening is that because auto-indexing is on, when an email is delivered to a mailbox that is currently being indexed, the indexing is restarted but there is no cleanup, and some files are left behind. This guess is based on the fact that overnight runs, when email is quieter, tended to use less disk space for the index. During busier periods, the index sizes grew unreasonably large. For example, last night's run on one user resulted in an index size of 837MB, whereas previously their index size was 11.2GB. Under solr the index size is 626MB, so comparable.

1. To clean up your indexes, use the "rescan" command in doveadm (it just deletes the xapian-indexes folder); see the example after this list.
2. Yes, the disk space is much larger than solr's. This is how it is, and I cannot change Xapian. This plugin is targeted to be simple, fast and maintenance-free; larger email providers should probably dig into solr or another industrial setup. I won't change that.
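
For reference, a typical rescan-then-reindex sequence looks like this (the username is a placeholder):

doveadm fts rescan -u user@example.com
doveadm index -u user@example.com '*'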

grosjo commented 3 years ago

I removed the maxmem setting and used an internal function to determine the vsz value.

Can you test it now? (Put vsz_limit rather low and launch a rescan of the boxes.)

ams001 commented 3 years ago

I removed the maxmem setting and used an internal function to determine the vsz value.

Can you test it now? (Put vsz_limit rather low and launch a rescan of the boxes.)

I'm testing it now.

1. To clean up your indexes, use the "rescan" command in doveadm (it just deletes the xapian-indexes folder).

That is what I have been doing. I've also got a script that periodically runs du -sm on the xapian index directories and logs their usage so I can track the disk space - something along the lines of the snippet below.
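
A one-liner in the spirit of that monitoring script might look like the following; the mail root and log file paths are assumptions, since the actual script isn't shown in this thread.

du -sm /var/vmail/*/*/xapian-indexes | sort -rn >> /var/log/xapian-index-sizes.log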

2. Yes, the disk space is much larger than solr's. This is how it is, and I cannot change Xapian. This plugin is targeted to be simple, fast and maintenance-free; larger email providers should probably dig into solr or another industrial setup. I won't change that.

I think you may have misunderstood my concern. It's not just the size, it's the variation in size between test runs with almost identical data: I have reindexed the same (idle; they are on vacation) user's mailbox, all folders, several times. On each run, for the same user, the resulting size of the xapian index directory has varied greatly, from a couple of hundred MB to over 10GB, up and down and back again. This is the same data, maybe with a couple more emails each time. I would expect the initial xapian index size to be roughly the same each time, not a couple of GB less, or more, between runs.

I mentioned previously that one user's email account was 5.7GB but the xapian index was 11.2GB. Between runs, the index sizes after the initial index were 2.9GB, 8.8GB, 4.7GB, 11.2GB, 5.6GB and 824MB. Yes, your code was slightly different each time, and it could well be these changes had an effect, but just in case not, I am going to reindex all the users several times over this weekend and compare each resulting xapian index size. I would expect the sizes to be roughly the same, with differences only from additional emails received.

It's also not just one user. It's all of them.

Maybe it's garbage collection of some kind; I don't know. But the variation in the xapian index sizes is enough to stand out and make me raise an eyebrow.

grosjo commented 3 years ago

For the size change, there are two factors.

Also, Xapian has a "compact" process for the database, but I am not sure when it is triggered (I will dig into that).

slavkoja commented 3 years ago

@grosjo see https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#compacting-a-database

From memory, compacting the database is possible, but Xapian doesn't do it automatically and doesn't particularly recommend it, as it reuses DB space from deleted items (data). One can use the xapian-compact tool; I have used it several times (not Dovecot related) and the DB shrink was significant (I have no exact numbers), but temporary -- as my content changes often...
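
As a sketch, an offline compaction writes a smaller copy of the database to a new directory (the source path here is illustrative; fts-xapian's actual index directory names differ per mailbox):

xapian-compact /home/r/mail/.imap/xapian-indexes/db_example /tmp/db_compacted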

grosjo commented 3 years ago

@ams001 @richardshanasy Are we good on this issue ?

ams001 commented 3 years ago

Not from my side, unfortunately. You folks are too quick to assume things are fixed. I have a dataset that takes around 36 hours to process, so if you would like feedback from me, you will have to wait.

Two more of these errors, the last of which generated a core:

Nov  8 00:50:22 fawkes dovecot: indexer-worker(aaaaa@example.com)<2573><yBz3D0AZp18NCgAA+4JX8g>: Error: FTS Xapian: Memory error 'std::bad_alloc'
...
Nov  8 03:42:33 fawkes dovecot: indexer-worker(bbbbb@example.com)<24232><eC2AFgJop1+oXgAA+4JX8g>: Error: FTS Xapian: Memory error 'std::bad_alloc'
Nov  8 03:42:34 fawkes dovecot: indexer-worker(bbbbb@example.com)<24232><eC2AFgJop1+oXgAA+4JX8g>: Error: Folder: Mail search failed: Internal error occurred. Refer to server log for more information. [2020-11-08 03:37:38]
Nov  8 03:42:34 fawkes dovecot: indexer-worker(bbbbb@example.com)<24232><eC2AFgJop1+oXgAA+4JX8g>: Error: Folder: Transaction commit failed: FTS transaction commit failed: transaction context (attempted to index 2085 messages (UIDs 1..2085))
Nov  8 03:42:35 fawkes dovecot: indexer-worker(bbbbb@example.com)<24232><AEMlMytpp1+oXgAA+4JX8g>: Error: mmap(size=635680) failed with file /var/vmail/example.com/bbbbb/.Inboxes.lwIP/dovecot.index: Cannot allocate memory
Nov  8 03:42:35 fawkes dovecot: indexer-worker(bbbbb@example.com)<24232><AEMlMytpp1+oXgAA+4JX8g>: Error: Raw backtrace: /usr/lib/dovecot/libdovecot.so.0(+0xdb62b) [0x7f759656762b] -> /usr/lib/dovecot/libdovecot.so.0(+0xdb6c1) [0x7f75965676c1] -> /usr/lib/dovecot/libdovecot.so.0(+0x4a149) [0x7f75964d6149] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0x4b602) [0x7f7596666602] -> /usr/lib/dovecot/libdovecot-storage.so.0(mail_transaction_log_move_to_memory+0xd0) [0x7f759671dda0] -> /usr/lib/dovecot/libdovecot-storage.so.0(mail_index_move_to_memory+0x60) [0x7f7596717df0] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0xfcf4f) [0x7f7596717f4f] -> /usr/lib/dovecot/libdovecot-storage.so.0(mail_index_open+0x7a) [0x7f75967180ea] -> /usr/lib/dovecot/libdovecot-storage.so.0(index_storage_mailbox_open+0xac) [0x7f75966f1b8c] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0x84e99) [0x7f759669fe99] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0x84f6f) [0x7f759669ff6f] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0xbdf6c) [0x7f75966d8f6c] -> /usr/lib/dovecot/modules/lib20_zlib_plugin.so(+0x46be) [0x7f75964826be] -> /usr/lib/dovecot/libdovecot-storage.so.0(+0x5bed6) [0x7f7596676ed6] -> /usr/lib/dovecot/libdovecot-storage.so.0(mailbox_open+0x4a) [0x7f75966770ba] -> /usr/lib/dovecot/libdovecot-storage.so.0(mailbox_sync_init+0x95) [0x7f75966774d5] -> /usr/lib/dovecot/libdovecot-storage.so.0(mailbox_sync+0x37) [0x7f7596677547] -> dovecot/indexer-worker(+0x2af7) [0x55f124432af7] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x6f) [0x7f759657dadf] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_handler_run_internal+0x136) [0x7f759657f0d6] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x4c) [0x7f759657db7c] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_run+0x40) [0x7f759657dce0] -> /usr/lib/dovecot/libdovecot.so.0(master_service_run+0x13) [0x7f75964fe0d3] -> dovecot/indexer-worker(main+0xea) [0x55f1244326aa] -> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f75962c109b] -> dovecot/indexer-worker(_start+0x2a) [0x55f12443278a]
Nov  8 03:42:36 fawkes dovecot: indexer: Error: Indexer worker disconnected, discarding 30 requests for bbbbb@example.com

And backtrace:

Core was generated by `dovecot/indexer-worker'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f75962d47bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f75962d47bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f75962bf535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f75964d6481 in ?? () from /usr/lib/dovecot/libdovecot.so.0
#3  0x00007f75965676c1 in ?? () from /usr/lib/dovecot/libdovecot.so.0
#4  0x00007f75964d6149 in i_panic () from /usr/lib/dovecot/libdovecot.so.0
#5  0x00007f7596666602 in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#6  0x00007f759671dda0 in mail_transaction_log_move_to_memory () from /usr/lib/dovecot/libdovecot-storage.so.0
#7  0x00007f7596717df0 in mail_index_move_to_memory () from /usr/lib/dovecot/libdovecot-storage.so.0
#8  0x00007f7596717f4f in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#9  0x00007f75967180ea in mail_index_open () from /usr/lib/dovecot/libdovecot-storage.so.0
#10 0x00007f75966f1b8c in index_storage_mailbox_open () from /usr/lib/dovecot/libdovecot-storage.so.0
#11 0x00007f759669fe99 in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#12 0x00007f759669ff6f in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#13 0x00007f75966d8f6c in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#14 0x00007f75964826be in ?? () from /usr/lib/dovecot/modules/lib20_zlib_plugin.so
#15 0x00007f7596676ed6 in ?? () from /usr/lib/dovecot/libdovecot-storage.so.0
#16 0x00007f75966770ba in mailbox_open () from /usr/lib/dovecot/libdovecot-storage.so.0
#17 0x00007f75966774d5 in mailbox_sync_init () from /usr/lib/dovecot/libdovecot-storage.so.0
#18 0x00007f7596677547 in mailbox_sync () from /usr/lib/dovecot/libdovecot-storage.so.0
#19 0x000055f124432af7 in ?? ()
#20 0x00007f759657dadf in io_loop_call_io () from /usr/lib/dovecot/libdovecot.so.0
#21 0x00007f759657f0d6 in io_loop_handler_run_internal () from /usr/lib/dovecot/libdovecot.so.0
#22 0x00007f759657db7c in io_loop_handler_run () from /usr/lib/dovecot/libdovecot.so.0
#23 0x00007f759657dce0 in io_loop_run () from /usr/lib/dovecot/libdovecot.so.0
#24 0x00007f75964fe0d3 in master_service_run () from /usr/lib/dovecot/libdovecot.so.0
#25 0x000055f1244326aa in main ()
(gdb)

As suggested, this was with:

service imap {
  client_limit = 1
}
service indexer-worker {
  vsz_limit = 1GB
}

ams001 commented 3 years ago

I missed a couple more of these, unfortunately:

Nov  8 05:26:32 fawkes dovecot: indexer-worker(ccccc@example.com)<31709><oNIQLCN+p1/dewAA+4JX8g>: Error: FTS Xapian: Memory error 'std::bad_alloc'
Nov  8 05:26:32 fawkes dovecot: indexer-worker(ccccc@example.com)<31709><oNIQLCN+p1/dewAA+4JX8g>: Error: Mailbox errors: Mail search failed: Internal error occurred. Refer to server log for more information. [2020-11-08 05:12:03]
Nov  8 05:26:32 fawkes dovecot: indexer-worker(ccccc@example.com)<31709><oNIQLCN+p1/dewAA+4JX8g>: Error: Mailbox errors: Transaction commit failed: FTS transaction commit failed: transaction context (attempted to index 1 messages (UIDs 1..1))
Nov  8 06:35:13 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><gO2uLiyOp1+VIQAA+4JX8g>: Error: FTS Xapian: Memory error 'std::bad_alloc'
Nov  8 06:35:14 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><gO2uLiyOp1+VIQAA+4JX8g>: Error: Mailbox misc: Mail search failed: Internal error occurred. Refer to server log for more information. [2020-11-08 06:20:28]
Nov  8 06:35:14 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><gO2uLiyOp1+VIQAA+4JX8g>: Error: Mailbox misc: Transaction commit failed: FTS transaction commit failed: transaction context (attempted to index 1465 messages (UIDs 1..3202))
Nov  8 06:50:05 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><aH4lDaKRp1+VIQAA+4JX8g>: Error: FTS Xapian: Memory error 'std::bad_alloc'
Nov  8 06:50:06 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><aH4lDaKRp1+VIQAA+4JX8g>: Error: Mailbox urgent: Mail search failed: Internal error occurred. Refer to server log for more information. [2020-11-08 06:35:14]
Nov  8 06:50:06 fawkes dovecot: indexer-worker(ddddd@example.com)<8597><aH4lDaKRp1+VIQAA+4JX8g>: Error: Mailbox urgent: Transaction commit failed: FTS transaction commit failed: transaction context (attempted to index 3 messages (UIDs 1..3))
grosjo commented 3 years ago

@ams001 Are you using the latest git? Can you please provide the full log (with the FTS Xapian log messages)?

grosjo commented 3 years ago

The backtrace above suggests that the crash happens outside FTS, in the storage engine. Maybe your disk is full?

ams001 commented 3 years ago

The backtrace above suggests that the crash happens outside FTS, in the storage engine. Maybe your disk is full?

Far from it. 800GB+ free. It's a 1TB RAID1 at 20% usage.

ams001 commented 3 years ago

@ams001 Are you using the latest git? Can you please provide the full log (with the FTS Xapian log messages)?

Yes. Updated yesterday just before I started the run.

I'm going to snapshot this live system and run it offline in a VM so I can run tests on demand. As I mentioned, this is a production host, and with full logging the impact on performance is noticeable, which is why I mainly run tests over the weekend. I've run out of time this weekend, so it will have to be next weekend.

Unfortunately there are also new errors, but most look memory-related, so I'm also going to raise vsz_limit to 4GB for the next run to see whether 1GB is the cause. I'll file these separately when I get the chance.

grosjo commented 3 years ago

To test the memory issue, you need to set a low value for vsz_limit (1GB is fine) and set "verbose=1" in dovecot.conf.

On my side, it reacts properly (it cleans the memory before it gets too full).
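
A sketch of that test configuration, reusing the fts_xapian option string seen earlier in this thread (the exact values are illustrative):

plugin {
  fts = xapian
  fts_xapian = partial=3 full=20 verbose=1
}

service indexer-worker {
  vsz_limit = 1GB
}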

grosjo commented 3 years ago

Note: I use it on a 200GB mailbox.

ams001 commented 3 years ago

To test the memory issue, you need to set a low value for vsz_limit (1GB is fine) and set "verbose=1" in dovecot.conf.

On my side, it reacts properly (it cleans the memory before it gets too full).

OK, I'll give that a try on the next run.

The largest mailbox we have is 6.2GB (compressed), the smallest is 950MB, so it does not make much sense that we see this error and you don't. The server runs Nagios, so we'd see any intermittent memory or disk space issues.

grosjo commented 3 years ago

@ams001 Any feedback on this ?

ams001 commented 3 years ago

@grosjo Sorry, I have not had the time this week to set up a VM to test this on - I had two releases to spin, QA, and ship. I should be able to test further next week.

grosjo commented 3 years ago

Hi

Any feedback ? all ok ?

ams001 commented 3 years ago

Hi

Sorry, meant to get back to you yesterday, so thanks for the reminder. The "std::bad_alloc" issue no longer occurs, and since that is what this report is about, you can close this issue.

There is another issue, unfortunately, which I'll file when I know more. It may be a Xapian issue, but I see a huge difference in the size of the xapian_index directories between the live system and the snapshot system (which is quiet, and in some cases up to 80% smaller). What appears to be happening, but not always, is that while a folder/inbox is being indexed, if an email arrives in it, the indexing appears to restart but some index files, possibly temporary ones, are left behind. Sometimes this happens several times while indexing a large inbox (which takes a while). This could explain the large variation in the sizes of the xapian_index directories between my live and quiescent snapshot. Does this make sense?

jnahmias commented 3 years ago

@ams001 - that's great news, thanks for the confirmation! @grosjo - would you mind tagging a new release version? I'd like to update the package in Debian.

grosjo commented 3 years ago

@ams001 Thanks, feel free to open another issue when you have more info.

@jnahmias: done.