slusarz / dovecot-fts-flatcurve

Dovecot FTS Flatcurve plugin (Xapian)
https://slusarz.github.io/dovecot-fts-flatcurve/
GNU Lesser General Public License v2.1
38 stars 8 forks source link

imap segfault with fts flatcurve #54

Closed kadlec-g closed 4 months ago

kadlec-g commented 5 months ago

We started to use fts flatcurve on Debian bookworm with the following components:

Everything seems to work fine, except imap crashes from time to time with different users:

Feb 16 21:52:06 mail2 kernel: [459605.167041] imap[55238]: segfault at 8 ip 0000620dfcc4aee9 sp 0000713494feb4e0 error 4 in libxapian.so.30.12.3[620dfcc14000+173000]

Enabling coredumps the backtrace shows:

0 0x00006ffff4529ee9 in Xapian::Database::get_document(unsigned int) const ()

from /lib/x86_64-linux-gnu/libxapian.so.30

1 0x00006ffff6e67cb0 in fts_flatcurve_xapian_uid_exists_db (uid=uid@entry=8040,

backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:916

2 0x00006ffff6e6bb34 in fts_flatcurve_xapian_write_db_by_uid (uid=8040,

backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:935

3 fts_flatcurve_xapian_expunge (backend=0x1a4f0a594a70, uid=8040)

at fts-backend-flatcurve-xapian.cpp:1179

4 0x00006ffff75050ba in mailbox_sync_notify ()

from /usr/lib/dovecot/libdovecot-storage.so.0

5 0x00006ffff75292e1 in sdbox_sync_finish ()

It happens consistently at IMAP EXPUNGE, in all cases. The mailboxes are on NFS but a single machine accesses them. Any hint is really appreciated.

kadlec-g commented 4 months ago

In order to make sure it's not related to locking over NFS, the mount option local_lock=all was enabled and in mail_location we set VOLATILEDIR=/dev/shm/dovecot/%u. Still, the segfaults happen.

kadlec-g commented 4 months ago

Enabling debuggin revealed more informations:

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Opened DB (RO) messages=4 version=1 shards=1 Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Fatal: master: service(imap): child 42566 killed with signal 11

We have got the following settings to exclude mailboxes:

fts_autoindex_exclude = Trash
fts_autoindex_exclude2 = Spam

[According to the documentation, mailbox name or special use flag can be used.] Checking the database of the Spam mailbox does not show an error:

doveadm fts-flatcurve check -u XXXX mailbox Spam Spam guid=784d14359d22265d94c0000048cecde4 errors=0 shards=1

edieterich commented 4 months ago

I think I have the same problem. Can you check if you have empty Flatcurve DB directories for the affected users? And if you do, does the crash go away if you delete them?

The empty DB directories appear on NFS if Flatcurve optimizes the index. There's is code in Flatcurve that deletes empty DB directories, but that fails on NFS because .nfs files are kept open in the DB directory.

The attached patch (flatcurve_fix_empty_db_dir.patch) fixes the problem for me.

I added a second patch (flatcurve_work_around_unlink_directory_bug.patch), because this problem goes unnoticed when Flatcurves tries to delete empty DB directories. This is actually a bug in Dovecot core, I guess, because the used unlink_directory() function doesn't report an error if the final rmdir fails.

kadlec-g commented 4 months ago

No, the directories are not empty. As I'm unfamiliar with xapian, I'm just guessing the meaning of the log lines:

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database

As far as I see there was an dbox/mailboxes/Spam/dbox-Mails/fts-flatcurve/index.6340/ directory under the user's home and either it was empty or it was broken.

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; current.1708356974118889); DatabaseNotFoundError: Couldn't detect type of database

I missed to include this line: is it a test to create a new database? But the line says RO, so it should not create new database. The mailbox is explicitly excluded from indexing, so it should be left out. What's " current.1708356974118889" then?

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Opened DB (RO) messages=4 version=1 shards=1

According to the line, the db was successfully openen, it contains 4 messages, it's version number is 1 and contains one shard.

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Fatal: master: service(imap): child 42566 killed with signal 11

And then comes the segmentation fault. Looking into the directory in question, there's an "index.3514" directory there with modification date earlier:

drwx------ 2 XXXX XXXX 4096 Feb 20 06:27 index.3514

The directory looks OK (except it should not exist at all):

-rw------- 1 XXXX XXXX 0 Feb 20 06:27 flintlock -rw------- 1 XXXX XXXX 93 Feb 20 06:27 iamglass -rw------- 1 XXXX XXXX 32768 Feb 20 06:27 postlist.glass -rw------- 1 XXXX XXXX 8192 Feb 20 06:27 termlist.glass

At 6:25 we run a cron job which calls "doveadm fts optimize -A". That command messes up the excluded mailboxes?

kadlec-g commented 4 months ago

I think I have the same problem. Can you check if you have empty Flatcurve DB directories for the affected users? And if you do, does the crash go away if you delete them?

The empty DB directories appear on NFS if Flatcurve optimizes the index. There's is code in Flatcurve that deletes empty DB directories, but that fails on NFS because .nfs files are kept open in the DB directory.

The .nfs files are created by the NFS client when while a process keeps open a file, another one removes it. So while the file is kept open, the .nfs file is there and cannot be removed. It migth be some syncronization, timing issue between two processes.

slusarz commented 4 months ago

Need to work on this more ... but I'm fairly certain that @edieterich first patch is the correct fix here.

The code that is crashing is iterating through the internal DB hash table - which would potentially include stale DB entries containing Xapian objects that point to data directories that no longer exist. These DB entries should be removed from the table when they are determined to be invalid, which is missing in the current code; only the physical directories are being removed, not the internal cache state.

The log entries indicate flatcurve is otherwise working as expected. "Couldn't detect type of database" is the error Xapian would throw if the shard directory was deleted. After an optimize, you should end up with a single, newly created shard, and the successful load of this shard doesn't have a debug entry. It thus looks like flatcurve is correctly detecting the missing directories, it's just not deleting them from the internal table.

Need to verify with a unit test before committing the fix. Pretty sure this can only happen with long running mail sessions that had previously accessed the mailbox - this is essentially a caching issue.

@edieterich : any chance you can submit your first patch as an official merge request?

edieterich commented 4 months ago

My pull request: https://github.com/slusarz/dovecot-fts-flatcurve/pull/55

kadlec-g commented 4 months ago

Hello,

On Wed, 21 Feb 2024, Michael M Slusarz wrote:

Need to work on this more ... but I'm fairly certain that @edieterich first patch is the correct fix here.

The code that is crashing is iterating through the internal DB hash table - which would potentially include stale DB entries containing Xapian objects that point to data directories that no longer exist. These DB entries should be removed from the table when they are determined to be invalid, which is missing in the current code; only the physical directories are being removed, not the internal cache state.

The log entries indicate flatcurve is otherwise working as expected. "Couldn't detect type of database" is the error Xapian would throw if the shard directory was deleted. After an optimize, you should end up with a single, newly created shard, and the successful load of this shard doesn't have a debug entry. It thus looks like flatcurve is correctly detecting the missing directories, it's just not deleting them from the internal table.

Need to verify with a unit test before committing the fix. Pretty sure this can only happen with long running mail sessions that had previously accessed the mailbox - this is essentially a caching issue.

@edieterich : any chance you can submit your first patch as an official merge request?

As I wrote, .nfs file cannot be deleted: a new one is created if you delete one. So my approach was different: in fts_flatcurve_xapian_db_iter_init() I ignored all files which are not valid db types.

It did not help.

So I created a second patch on top of it: flatcurve uses the user directory for lockfile creation. But we use VOLATILEDIR in order to avoid locking over NFS. In the attached second patch I added supporting VOLATILEDIR.

Since then (~12h) we had a single segfault instead of 2-3/h. Unfortunately debugging was not enabled then so I'm waiting if any other segfault happens again, with debugging enabled.

I attached the patches from the fork dovecot-fts-flatcurve-nfs.

Best regards, Jozsef -- E-mail : @.*** PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics H-1525 Budapest 114, POB. 49, Hungary

slusarz commented 4 months ago

"I attached the patches from the fork dovecot-fts-flatcurve-nfs." - no patches showed up here. (I don't think replying to GitHub via email does allows attachments to be converted to MRs...)

slusarz commented 4 months ago

Thanks to @kadlec-g, I added a patch that deals with the volatiledir situation. (VOLATILEDIR used to be handled natively by locking code, but the dev team made changes when flatcurve was merged to core team that switched to a different locking method.)

Probably need to document recommendation of VOLATILEDIR for NFS installations.

Also, still TBD what needs to be done regarding extra files in the directory. I don't like the solution in #56 as it is too specific to a certain kind of file (".nfs") - this needs to be more generic imho. @kadlec-g has provided a different solution, but I still need to review further and think about it.

slusarz commented 4 months ago

These commits should fix the various issues being reported (none that I could reproduce locally, so relying on external feedback):

If these issues don't fix things, please reopen this ticket.

kadlec-g commented 3 weeks ago

Hello,

On Thu, 22 Feb 2024, Kadlecsik József wrote:

As I wrote, .nfs file cannot be deleted: a new one is created if you delete one. So my approach was different: in fts_flatcurve_xapian_db_iter_init() I ignored all files which are not valid db types.

It did not help.

So I created a second patch on top of it: flatcurve uses the user directory for lockfile creation. But we use VOLATILEDIR in order to avoid locking over NFS. In the attached second patch I added supporting VOLATILEDIR.

The patch was not prepared to handle when multiple mail is handled with the same process and fts_backend_flatcurve_set_mailbox() was called multiple times. I applied the next patch on top of the most recent version of your git tree:

diff --git a/src/fts-backend-flatcurve.c b/src/fts-backend-flatcurve.c index e46b2b0..b9cf9fc 100644 --- a/src/fts-backend-flatcurve.c +++ b/src/fts-backend-flatcurve.c @@ -135,7 +135,7 @@ void fts_backend_flatcurve_set_mailbox(struct flatcurve_fts_backend *backend,

    user = mail_storage_get_user(storage);
    volatile_dir = mail_user_get_volatile_dir(user);

Best regards, Jozsef -- E-mail : @.*** PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics H-1525 Budapest 114, POB. 49, Hungary

slusarz commented 3 weeks ago

I'm not seeing what this patch does.

backend->volatile_dir is guaranteed to be an empty string at this point - if a previous mailbox is opened, volatile_dir is truncated to a zero-length string by fts_backend_flatcurve_close_mailbox(). (This is the same mechanism that clears the mailbox name, for example, so if this truncation code isn't being run you would not be able to open the mailbox at all.)

What exactly are you seeing without this change?

kadlec-g commented 3 weeks ago

On Mon, 24 Jun 2024, Michael M Slusarz wrote:

I'm not seeing what this patch does.

backend->volatile_dir is guaranteed to be an empty string at this point - if a previous mailbox is opened, volatile_dir is truncated to a zero-length string by fts_backend_flatcurve_close_mailbox(). (This is the same mechanism that clears the mailbox name, for example, so if this truncation code isn't being run you would not be able to open the mailbox at all.)

Sorry, drop the patch: in my original one which was running here the truncate line for volatile_dir was missing from fts_backend_flatcurve_close_mailbox(). That missing line caused the issue.

Best regards, Jozsef -- E-mail : @.*** PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt Address: Wigner Research Centre for Physics H-1525 Budapest 114, POB. 49, Hungary