SpamAssassin data directory permissions preventing live training

dhpiggott commented 10 years ago

I observe the SpamAssassin global statistics:

david@elm:~⟫ sa-learn --dump magic --dbpath /home/user-data/mail/spamassassin/
config: path "/home/david/.spamassassin/user_prefs" is inaccessible: Permission denied
0.000          0          3          0  non-token data: bayes db version
0.000          0        805          0  non-token data: nspam
0.000          0      28976          0  non-token data: nham
0.000          0     125587          0  non-token data: ntokens
0.000          0 1390477427          0  non-token data: oldest atime
0.000          0 1413019640          0  non-token data: newest atime
0.000          0 1413043327          0  non-token data: last journal sync atime
0.000          0 1412674765          0  non-token data: last expiry atime
0.000          0   22118400          0  non-token data: last expire atime delta
0.000          0    1140419          0  non-token data: last expire reduction count

I send myself an email from my old Gmail address.

I observe the SpamAssassin global statistics again and find that nham has not incremented:

david@elm:~⟫ sa-learn --dump magic --dbpath /home/user-data/mail/spamassassin/
config: path "/home/david/.spamassassin/user_prefs" is inaccessible: Permission denied
0.000          0          3          0  non-token data: bayes db version
0.000          0        805          0  non-token data: nspam
0.000          0      28976          0  non-token data: nham
0.000          0     125587          0  non-token data: ntokens
0.000          0 1390477427          0  non-token data: oldest atime
0.000          0 1413019640          0  non-token data: newest atime
0.000          0 1413043327          0  non-token data: last journal sync atime
0.000          0 1412674765          0  non-token data: last expiry atime
0.000          0   22118400          0  non-token data: last expire atime delta
0.000          0    1140419          0  non-token data: last expire reduction count
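
A small helper makes the before/after comparison less error-prone than reading the whole dump twice. This is only a sketch: `magic_count` is a name invented here, and it assumes the column layout shown in the output above, with the value in the third column.

```shell
# Hypothetical helper: print one counter from `sa-learn --dump magic` output.
# Reads the dump on stdin; $1 is the counter name (e.g. nham, nspam, ntokens).
magic_count() {
    awk -v name="$1" '
        /non-token data: / {
            label = $0
            sub(/^.*non-token data: /, "", label)   # keep only the counter name
            sub(/[ \t\r]+$/, "", label)              # drop trailing whitespace
            if (label == name) { print $3; exit }    # third column holds the value
        }'
}
```

Usage would be `sa-learn --dump magic --dbpath /home/user-data/mail/spamassassin/ | magic_count nham`, run once before and once after sending the test message.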

I look at the logs and see this:

...
Oct 11 16:26:35 elm postfix/smtpd[25125]: disconnect from mail-la0-x231.google.com[2a00:1450:4010:c03::231]
Oct 11 16:26:35 elm spampd[9374]: processing message <CAKHNFFcEvdjzunYw8y7-hMBxXO60H1KrK+hq+ovskovaPODUrg@mail.gmail.com> for <david@piggott.me.uk>
Oct 11 16:26:35 elm spampd[9374]: bayes: cannot write to /home/user-data/mail/spamassassin/bayes_journal, bayes db update ignored: Permission denied#012
Oct 11 16:26:35 elm spampd[9374]: plugin: eval failed: bayes: (in learn) locker: safe_lock: cannot create tmp lockfile /home/user-data/mail/spamassassin/bayes.lock.elm.dhpiggott.net.9374 for /home/user-data/mail/spamassassin/bayes.lock: Permission denied#012
Oct 11 16:26:35 elm spampd[9374]: clean message <CAKHNFFcEvdjzunYw8y7-hMBxXO60H1KrK+hq+ovskovaPODUrg@mail.gmail.com> (-1.49/5.00) from <dhpiggott@gmail.com> for <david@piggott.me.uk> in 0.07s, 2043 bytes.
...

These are the permissions for the SA database - just as setup/spamassassin.sh sets them:

david@elm:/home/user-data/mail/spamassassin⟫ ls -la
total 6112
drwxrwxr-x 2 mail mail        4096 Oct 11 16:02 .
drwxrwxr-x 8 root www-data    4096 Oct  4 16:52 ..
-rwxrwxr-x 1 mail mail     2670592 Oct 11 06:55 bayes_seen
-rwxrwxr-x 1 mail mail     5025792 Oct 11 16:02 bayes_toks
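
With mode rwxrwxr-x and owner mail:mail, only user mail and members of group mail can write here, which matches the "Permission denied" lines from spampd. A quick way to check this per user is sketched below. `can_write` is a made-up helper name, and the variant with a user name assumes `sudo` is available and permitted to run commands as that user.

```shell
# Hypothetical helper: report whether a user can write to a path.
# $1 = path, $2 = optional user name; with no user, the current user is checked.
can_write() {
    if [ -n "${2:-}" ]; then
        # Run the write test as the named user (assumes sudo is set up for this).
        sudo -u "$2" test -w "$1"
    else
        test -w "$1"
    fi
}
```

For example, `can_write /home/user-data/mail/spamassassin spampd || echo denied` would be expected to print `denied` with the permissions listed above.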

dhpiggott commented 10 years ago

I change the owner of the SA database directory to spampd:

david@elm:~⟫ sudo chown -R spampd:spampd /home/user-data/mail/spamassassin
david@elm:~⟫ ls -la !$
ls -la /home/user-data/mail/spamassassin
total 6116
drwxrwxr-x 2 spampd spampd      4096 Oct 11 16:44 .
drwxrwxr-x 8 root   www-data    4096 Oct  4 16:52 ..
-rwxrwxr-x 1 spampd spampd   2670592 Oct 11 16:44 bayes_seen
-rwxrwxr-x 1 spampd spampd   5025792 Oct 11 16:44 bayes_toks

I send myself another test email.

I observe the SpamAssassin global statistics again and find that nham has incremented:

david@elm:~⟫ sa-learn --dump magic --dbpath /home/user-data/mail/spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0        805          0  non-token data: nspam
0.000          0      28977          0  non-token data: nham
0.000          0     125604          0  non-token data: ntokens
0.000          0 1390477427          0  non-token data: oldest atime
0.000          0 1413046040          0  non-token data: newest atime
0.000          0 1413043327          0  non-token data: last journal sync atime
0.000          0 1412674765          0  non-token data: last expiry atime
0.000          0   22118400          0  non-token data: last expire atime delta
0.000          0    1140419          0  non-token data: last expire reduction count

I observe the SA database directory and note that a journal file now exists:

david@elm:~⟫ !-2
ls -la /home/user-data/mail/spamassassin
total 6120
drwxrwxr-x 2 spampd spampd      4096 Oct 11 16:47 .
drwxrwxr-x 8 root   www-data    4096 Oct  4 16:52 ..
-rw------- 1 spampd spampd      4608 Oct 11 16:47 bayes_journal
-rwxrwxr-x 1 spampd spampd   2670592 Oct 11 16:47 bayes_seen
-rwxrwxr-x 1 spampd spampd   5025792 Oct 11 16:47 bayes_toks

I look at the logs and see success there too:

Oct 11 16:47:20 elm postfix/smtpd[11008]: disconnect from mail-lb0-x236.google.com[2a00:1450:4010:c04::236]
Oct 11 16:47:20 elm spampd[9368]: processing message <CAKHNFFfX-daQBX212YXO7aijyOTS9ziNQg9m+Ve-QLE07sec1w@mail.gmail.com> for <david@piggott.me.uk>
Oct 11 16:47:20 elm spampd[9368]: clean message <CAKHNFFfX-daQBX212YXO7aijyOTS9ziNQg9m+Ve-QLE07sec1w@mail.gmail.com> (-1.49/5.00) from <dhpiggott@gmail.com> for <david@piggott.me.uk> in 0.04s, 2043 bytes.

So at least on the surface it would seem that setup/spamassassin.sh should be changed to make spampd the owner, not mail.

dhpiggott commented 10 years ago

Nack, I was a little too hasty there, sorry! I just found that though the change fixes live training for incoming mail, it breaks it for sieve training when manually moving a received email between IMAP folders. There's no error, but the SA stats don't change; both mail and spampd need write permissions.

Can you reopen this ticket or should I open another?

dhpiggott commented 10 years ago

Thanks.

As I've defined an alias for root@primary-hostname that forwards to my real account, I get emailed the output of cron jobs. I also note the following warning for the daily spamassassin update:

...
/etc/cron.daily/spamassassin:
Oct 11 06:55:30.890 [19066] warn: bayes: cannot write to /home/user-data/mail/spamassassin/bayes_journal, bayes db update ignored: Permission denied
bayes: cannot write to /home/user-data/mail/spamassassin/bayes_journal, bayes db update ignored: Permission denied

Inspection of /etc/cron.daily/spamassassin shows that the failing command is (I think) su - debian-spamd -c "sa-update --gpghomedir /var/lib/spamassassin/sa-update-keys", i.e. it's running as user debian-spamd (even if it's not that command failing, many/most are run as debian-spamd).

So I think the fix for this will be to make the spamassassin directory owned and writeable by a group which has debian-spamd, mail and spampd as members. I'm just not sure which group it should be - I know any will work, I'm just uncertain about any security implications.

Do you have any thoughts on whether it should be owned by mail/spampd/something else (and therefore which group should have all those users as members)? I'm leaning toward spampd.

JoshData commented 10 years ago

Yikes so many groups!

I wonder also how the permissions will get set when the files are first created.

I can't think of a reason to set up the group one way or another.

dhpiggott commented 10 years ago

I just tried to check how the files first get created by running a Vagrant deploy and using test_mail.py against it. It seems they don't. It may be that the only reason they exist on my actual deployment is because I manually ran sa-learn to train against my imported Maildir.

vagrant@mailinabox:/home/user-data/mail/spamassassin$ ls -la
total 8
drwxrwxr-x 2 spampd spampd   4096 Oct 11 18:14 .
drwxrwxr-x 8 root   www-data 4096 Oct 11 18:19 ..

Despite no logged errors:

Oct 11 18:28:03 mailinabox postfix/smtps/smtpd[4331]: disconnect from unknown[192.168.50.1]
Oct 11 18:28:03 mailinabox dovecot: lmtp(4335): Connect from 127.0.0.1
Oct 11 18:28:03 mailinabox spampd[21464]: processing message (unknown) for <me@95aad.justtesting.email>
Oct 11 18:28:03 mailinabox spampd[21464]: clean message (unknown) (2.30/5.00) from <me@95aad.justtesting.email> for <me@95aad.justtesting.email> in 0.04s, 905 bytes.

I went ahead and manually trained against the empty Spam maildirs to confirm that running sa-learn does create the files:

vagrant@mailinabox:/home/user-data/mail/spamassassin$ sudo sa-learn --spam /home/user-data/mail/mailboxes/*/*/.Spam/{cur,new}/
Learned tokens from 0 message(s) (0 message(s) examined)
vagrant@mailinabox:/home/user-data/mail/spamassassin$ ls -la
total 24
drwxrwxr-x 2 spampd spampd    4096 Oct 11 18:33 .
drwxrwxr-x 8 root   www-data  4096 Oct 11 18:19 ..
-rw------- 1 root   root     12288 Oct 11 18:32 bayes_seen
-rw------- 1 root   root     12288 Oct 11 18:33 bayes_toks

I think the fix will now be to do something like:

Add user debian-spamd to group mail in setup/spamassassin.sh.
Add user spampd to group mail in setup/spamassassin.sh, restart spampd.
Make the spamassassin directory ownership mail:mail.
Set the setgid flag on the spamassassin directory.
Uncomment the initial training commands and modify them so that:
1. They are run as user spampd.
2. They are only run if the files do not already exist (though sa-learn is idempotent and will not "learn things twice", running against my archive of nearly 30,000 emails on a Linode 1GB takes what feels like at least 10 minutes, and we don't want to make upgrades take that long all for nothing!).
3. Nice to have: recurse on all subfolders excluding Spam.
4. Alternative: can we tell sa-learn to just create empty files? While it would be nice to have it fully learn from existing mailboxes, the time it can take makes doing so in setup a bad idea, and the files are in $STORAGE anyway so there's no need to relearn when recovering from a backup.

I switched to making user spampd a member of group mail, rather than user mail a member of group spampd, because this way sa-learn can read the mailboxes - I hadn't realised it would be necessary to run sa-learn to create the SA database files.
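
The setgid step and the "only if the files do not already exist" guard from the list above could look roughly like this. It is a sketch under assumptions, not what setup/spamassassin.sh does: both function names are invented for illustration, and bayes_toks is used as the marker file only because it appears in the listings above.

```shell
# Sketch only: function names are invented for illustration.

# Give a directory to a group and set the setgid bit so that files created
# inside it inherit the directory's group instead of the creator's primary group.
share_bayes_dir() {
    # $1 = directory, $2 = group
    chgrp "$2" "$1" && chmod 2775 "$1"
}

# Succeed if the Bayes token database has not been created yet, so that a
# slow initial sa-learn run can be skipped on upgrades.
bayes_db_missing() {
    # $1 = directory
    [ ! -e "$1/bayes_toks" ]
}

# Example with the paths from this thread (not executed here):
#   share_bayes_dir /home/user-data/mail/spamassassin mail
#   if bayes_db_missing /home/user-data/mail/spamassassin; then
#       ... run the initial training commands as user spampd ...
#   fi
```

Note that setgid only controls the group of new files. Their mode bits still come from whichever program creates them, and the bayes_journal listed earlier was created as -rw-------. SpamAssassin's bayes_file_mode setting may be the relevant knob there, but that is an assumption to verify.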

But before I make these changes I think I should read up about SpamAssassin a bit more and look at examples of other configurations to check this is the best way (before I switched to mailinabox I was using DSPAM but I never really understood it anyway).

Until then, I don't see any need to revert the change you've already merged though as I don't think sieve-training not working is any worse than receiving-training not working.

JoshData commented 10 years ago

Let's try to keep this simple. 1-4 should be enough. In place of 4, sa-learn.sh could be modified to explicitly set better permissions on the generated files. [edit: not recommending this specifically, just mentioning it]

Thanks.

dhpiggott commented 10 years ago

I'll certainly try to keep it simple.

I resumed looking at this, nerd-sniped myself into looking at a bunch of Postfix/SpamAssassin docs, distilled it down to a few hopefully-relevant tabs, and then ran out of time. I'm just posting this comment as my notes for when I next work on this and/or for anyone else interested.

Useful references:

Notes to self:

The sieve script, conf/sieve-spam.txt, has setflag "\\Seen". In my old DSPAM setup I deliberately didn't do this so that I would notice the presence of new spam from the unread count in my IMAP client without having to actually open the folder to check. I seem to recall that Gmail does the same thing (leaves spam as unread), and from a usability point of view I find Gmail to be a good model.
Try to find out what the rationale for SpamAssassin wrapping spam as attachments is, and whether/how training works properly if done on a mixed folder (wrapped and unwrapped spam).

JoshData commented 10 years ago

Hey,

I want to try to get this wrapped up so I can push another release, so I dug into it a bit. I couldn't get it to work by adding spampd to the mail group, or vice versa. Adding the group with usermod -G had no effect. Don't know why modifying the spampd user didn't work. Dovecot sort of explicitly doesn't let you do it but has an option mail_access_groups that lets you specify other groups to run as. So I added spampd to that list, and that took care of the spampd process and the sa-learn script.
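
For anyone retrying the group approach, a rough sketch of how membership could be added and checked follows. `in_group` is a made-up helper, and the commented commands are illustrative rather than what was actually run here. Two general points apply: `usermod -G` without `-a` replaces the user's supplementary group list instead of appending to it, and a running process keeps the groups it started with until it is restarted.

```shell
# Hypothetical helper: succeed if user $1 is a member of group $2
# according to the account database.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Adding a supplementary group (run as root; not executed here):
#   usermod -a -G mail spampd     # -a appends; -G alone replaces the list
#   service spampd restart        # a running process keeps its old groups
#   in_group spampd mail && echo "spampd is in mail"
```

This only checks the account database. On Linux, the `Groups:` line in /proc/<pid>/status shows which groups a running process actually holds.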

I haven't been getting errors with debian-spamd (not sure why not) so I didn't try to fix that, since I don't have a way to test if it worked.

832860d79647573f6beeb8871e9d2f21b421dd69 (there's another commit on top of that that reorganizes spamassassin.sh)

Let me know if it works for you?

JoshData commented 10 years ago

sorry that's 7ca54a2bfb12179ffbd8d0c00f44efee7d0e5a4e

dhpiggott commented 10 years ago

The changes look good to me - they should definitely be an improvement. I have two minor concerns:

I wonder if there is going to be a problem with the journal file. As it won't necessarily exist at the point that the chmod -R runs it may end up being created later by the spampd process (so as the spampd user) during processing of incoming mail, in which case, would the group permissions allow sieve triggered retraining (as the mail user) to write to it?
Does storage/mail/spamassassin really need to be unreadable by other users? In the absence of any stats in the web UI (and I'm not suggesting there should be any) I'm using sa-learn --dump magic to check things are working as they should, and previously I could run it as my non-privileged user - it's just nice to not have to sudo unnecessarily.
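
The journal concern can be checked after the fact by listing files that the group cannot write. This is a minimal sketch: `not_group_writable` is an invented name, and it assumes a `find` that supports `-maxdepth` (GNU and BSD find do).

```shell
# Hypothetical helper: list regular files directly under directory $1 that
# lack the group-write bit, i.e. files a second user in the group cannot update.
not_group_writable() {
    find "$1" -maxdepth 1 -type f ! -perm -020
}
```

Running `not_group_writable /home/user-data/mail/spamassassin` after some mail has been processed would show a bayes_journal created with mode -rw-------, like the one listed earlier in this thread.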

Stupid question re. adding spampd to the mail group: did you restart spampd after doing so?

If we can leave this open I'll hopefully resolve the debian-spamd issue soon enough.

dhpiggott commented 10 years ago

I confirm both incoming learning via the LMTP proxy and sieve relearning are now working for me - sa-learn --dump shows nspam and nham counts change as expected.

The one question I find myself asking now is why any learning needs to be done by the LMTP proxy - why doesn't/can't the sieve script also take care of training on new mail as it's delivered to Dovecot? That'd surely be simpler.

dhpiggott commented 10 years ago

No error output from /etc/cron.daily/spamassassin this morning, though I don't know why!

JoshData commented 10 years ago

> I wonder if there is going to be a problem with the journal file.

Hmm. I have never actually seen the journal file. But I get your point. It might also be created by the sa-learn-pipe script and owned by mail, locking out spampd.

> Does storage/mail/spamassassin really need to be unreadable by other users?

No (I assume all local processes are trusted) but it seemed like a nice thing to do.

> did you restart spampd after doing so?

Pretty sure. I know that's necessary for groups. But it's surprising it didn't work so maybe I messed something up.

> why any learning needs to be done by the LMTP proxy

I didn't even realize learning was happening then. If we can turn that off then maybe we can re-do this again with the files owned by mail? (I don't really want to re-do it though.)

dhpiggott commented 10 years ago

On the cron job: I've not seen any further errors so I'm going to let that one go without fully understanding it.

On learning: by adding ADDOPTS="--config=/etc/spampd.conf" to /etc/default/spampd and bayes_auto_learn 0 to /etc/spampd.conf, I have just successfully switched off training within spampd itself, and confirmed that the sieve rule takes care of training when incoming mail is placed in my inbox.
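
For reference, the two edits described above could be scripted roughly as follows. This is a sketch, not part of the setup scripts: the function and variable names are invented, the default paths are the ones named in this comment, and spampd would presumably need a restart afterwards.

```shell
# Sketch only. The two paths are the ones named above; override the variables
# to try this against scratch files first.
SPAMPD_DEFAULTS="${SPAMPD_DEFAULTS:-/etc/default/spampd}"
SPAMPD_CONF="${SPAMPD_CONF:-/etc/spampd.conf}"

# Append a line to a file unless that exact line is already present.
ensure_line() {
    # $1 = file, $2 = line
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Point spampd at its own SpamAssassin config and switch off auto-learning there.
disable_spampd_autolearn() {
    ensure_line "$SPAMPD_DEFAULTS" "ADDOPTS=\"--config=$SPAMPD_CONF\""
    ensure_line "$SPAMPD_CONF" "bayes_auto_learn 0"
}
```

Calling `disable_spampd_autolearn` twice leaves each file with a single copy of its line, so the function is safe to re-run.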

This would simplify the configuration greatly and invalidate those concerns about the journal file. I'm going to have a go at making a change that would redo this - I think it will be worth it, but you can be judge of that if/when I have something to show! It should just amount to reverting four commits and adding one with the two parameters above.

JoshData commented 10 years ago

Sounds good.

dhpiggott commented 10 years ago

I take my above comment about learning back. I was mistaken in thinking Dovecot antispam was taking care of learning when I had disabled learning in spampd via spampd.conf (it involves the pretty stupid mistake of me adding my bayes_auto_learn 0 line as bayes_auto_learn 1).
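
A check like the following would have caught that 0-versus-1 slip. It is a minimal sketch: `auto_learn_value` is an invented name, and it assumes that when the setting is repeated in a file, the last uncommented occurrence is the one that takes effect.

```shell
# Hypothetical helper: print the value of the last uncommented bayes_auto_learn
# line in a config file ($1); prints nothing if the setting is absent.
# Assumes that when a setting is repeated, the last occurrence wins.
auto_learn_value() {
    awk '$1 == "bayes_auto_learn" { value = $2 } END { if (value != "") print value }' "$1"
}
```

For example, `auto_learn_value /etc/spampd.conf` should print 0 if learning has really been switched off in that file.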

When I then actually disabled learning in spampd I found that incoming mail was not fed to sa-learn by Dovecot antispam, and doing some reading I realised Dovecot antispam is already being used for as much as it can be - it is only meant for retraining, so we do have to have spampd handle learning on incoming mail.

As for the journal file, according to http://commons.oreilly.com/wiki/index.php/SpamAssassin/SpamAssassin_as_a_Learning_System, the bayes_learn_to_journal parameter is disabled by default on SpamAssassin 3.0 (the version provided by Ubuntu 14.04 is 3.4) and I can't see that it has been enabled anywhere, so I don't even know why I was seeing a journal file (looking right now, I don't see one).

In conclusion, I'm happy to close this issue now - the changes you made really do seem to be the best fix for the learning permissions problem - thanks!

JoshData commented 10 years ago

Ok thanks again for looking into all this. I guess we got lucky that the fix we ended up with actually was the right approach. :)

mail-in-a-box / mailinabox

SpamAssassin data directory permissions preventing live training #231