sympa-community / sympa

Sympa, Mailing List Management Software
https://www.sympa.community/sympa
GNU General Public License v2.0
248 stars 99 forks source link

VERP feature should be disabled #1246

Open dpc22 opened 3 years ago

dpc22 commented 3 years ago

Version

6.2.66

Installation method

My own RPM, derived from the "official" RHEL rpms

Expected behavior

Mail delivery in Sympa shouldn't grind to a halt because of a couple of (admittedly large) messages sent to a list with large numbers of invalid subscribers.

Actual behavior

This appears to the case. I appreciate that this might just be a lack of understanding on my part in terms of the tunables that are available. At the moment we are running 4 bulk.pl and a single sympa_msg.pl. I plan to increase to 8 bulk.pl and maybe 2 sympa_msg.pl tomorrow morning, although the individual daemons do seem to consume a lot of memory.

Additional information

Two separate people sent 9.5 MByte messages to a single mailing list which had 1506 subscribers this morning (6.8% of which are invalid: typically expired @cam.ac.uk addresses which should be removed from the list).

(I am using XXX and YYY to censor usernames and listnames respectively below. I hope that I have done this consistently)

Oct  7 09:19:25 lists-1 sympa_msg[11305]: notice Sympa::Spindle::ProcessIncoming::_twist() Processing Sympa::Message <YYY@lists.cam.ac.uk.1633594757.32122>; envelope_sender=; message_id=LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM; sender=clerkofworks@YYY.cam.ac.uk
Oct  7 09:32:33 lists-1 sympa_msg[11305]: notice Sympa::Spindle::ProcessIncoming::_twist() Processing Sympa::Message <YYY@lists.cam.ac.uk.1633595548.11542>; envelope_sender=; message_id=LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM; sender=tutorial-manager@YYY.cam.ac.uk

Oct  7 09:28:55 lists-1 sympa_msg[11305]: info Sympa::Spindle::ToList::_twist() Message Sympa::Message <YYY@lists.cam.ac.uk_3f4bc37354ac2f40af9e242a8c68f567.distribute/shelved:arc_cv=pass;dkim_sign> for Sympa::List <YYY@lists.cam.ac.uk> from rd222@cam.ac.uk accepted (0.22 seconds, 9 sessions, 1506 subscribers), message ID=LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM, size=9489947
Oct  7 09:32:33 lists-1 sympa_msg[11305]: info Sympa::Spindle::ToList::_twist() Message Sympa::Message <YYY@lists.cam.ac.uk.1633595548.11542/shelved:arc_cv=pass;dkim_sign> for Sympa::List <YYY@lists.cam.ac.uk> from tutorial-manager@YYY.cam.ac.uk accepted (0.40 seconds, 9 sessions, 1506 subscribers), message ID=LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM, size=9488220

The email was delivered to the valid members of the list within about 5 minutes in each case. However something strange happened with the invalid addresses:

Sympa generated VERP messages to these recipients, despite the fact that VERP is not enabled on the list (verp_rate=0%) A bit of Googling suggests that is deliberate, although it is not clear why given that Sympa is clearly able to identify the invalid addresses on the list without needing to use VERP.

Much stranger: Sympa sent multiple VERP messages for each invalid address on the list:

2021-10-07 09:36:12 +0100 1mYOro-0003gR-Bo
        <= bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk U=sympa P=local S=9526604
        id=LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM
        for XXX@cam.ac.uk

2021-10-07 09:38:38 +0100 1mYOuk-0004hB-DP
        <= bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk U=sympa P=local S=9524890
        id=LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM
        for XXX@cam.ac.uk

2021-10-07 09:56:15 +0100 1mYPBG-0008II-Uk
        <= bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk U=sympa P=local S=9526604
        id=LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM
        for XXX@cam.ac.uk

2021-10-07 10:05:24 +0100 1mYPK7-0001AX-7v
        <= bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk U=sympa P=local S=9524890
        id=LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM
        for XXX@cam.ac.uk

2021-10-07 11:05:27 +0100 1mYQGD-0005Mp-2h
        <= bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk U=sympa P=local S=9524890
        id=LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM
        for XXX@cam.ac.uk

This is the same (invalid) XXX@cam.ac.uk email address in all 5 cases. Each message was 9.5 MBytes in size.

Three of the messages are Message-ID: LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM while two are Message-ID: LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM

These correspond to the two messages sent to the list at 09:19:25 and 09:32:33 this morning. Sympa was stil generating VERP requests and processsing the bounces 2 hours later.

I spotted that there was a problem at around 11:30: Sympa was spending all of its time generating and processing these VERP probes, and no useful work seemed to be taking place.

I restarting Sympa (systemctl restart sympa) and the VERP requests seemed to stop immediately (how are these tracked?).

It then took the machine about 30 minutes to catch up from the backlog of email which had formed.

Does anyone have a suggestion as to what might be going on here? I can understand one VERP message, but not multiple for every invalid recipient on the list.

Here are the Sympa logs which mention the specific @cam.ac.uk email address above (they are all bounce messages processed by bounced).

Oct  7 09:36:15 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Processing Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595772.15222>; envelope_sender=; message_id=E1mYOsu-000yIq-LF@ppsw-31.csi.cam.ac.uk; sender=mailer-daemon@ppsw.cam.ac.uk
Oct  7 09:36:15 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() VERP in use: bounce Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595772.15222> related to <XXX@cam.ac.uk> for list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 09:36:15 lists-1 bounced[11311]: notice Sympa::Tracking::store() Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595772.15222> is stored into Sympa::Tracking as <XXX@cam.ac.uk>
Oct  7 09:36:15 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Received bounce for email address XXX@cam.ac.uk, list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 09:38:40 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Processing Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595918.18162>; envelope_sender=; message_id=E1mYOvG-000nzQ-ko@ppsw-40.csi.cam.ac.uk; sender=mailer-daemon@ppsw.cam.ac.uk
Oct  7 09:38:40 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() VERP in use: bounce Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595918.18162> related to <XXX@cam.ac.uk> for list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 09:38:40 lists-1 bounced[11311]: notice Sympa::Tracking::store() Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633595918.18162> is stored into Sympa::Tracking as <XXX@cam.ac.uk>
Oct  7 09:38:40 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Received bounce for email address XXX@cam.ac.uk, list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 09:56:16 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Processing Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633596976.32237>; envelope_sender=; message_id=E1mYPCK-0011JM-cs@ppsw-30.csi.cam.ac.uk; sender=mailer-daemon@ppsw.cam.ac.uk
Oct  7 09:56:16 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() VERP in use: bounce Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633596976.32237> related to <XXX@cam.ac.uk> for list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 09:56:16 lists-1 bounced[11311]: notice Sympa::Tracking::store() Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633596976.32237> is stored into Sympa::Tracking as <XXX@cam.ac.uk>
Oct  7 09:56:16 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Received bounce for email address XXX@cam.ac.uk, list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 10:05:29 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Processing Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633597525.4835>; envelope_sender=; message_id=E1mYPLB-000Gsd-QP@ppsw-41.csi.cam.ac.uk; sender=mailer-daemon@ppsw.cam.ac.uk
Oct  7 10:05:29 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() VERP in use: bounce Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633597525.4835> related to <XXX@cam.ac.uk> for list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 10:05:29 lists-1 bounced[11311]: notice Sympa::Tracking::store() Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633597525.4835> is stored into Sympa::Tracking as <XXX@cam.ac.uk>
Oct  7 10:05:29 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Received bounce for email address XXX@cam.ac.uk, list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 11:05:32 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Processing Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633601128.21073>; envelope_sender=; message_id=E1mYQHI-000sXR-Qp@ppsw-41.csi.cam.ac.uk; sender=mailer-daemon@ppsw.cam.ac.uk
Oct  7 11:05:32 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() VERP in use: bounce Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633601128.21073> related to <XXX@cam.ac.uk> for list Sympa::List <YYY@lists.cam.ac.uk>
Oct  7 11:05:32 lists-1 bounced[11311]: notice Sympa::Tracking::store() Sympa::Message <bounce+XXX==a==cam.ac.uk==YYY@lists.cam.ac.uk.1633601128.21073> is stored into Sympa::Tracking as <XXX@cam.ac.uk>
Oct  7 11:05:32 lists-1 bounced[11311]: notice Sympa::Spindle::ProcessBounce::_twist() Received bounce for email address XXX@cam.ac.uk, list Sympa::List <YYY@lists.cam.ac.uk>

I can also see the following. I don't know what "starting 5213 seconds after scheduled expedition date" means.

Oct  7 09:31:13 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 138 seconds after scheduled expedition date)
Oct  7 09:31:15 lists-1 bulk[14118]: notice Sympa::Mailer::store() Done sending message Sympa::Message <5.5.1633595335.1633595335.139008.YYY@lists.cam.ac.uk_z,11305,4937/z/shelved:arc_cv=pass;dmarc_protect> for Sympa::List <YYY@lists.cam.ac.uk> (priority 5) in 140 seconds since scheduled expedition date
Oct  7 09:35:03 lists-1 bulk[11316]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 155 seconds after scheduled expedition date)
Oct  7 09:36:10 lists-1 bulk[23312]: notice Sympa::Mailer::store() Done sending message Sympa::Message <5.5.1633595548.1633595553.557707.YYY@lists.cam.ac.uk_z,11305,9441/z/shelved:arc_cv=pass;dmarc_protect> for Sympa::List <YYY@lists.cam.ac.uk> (priority 5) in 222 seconds since scheduled expedition date
Oct  7 09:51:14 lists-1 bulk[14118]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1339 seconds after scheduled expedition date)
Oct  7 09:59:26 lists-1 bulk[23312]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1618 seconds after scheduled expedition date)
Oct  7 10:59:21 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 5213 seconds after scheduled expedition date)

I infer that Sympa really doesn't like large emails, although I'm not going to be able to stop people sending them now they all have 100 Gbyte quotas on Exchange Online.

dpc22 commented 3 years ago

Here's a graph of load average for the last 12 hours, demonstrating what happened at 09:30 this morning. The system was running flat out until I ran systemctl restart sympa to restart all of the Sympa mail delivery daemons:

la

dpc22 commented 3 years ago

One final comment for the day:

It looks like the 4 bulk.pl processes each stopped processing anything new for about 90 minutes, but not all at the same time:

Oct  7 09:31:13 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 138 seconds after scheduled expedition date)
<no further mentions of bulk[14089] until:>
Oct  7 10:56:00 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <1.5.1633597195.1633597194.442549.sympa@lists.cam.ac.uk_s,11305,8526/s/shelved:dkim_sign> to lists.cam.ac.uk (priority 1) (starting 3365 seconds after scheduled expedition date)

Oct  7 09:35:03 lists-1 bulk[11316]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <XXX@lists.cam.ac.uk> (priority 5) (starting 155 seconds after scheduled expedition date)
<no further mention of bulk[11316] until:>
Oct  7 11:00:08 lists-1 bulk[11316]: notice Sympa::Mailer::store() Done sending message Sympa::Message <5.5.1633596810.1633596811.442688.ZZZ@lists.cam.ac.uk_z,11305,9627/z/shelved:arc_cv=none;dmarc_protect> for Sympa::List <ZZZ@lists.cam.ac.uk> (priority 5) in 3998 seconds since scheduled expedition date

Oct  7 09:51:14 lists-1 bulk[14118]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1339 seconds after scheduled expedition date)
<no further mention of bulk[14118] until:>
Oct  7 11:30:30 lists-1 bulk[14118]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <1.5.1633600790.1633600790.822450.ZZZ@lists.cam.ac.uk_s,28721,1495/s/shelved:dkim_sign> to Sympa::List <ZZZ@lists.cam.ac.uk> (priority 1) (starting 1840 seconds after scheduled expedition date)

Oct  7 09:59:26 lists-1 bulk[23312]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1618 seconds after scheduled expedition date)
<no further mention of bulk[23312]: until I restarted Sympa at:>
Oct  7 11:38:21 lists-1 bulk[23312]: notice main::sigterm() Signal TERM received, still processing current task

I think that the machine was slowing recovering by 11:38, but it recovered much faster after systemctl restart sympa, and there is no evidence of any further VERP involving YYY@lists.cam.ac.uk after that point.

All four bulk.pl processes were stuck on the same YYY@lists.cam.ac.uk list, the one which had received two 9.5 MByte messages around 09:30. ZZZ@lists.cam.ac.uk indicates other, unrelated lists. I observe that a couple of the ZZZ@lists.cam.ac.uk messages are "priority 1". I don't know if those prempt priority 5 messages which are already running.

ikedas commented 3 years ago

@dpc22 , what is "invalid address on the list"?

dpc22 commented 3 years ago

An address which used to be valid but which has expired, so Sympa is collecting a bounce score the subscriber address.

We are a University, so almost 10,000 students arrive and leave each year. After they leave emails to XXX@cam.ac.uk generate a bounce of the form:

XXX@cam.ac.uk is undeliverable: <xxx@cam.ac.uk> is not a known user on this system;
    see https://help.uis.cam.ac.uk/email-bounce

Or maybe:

XXX@cam.ac.uk is undeliverable: Address no longer in use - try AAA@BBB instead

I typically purge invalid/expired @cam.ac.uk addresses from Mailman lists once a year (using a remove_bad_local script), a couple of weeks after the undergraduate accounts are cancelled. The list in question belongs to a postgraduate college. Their students are typically here until October rather than leaving in July, so they have a higher proportion of invalid addresses than most of our lists.

The list in question has 1506 subscribers. 103 of those (6.8%) are listed if I dump the bouncing addresses.

Those 103 invalid addresses appear to be enough to stop 4*bulk,pl in their tracks if two people send 9.5 MBytes to the list at about the same time. I have just increased the virtual machine hosting the service to 4 CPUs and 32 GBytes of RAM so that I can increase the number of bulk.pl processes in use.

I'm going to spend some time today writing a remove_bad_local.pl script so that I can clean up some of these lists which have lots of bad addresses. However I think that there is an issue here which needs to be resolved.

dpc22 commented 3 years ago

Another data point. Here is partial output from:

 perl -ne 'print "$1 $2\n" if m&^Oct  7 (\S+) lists-1 bulk\[\d+\].+in (\d+ seconds since scheduled expedition date)&' sympa.1

I infer that "seconds since scheduled expedition" is equivalent to "time spent on queue". Note the large gap between 09:53:48 and 10:28:10, and again up to 10:56:00, when things suddenly exploded back into life.

09:52:18 13 seconds since scheduled expedition date
09:52:24 2 seconds since scheduled expedition date
09:52:33 4 seconds since scheduled expedition date
09:52:33 4 seconds since scheduled expedition date
09:52:42 4 seconds since scheduled expedition date
09:53:48 20 seconds since scheduled expedition date
10:28:10 2 seconds since scheduled expedition date
10:31:27 6 seconds since scheduled expedition date
10:39:32 2 seconds since scheduled expedition date
10:51:09 2 seconds since scheduled expedition date
10:56:00 3365 seconds since scheduled expedition date
10:56:00 3322 seconds since scheduled expedition date
10:56:01 3323 seconds since scheduled expedition date
10:56:01 3323 seconds since scheduled expedition date
10:56:01 3220 seconds since scheduled expedition date
10:56:01 3219 seconds since scheduled expedition date
10:56:01 3219 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
dpc22 commented 3 years ago

Okay bulk.pl fails to start if I set:

bulk_max_count = 6    # Twice the default of 3

I think that the cause is:

Oct  8 08:32:42 lists-1 bulk[34718]: err main::#160 > Sympa::Spindle::spin#83 > Sympa::Spindle::ProcessOutgoing::_twist#422 > Sympa::Mailer::store#118 DIED: Illegal division by zero at /usr/share/sympa/lib/Sympa/Mailer.pm line 118.

which is:

my $maxsmtp =
    int($Conf::Conf{'maxsmtp'} / ($self->{redundancy} || 1)) || 1;

Are there constraints about valid values for bulk_max_count ?

ikedas commented 3 years ago

@dpc22 , can you check verp_rate parameter?

dpc22 commented 3 years ago

verp_rate is set to 0 on the list: that is the default value, which I have not changed.

It appears that Sympa ignores verp_rate and always uses VERP for known bouncing addresses. It isn't clear to me why that is the case.

dpc22 commented 3 years ago

https://github.com/sympa-community/sympa/issues/837

"As for Sympa, it will automatically use VERP on adresses that have produced errors already. Is it the case for you?"

dpc22 commented 3 years ago

systemctl restart sympa at 11:38 yesterday morning appears to have shunted a couple of directories from /var/spool/sympa/bulk/pct/ to /var/spool/sympa/bulk/bad/pct

$ ls -l /var/spool/sympa/bulk/bad/pct/
total 8
drwxr-x--- 2 sympa sympa 4096 Oct  7 11:38 5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061
drwxr-x--- 2 sympa sympa 4096 Oct  7 11:38 5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194

I infer that is what stopped the VERP processing cascade.

ikedas commented 3 years ago

837

"As for Sympa, it will automatically use VERP on adresses that have produced errors already. Is it the case for you?"

Yes, it can be a case.

dpc22 commented 3 years ago

The contents of my /var/spool/sympa/bulk/bad directory are currently:

403701785    4 drwxr-x---   4 sympa    sympa        4096 Jun 19  2019 .
403701786    4 drwxr-x---   2 sympa    sympa        4096 Oct  7 11:38 ./msg
403888748 6728 -rw-r-----   1 sympa    sympa     6886657 Oct  7 11:38 ./msg/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194
403701787    4 drwxr-x---   4 sympa    sympa        4096 Oct  7 11:38 ./pct
488113070    4 drwxr-x---   2 sympa    sympa        4096 Oct  7 11:38 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194
529536664    4 -rw-r-----   1 sympa    sympa         248 Oct  7 10:49 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194/0005
529536665    4 -rw-r-----   1 sympa    sympa         249 Oct  7 10:49 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194/0006
488113047    4 drwxr-x---   2 sympa    sympa        4096 Oct  7 11:38 ./pct/5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061

I infer that bulk.pl had almost finished with VERP bounces for the two messages to YYY@lists.cam.ac.uk as there are no pct/*YYY*/000X or msg/*YYY*/* files outstanding (maybe some data was cached in the running bulk.pl?). It is possible that it would have just recovered if I had been patient, although 2 hours work for two messages to a single list is a lot.

A separate message 6 MByte message to ZZZ@lists.cam.ac.uk was also pending. "0005" and "0006" each contain 11 (presumably suspect) email addresses scheduled for a VERP test. I infer that it had already finished with 0001 through 0004.

dpc22 commented 3 years ago

Stupid question time. Is there a command line tool that would allow me to remove a specific set of subscribers from a list? I have written a Perl script which dumps the current membership using:

sympa.pl --dump_users --list test-dpc22@lists.cam.ac.uk --role=member

and manipulates the member.dump file which is generated to remove invalid/expired XXX@cam.ac.uk addresses. However the reverse operation:

sympa.pl --restore_users --list test-dpc22@lists.cam.ac.uk --role=member

only appears to allow you do add new users, not remove existing ones. Thanks.

dpc22 commented 3 years ago

Okay, trial and error revealed that the additional VERP is caused by: ../lib/Sympa/Spindle/ToList.pm

       $new_message->{shelved}{tracking} ||= 'verp';

Is there any particular reason that I shouldn't just comment that line out? I don't think that VERP does anything useful in my environment, and it runs like a snail when people send large messages to lists with lots of invalid/expired addresses.

The bounce messages generated by normal non-VERP messages sent to the list are good enough to trigger the bounce count for invalid@cam.ac.uk, so right now I'm struggling to see the point in VERP at all. It is possible that I am missing something obvious.

agouaux commented 3 years ago

I’d love to see such functionality as well! (We’re also a university and so also have to perform mass cleaning….)

On Fri, Oct 8, 2021 at 08:15 dpc22 @.***> wrote:

Stupid question time. Is there a command line tool that would allow me to remove a specific set of subscribers from a list? I have written a Perl script which dumps the current membership using:

sympa.pl --dump_users --list @.*** --role=member

and manipulates the member.dump file which is generated to remove invalid/expired @.*** addresses. However the reverse operation:

sympa.pl --restore_users --list @.*** --role=member

only appears to allow you do add new users, not remove existing ones. Thanks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sympa-community/sympa/issues/1246#issuecomment-938634836, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGQY2NONBQ2A3USUS33PEL3UF3VIPANCNFSM5FRM26PA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

dpc22 commented 3 years ago

I'm pretty sure that the repeat VERP messages to a single invalid/expired recipient are a consequence of a high priority message interrupting a long running job, which is restarted later on. Note the the same Sympa::Message is involved in all the below. That would explain three separate messages to each VERP recipient for Mesage-ID:

LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM

Oct  7 09:35:03 lists-1 bulk[11316]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 155 seconds after scheduled expedition date)
Oct  7 09:59:26 lists-1 bulk[23312]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1618 seconds after scheduled expedition date)
Oct  7 10:59:21 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 5213 seconds after scheduled expedition date)
dpc22 commented 3 years ago

I have spent some time looking at the code, and I am pretty sure that it is safe to just disable the following in ./lib/Sympa/Spindle/ToList.pm

$new_message->{shelved}{tracking} ||= 'verp';

That should cause .../lib/Sympa/Spindle/ProcessOutgoing.pm to fall back to a single message to all of the bouncing addresses.

    } else {
        # message doesn't need personalization, so can be sent by packet.

That certainly works on a test rig. I might try that on my live servers tomorrow/Monday if I am feeling brave.

dpc22 commented 3 years ago

So, that worked rather nicely.

Long term/15 minute Load average for last 14 days.

The two changes from https://github.com/sympa-community/sympa/issues/1238 were deployed on 30th September. That removed the consistent background load on the system for list digests and task_manager.pl.

I commented out:

$new_message->{shelved}{tracking} ||= 'verp';

this morning, and that seems to have stopped the occasional load spikes that you can see last working week (Monday 4rd October to 8th October). I think that they were all caused by VERP for invalid/expired addresses.

These three changes combined have substantially improved my experience of Sympa. It is still a bit of a memory hog, but it is no longer a CPU hog with 4300 active lists, and 3300 to migrate from Mailman.

dpc22 commented 3 years ago

I spent some time thinking about VERP and tracking at the weekend.

I'm pretty sure than "tracking=VERP" is only useful when verp_rate is greater than 0% on a given list. In that situation VERP might pick up bounces that the normal bounce processor would not pick up. Once you have found a bounce with VERP, it is important that you continue to use VERP. Otherwise the next message without VERP will reset the bounce state. I suggest:

if  ($list->{'admin'}{'verp_rate'} ne "0%") {
    $new_message->{shelved}{tracking} ||= 'verp';
}

None of my lists use VERP. In fact I disabled it completely this morning in /etc/sympa/edit_list.conf, as it seems to cause more harm than good in my environment.

dpc22 commented 3 years ago

@ikedas:

I seem to have ended up with a number of different issues and suggestions on this thread:

  1. Switching off tracking=verp entirely when verp_rate is 0% on a list.
  2. Duplicate messages when tracking=verp is enabled (maybe caused by high priority messages?). I don't actually care if I can just switch the tracking off for my lists
  3. bulk.pl dying with DIED: Illegal division by zero if I try to set bulk_max_count = 6
  4. A request for a command line interface to remove addresses from a Sympa lists, so that I can write a remove_bad_local.pl script to purge my lists.

Should I create a number of separate tickets? Or would like me to shut up for a bit?

racke commented 3 years ago

@dpc22: thanks for your exhaustive report :+1:

I would like to see separate tickets for 1.-3. For 4. please check #911.

ikedas commented 3 years ago

@dpc22,

dpc22 commented 3 years ago
  1. Thanks
  2. Yes, I think so (one address had 5 emails sent out as a consequence of two actual emails sent to a list). There is quite a lot of detail above. I will see if I can separate out the pertinent bits
  3. Will do.
  4. Excellent. It looks like that was only merged last week, so that was good timing.