Open dpc22 opened 3 years ago
Here's a graph of load average for the last 12 hours, demonstrating what happened at 09:30 this morning. The system was running flat out until I ran systemctl restart sympa to restart all of the Sympa mail delivery daemons:
One final comment for the day:
It looks like the 4 bulk.pl processes each stopped processing anything new for about 90 minutes, but not all at the same time:
Oct 7 09:31:13 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 138 seconds after scheduled expedition date)
<no further mentions of bulk[14089] until:>
Oct 7 10:56:00 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <1.5.1633597195.1633597194.442549.sympa@lists.cam.ac.uk_s,11305,8526/s/shelved:dkim_sign> to lists.cam.ac.uk (priority 1) (starting 3365 seconds after scheduled expedition date)
Oct 7 09:35:03 lists-1 bulk[11316]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <XXX@lists.cam.ac.uk> (priority 5) (starting 155 seconds after scheduled expedition date)
<no further mention of bulk[11316] until:>
Oct 7 11:00:08 lists-1 bulk[11316]: notice Sympa::Mailer::store() Done sending message Sympa::Message <5.5.1633596810.1633596811.442688.ZZZ@lists.cam.ac.uk_z,11305,9627/z/shelved:arc_cv=none;dmarc_protect> for Sympa::List <ZZZ@lists.cam.ac.uk> (priority 5) in 3998 seconds since scheduled expedition date
Oct 7 09:51:14 lists-1 bulk[14118]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595335.1633595335.189988.YYY@lists.cam.ac.uk_0,11305,5650/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1339 seconds after scheduled expedition date)
<no further mention of bulk[14118] until:>
Oct 7 11:30:30 lists-1 bulk[14118]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <1.5.1633600790.1633600790.822450.ZZZ@lists.cam.ac.uk_s,28721,1495/s/shelved:dkim_sign> to Sympa::List <ZZZ@lists.cam.ac.uk> (priority 1) (starting 1840 seconds after scheduled expedition date)
Oct 7 09:59:26 lists-1 bulk[23312]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1618 seconds after scheduled expedition date)
<no further mention of bulk[23312] until I restarted Sympa at:>
Oct 7 11:38:21 lists-1 bulk[23312]: notice main::sigterm() Signal TERM received, still processing current task
I think that the machine was slowly recovering by 11:38, but it recovered much faster after systemctl restart sympa, and there is no evidence of any further VERP involving YYY@lists.cam.ac.uk after that point.
All four bulk.pl processes were stuck on the same YYY@lists.cam.ac.uk list, the one which had received two 9.5 MByte messages around 09:30. ZZZ@lists.cam.ac.uk indicates other, unrelated lists. I observe that a couple of the ZZZ@lists.cam.ac.uk messages are "priority 1"; I don't know whether those preempt priority 5 messages which are already running.
@dpc22, what is "invalid address on the list"?
An address which used to be valid but which has expired, so Sympa is accumulating a bounce score for the subscriber address.
We are a University, so almost 10,000 students arrive and leave each year. After they leave, emails to XXX@cam.ac.uk generate a bounce of the form:
XXX@cam.ac.uk is undeliverable: <xxx@cam.ac.uk> is not a known user on this system;
see https://help.uis.cam.ac.uk/email-bounce
Or maybe:
XXX@cam.ac.uk is undeliverable: Address no longer in use - try AAA@BBB instead
I typically purge invalid/expired @cam.ac.uk addresses from Mailman lists once a year (using a remove_bad_local script), a couple of weeks after the undergraduate accounts are cancelled. The list in question belongs to a postgraduate college. Their students are typically here until October rather than leaving in July, so they have a higher proportion of invalid addresses than most of our lists.
The list in question has 1506 subscribers. 103 of those (6.8%) are listed if I dump the bouncing addresses.
Those 103 invalid addresses appear to be enough to stop all four bulk.pl processes in their tracks if two people send 9.5 MBytes to the list at about the same time. I have just increased the virtual machine hosting the service to 4 CPUs and 32 GBytes of RAM so that I can increase the number of bulk.pl processes in use.
I'm going to spend some time today writing a remove_bad_local.pl script so that I can clean up some of these lists which have lots of bad addresses. However, I think that there is an issue here which needs to be resolved.
Another data point. Here is partial output from:
perl -ne 'print "$1 $2\n" if m&^Oct 7 (\S+) lists-1 bulk\[\d+\].+in (\d+ seconds since scheduled expedition date)&' sympa.1
I infer that "seconds since scheduled expedition" is equivalent to "time spent on queue". Note the large gap between 09:53:48 and 10:28:10, and again up to 10:56:00, when things suddenly exploded back into life.
09:52:18 13 seconds since scheduled expedition date
09:52:24 2 seconds since scheduled expedition date
09:52:33 4 seconds since scheduled expedition date
09:52:33 4 seconds since scheduled expedition date
09:52:42 4 seconds since scheduled expedition date
09:53:48 20 seconds since scheduled expedition date
10:28:10 2 seconds since scheduled expedition date
10:31:27 6 seconds since scheduled expedition date
10:39:32 2 seconds since scheduled expedition date
10:51:09 2 seconds since scheduled expedition date
10:56:00 3365 seconds since scheduled expedition date
10:56:00 3322 seconds since scheduled expedition date
10:56:01 3323 seconds since scheduled expedition date
10:56:01 3323 seconds since scheduled expedition date
10:56:01 3220 seconds since scheduled expedition date
10:56:01 3219 seconds since scheduled expedition date
10:56:01 3219 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:02 3220 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:03 3221 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
10:56:04 3222 seconds since scheduled expedition date
Okay, bulk.pl fails to start if I set:
bulk_max_count = 6 # Twice the default of 3
I think that the cause is:
Oct 8 08:32:42 lists-1 bulk[34718]: err main::#160 > Sympa::Spindle::spin#83 > Sympa::Spindle::ProcessOutgoing::_twist#422 > Sympa::Mailer::store#118 DIED: Illegal division by zero at /usr/share/sympa/lib/Sympa/Mailer.pm line 118.
which is:
my $maxsmtp =
int($Conf::Conf{'maxsmtp'} / ($self->{redundancy} || 1)) || 1;
Are there constraints on valid values for bulk_max_count?
verp_rate is set to 0 on the list: that is the default value, which I have not changed.
It appears that Sympa ignores verp_rate and always uses VERP for known bouncing addresses. It isn't clear to me why that is the case.
https://github.com/sympa-community/sympa/issues/837
"As for Sympa, it will automatically use VERP on adresses that have produced errors already. Is it the case for you?"
Running systemctl restart sympa at 11:38 yesterday morning appears to have shunted a couple of directories from /var/spool/sympa/bulk/pct/ to /var/spool/sympa/bulk/bad/pct/:
$ ls -l /var/spool/sympa/bulk/bad/pct/
total 8
drwxr-x--- 2 sympa sympa 4096 Oct 7 11:38 5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061
drwxr-x--- 2 sympa sympa 4096 Oct 7 11:38 5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194
I infer that is what stopped the VERP processing cascade.
#837:
"As for Sympa, it will automatically use VERP on adresses that have produced errors already. Is it the case for you?"
Yes, that can be the case.
The contents of my /var/spool/sympa/bulk/bad directory are currently:
403701785 4 drwxr-x--- 4 sympa sympa 4096 Jun 19 2019 .
403701786 4 drwxr-x--- 2 sympa sympa 4096 Oct 7 11:38 ./msg
403888748 6728 -rw-r----- 1 sympa sympa 6886657 Oct 7 11:38 ./msg/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194
403701787 4 drwxr-x--- 4 sympa sympa 4096 Oct 7 11:38 ./pct
488113070 4 drwxr-x--- 2 sympa sympa 4096 Oct 7 11:38 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194
529536664 4 -rw-r----- 1 sympa sympa 248 Oct 7 10:49 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194/0005
529536665 4 -rw-r----- 1 sympa sympa 249 Oct 7 10:49 ./pct/5.5.1633600173.1633600173.240042.ZZZ@lists.cam.ac.uk_z,11305,8194/0006
488113047 4 drwxr-x--- 2 sympa sympa 4096 Oct 7 11:38 ./pct/5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061
I infer that bulk.pl had almost finished with VERP bounces for the two messages to YYY@lists.cam.ac.uk, as there are no pct/*YYY*/000X or msg/*YYY*/* files outstanding (maybe some data was cached in the running bulk.pl?). It is possible that it would have just recovered if I had been patient, although 2 hours' work for two messages to a single list is a lot.
A separate 6 MByte message to ZZZ@lists.cam.ac.uk was also pending. "0005" and "0006" each contain 11 (presumably suspect) email addresses scheduled for a VERP test. I infer that it had already finished with 0001 through 0004.
Stupid question time. Is there a command line tool that would allow me to remove a specific set of subscribers from a list? I have written a Perl script which dumps the current membership using:
sympa.pl --dump_users --list test-dpc22@lists.cam.ac.uk --role=member
and manipulates the member.dump file which is generated to remove invalid/expired XXX@cam.ac.uk addresses. However the reverse operation:
sympa.pl --restore_users --list test-dpc22@lists.cam.ac.uk --role=member
only appears to allow you to add new users, not remove existing ones. Thanks.
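In the meantime, the filtering step itself is easy enough to script against the dump. A minimal sketch (the file names member.dump and bad_addresses.txt are hypothetical, and the blank-line-separated record format with a leading "email" line is assumed from the --dump_users output):

```shell
# Drop records for known-bad subscribers from a dumped member list.
# member.dump      : blank-line-separated records, each with an "email" line
# bad_addresses.txt: one invalid address per line
perl -00 -ne '
    BEGIN {
        open my $fh, "<", "bad_addresses.txt" or die $!;
        chomp(my @bad = <$fh>);
        %bad = map { $_ => 1 } @bad;
    }
    # -00 reads blank-line-separated records; skip any record whose
    # "email" field matches a known-bad address
    print unless /^email\s+(\S+)/m && $bad{$1};
' member.dump > member.clean.dump
```

The resulting member.clean.dump could then be fed back once a true restore-with-removal (or a dedicated deletion option) exists.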
Okay, trial and error revealed that the additional VERP is caused by this line in ../lib/Sympa/Spindle/ToList.pm:
$new_message->{shelved}{tracking} ||= 'verp';
Is there any particular reason that I shouldn't just comment that line out? I don't think that VERP does anything useful in my environment, and it runs like a snail when people send large messages to lists with lots of invalid/expired addresses.
The bounce messages generated by normal non-VERP messages sent to the list are good enough to trigger the bounce count for invalid@cam.ac.uk, so right now I'm struggling to see the point in VERP at all. It is possible that I am missing something obvious.
I’d love to see such functionality as well! (We’re also a university and so also have to perform mass cleaning….)
On Fri, Oct 8, 2021 at 08:15, dpc22 wrote:
> Is there a command line tool that would allow me to remove a specific set of subscribers from a list?
I'm pretty sure that the repeated VERP messages to a single invalid/expired recipient are a consequence of a high-priority message interrupting a long-running job, which is restarted later on. Note that the same Sympa::Message is involved in all of the log entries below. That would explain three separate messages to each VERP recipient for Message-ID:
LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM
Oct 7 09:35:03 lists-1 bulk[11316]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 155 seconds after scheduled expedition date)
Oct 7 09:59:26 lists-1 bulk[23312]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 1618 seconds after scheduled expedition date)
Oct 7 10:59:21 lists-1 bulk[14089]: notice Sympa::Spindle::ProcessOutgoing::_twist() Start sending message Sympa::Message <5.5.1633595548.1633595553.612842.YYY@lists.cam.ac.uk_0,11305,8061/0/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect;tracking=verp> to Sympa::List <YYY@lists.cam.ac.uk> (priority 5) (starting 5213 seconds after scheduled expedition date)
I have spent some time looking at the code, and I am pretty sure that it is safe to just disable the following in ./lib/Sympa/Spindle/ToList.pm
$new_message->{shelved}{tracking} ||= 'verp';
That should cause .../lib/Sympa/Spindle/ProcessOutgoing.pm to fall back to a single message to all of the bouncing addresses:
} else {
# message doesn't need personalization, so can be sent by packet.
That certainly works on a test rig. I might try that on my live servers tomorrow/Monday if I am feeling brave.
So, that worked rather nicely.
Long term/15 minute Load average for last 14 days.
The two changes from https://github.com/sympa-community/sympa/issues/1238 were deployed on 30th September. That removed the consistent background load on the system for list digests and task_manager.pl.
I commented out:
$new_message->{shelved}{tracking} ||= 'verp';
this morning, and that seems to have stopped the occasional load spikes that you can see during last working week (Monday 4th October to 8th October). I think that they were all caused by VERP for invalid/expired addresses.
These three changes combined have substantially improved my experience of Sympa. It is still a bit of a memory hog, but it is no longer a CPU hog with 4300 active lists, and 3300 to migrate from Mailman.
I spent some time thinking about VERP and tracking at the weekend.
I'm pretty sure that "tracking=verp" is only useful when verp_rate is greater than 0% on a given list. In that situation VERP might pick up bounces that the normal bounce processor would not. Once you have found a bounce with VERP, it is important to continue using VERP; otherwise the next message without VERP will reset the bounce state. I suggest:
    if ($list->{'admin'}{'verp_rate'} ne "0%") {
        $new_message->{shelved}{tracking} ||= 'verp';
    }
None of my lists use VERP. In fact I disabled it completely this morning in /etc/sympa/edit_list.conf, as it seems to cause more harm than good in my environment.
@ikedas:
I seem to have ended up with a number of different issues and suggestions on this thread:
DIED: Illegal division by zero if I try to set bulk_max_count = 6
Should I create a number of separate tickets? Or would you like me to shut up for a bit?
@dpc22: thanks for your exhaustive report :+1:
I would like to see separate tickets for 1.-3. For 4. please check #911.
@dpc22,
Version
6.2.66
Installation method
My own RPM, derived from the "official" RHEL rpms
Expected behavior
Mail delivery in Sympa shouldn't grind to a halt because of a couple of (admittedly large) messages sent to a list with large numbers of invalid subscribers.
Actual behavior
This appears to be the case. I appreciate that this might just be a lack of understanding on my part in terms of the tunables that are available. At the moment we are running 4 bulk.pl processes and a single sympa_msg.pl. I plan to increase to 8 bulk.pl and maybe 2 sympa_msg.pl tomorrow morning, although the individual daemons do seem to consume a lot of memory.
Additional information
Two separate people this morning sent 9.5 MByte messages to a single mailing list which has 1506 subscribers (6.8% of which are invalid: typically expired @cam.ac.uk addresses which should be removed from the list).
(I am using XXX and YYY to censor usernames and list names respectively below. I hope that I have done this consistently.)
The email was delivered to the valid members of the list within about 5 minutes in each case. However something strange happened with the invalid addresses:
Sympa generated VERP messages to these recipients, despite the fact that VERP is not enabled on the list (verp_rate=0%). A bit of Googling suggests that this is deliberate, although it is not clear why, given that Sympa is clearly able to identify the invalid addresses on the list without needing to use VERP.
Much stranger: Sympa sent multiple VERP messages for each invalid address on the list:
This is the same (invalid) XXX@cam.ac.uk email address in all 5 cases. Each message was 9.5 MBytes in size.
Three of the messages are Message-ID: LO3P123MB289038B43701E5A8D1C317F095B19@LO3P123MB2890.GBRP123.PROD.OUTLOOK.COM while two are Message-ID: LOYP123MB33430252A499F53A5B0719A3CFB19@LOYP123MB3343.GBRP123.PROD.OUTLOOK.COM
These correspond to the two messages sent to the list at 09:19:25 and 09:32:33 this morning. Sympa was still generating VERP requests and processing the bounces 2 hours later.
I spotted that there was a problem at around 11:30: Sympa was spending all of its time generating and processing these VERP probes, and no useful work seemed to be taking place.
I restarted Sympa (systemctl restart sympa) and the VERP requests seemed to stop immediately (how are these tracked?).
It then took the machine about 30 minutes to catch up from the backlog of email which had formed.
Does anyone have a suggestion as to what might be going on here? I can understand one VERP message, but not multiple for every invalid recipient on the list.
Here are the Sympa logs which mention the specific @cam.ac.uk email address above (they are all bounce messages processed by bounced).
I can also see the following. I don't know what "starting 5213 seconds after scheduled expedition date" means.
I infer that Sympa really doesn't like large emails, although I'm not going to be able to stop people sending them now that they all have 100 GByte quotas on Exchange Online.