Open dpc22 opened 2 months ago
Hi @dpc22,
I think that the only significant customization is Steve Shipley's modlist plugin. I'm not in a position to disable this on our live lists servers as (literally) thousands of list owners depend on this. Other that this there are mostly fixes that you yourself have proposed to problems that I have had with 6.2.72, namely:
https://github.com/sympa-community/sympa/issues/1685 https://github.com/sympa-community/sympa/issues/1699 https://github.com/sympa-community/sympa/issues/1725 https://github.com/sympa-community/sympa/issues/1862
I do have a test server which is fair game, but I am unable to reproduce the issue at the moment either on the live server or the test server, which unfortunately is often the case with awkward race conditions.
I attached the logs to the original bug report as a text attachment "sympa.txt". Everything happens in the space of 4 seconds, and appears to be a consequence of my user clicking a link twice in the moderation message that I quoted:
To distribute the attached message in list ucam-language-sciences:
https://lists.cam.ac.uk/sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37
Here are the corresponding entries in my Apache access_log, which correspond to log entries in the sympa.txt log.
10.249.206.106 - - [02/Sep/2024:09:29:36 +0100] "GET /sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37 HTTP/1.1" 200 16052 "https://shib.raven.cam.ac.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15"
10.249.206.106 - - [02/Sep/2024:09:29:40 +0100] "GET /sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37 HTTP/1.1" 200 16691 "https://shib.raven.cam.ac.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15"
At the moment the obvious smoking gun is the following file in /var/spool/sympa/moderation/
-rw-r----- 1 sympa sympa 17859 Sep 2 09:29 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
which is the missing message.
This file is still there 23 hours later, and remains the only file with a ".distribute" suffix in that directory. I don't know if these .distribute files time out after a certain number of days or weeks. The fact that there is only one of them feels significant. What we are looking at is not normal on a busy Sympa server with 8000 active lists.
I think that the only significant customization is Steve Shipley's modlist plugin. I'm not in a position to disable this on our live lists servers as (literally) thousands of list owners depend on this. Other that this there are mostly fixes that you yourself have proposed to problems that I have had with 6.2.72, namely:
1685 #1699 #1725 #1862
I do have a test server which is fair game, but I am unable to reproduce the issue at the moment either on the live server or the test server, which unfortunately is often the case with awkward race conditions.
By undoing all these customizations and modifications It would be more helpful for other users to try to reproduce the problem. But that is not absolutely necessary.
In any case, you need to reproduce the problem on your end to get a clue as to how to solve it.
I attached the logs to the original bug report as a text attachment "sympa.txt". Everything happens in the space of 4 seconds, and appears to be a consequence of my user clicking a link twice in the moderation message that I quoted:
To distribute the attached message in list ucam-language-sciences: https://lists.cam.ac.uk/sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37
If you saw this log and decided that it was a sign that "someone clicks on that link twice in rapid succession", you are wrong. When the user follows on the disribute
link, they are demanded confirmation, and the process is executed only when they click on the submit button.
Visiting the link and clicking on the button will log do_distribute
action twice. Thus, this is not a user accidentally clicking twice in a short period. This is expected normal operation.
Here are the corresponding entries in my Apache access_log, which correspond to log entries in the sympa.txt log.
10.249.206.106 - - [02/Sep/2024:09:29:36 +0100] "GET /sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37 HTTP/1.1" 200 16052 "https://shib.raven.cam.ac.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15" 10.249.206.106 - - [02/Sep/2024:09:29:40 +0100] "GET /sympa/distribute/ucam-language-sciences/e9f3e269192343c654798f6681a91f37 HTTP/1.1" 200 16691 "https://shib.raven.cam.ac.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15"
At the moment the obvious smoking gun is the following file in /var/spool/sympa/moderation/
-rw-r----- 1 sympa sympa 17859 Sep 2 09:29 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
which is the missing message.
This file is still there 23 hours later, and remains the only file with a ".distribute" suffix in that directory. I don't know if these .distribute files time out after a certain number of days or weeks. The fact that there is only one of them feels significant. What we are looking at is not normal on a busy Sympa server with 8000 active lists.
Renaming u(...)7.distribute
in moderation spool to u(...)7
may allow the message to be distributed again. By this, you may try to reproduce the problem.
If you saw this log and decided that it was a sign that "someone clicks on that link twice in rapid succession", you are wrong. When the user follows on the disribute link, they are demanded confirmation, and the process is executed only when they click on the submit button.
I just tried this on a test list and there was no confirmation dialogue.
The following URL:
https://lists.cam.ac.uk/sympa/distribute/test-dpc22/ebe7220098759658fd75ad6d302bd450
flashed up a message "distribute: action will complete soon", and then showed me the output of:
https://lists.cam.ac.uk/sympa/modindex/test-dpc22
without any HTTP redirect. In fact I ran the same test before I submitted the original ticket to confirm that there wasn't a confirmation dialogue, before I asserted that this was the case there.
Is is possible this behaviour is influenced by a setting in sympa.conf or a scenari rule? The only custom scenari that we have are send.* rules.
We do use web_sso in /etc/sympa/auth.conf (Shibboleth), but I do see the same thing if I use a username and password to authenticate instead.
Had the message been distributed?
I see a different message in my Web browser if the message was already distributed:
ERROR (distribute) - Unable to access the message on list test-dpc22 with key ebe7220098759658fd75ad6d302bd450. Warning: this message may already have been sent by one of the list's moderators.
(I think that is what you are asking here).
What about log output?
I think that you have all the log output that I have now: the sympa.txt attachment at the bottom of the original bug report, plus the two lines from the Apache access_log later on.
I'm happy to ramp up logging on my test server or add custom logging to either the test or live system, but that probably isn't going to get us very far if I can't reproduce an issue reported by a random list owners. I don't think that this can be common given that this is the first time that a list owner has reported a problem to me.
I think that you have all the log output that I have now: the sympa.txt attachment at the bottom of the original bug report, plus the two lines from the Apache access_log later on.
No. As long as the operation was performed at different times and under different conditions, it is impossible for the exact same log to be output. Unless objective information is provided that can reproduce the problem, it cannot be investigated for resolution.
I'm going to bed for now. Please continue to keep us informed if you like.
I wrote:
If you saw this log and decided that it was a sign that "someone clicks on that link twice in rapid succession", you are wrong. When the user follows on the
disribute
link, they are demanded confirmation, and the process is executed only when they click on the submit button.Visiting the link and clicking on the button will log
do_distribute
action twice. Thus, this is not a user accidentally clicking twice in a short period. This is expected normal operation.
This was my misunderstanding. Strangely, Sympa currently confirms the distribute
operation only when the message needs to be tagged. This is a bug and should be fixed separately.
Anyway, it is normal behaviour that "Unable to access the message" is reported on the second click. The anomaly is that the first click has not initiated message delivery.
More information is awaited.
No. As long as the operation was performed at different times and under different conditions, it is impossible for the exact same log to be output. Unless objective information is provided that can reproduce the problem, it cannot be investigated for resolution.
I only have a single operation which led to a missing message, and I attached the logs relating to that event to this ticket.
The only part that I elided was the original message delivery which generated the moderation request, which I can include if you like. The list moderator (who was the same person to send the message) clicked on the "distribute" link in the moderation message that she received 27 seconds later, and again 4 seconds after that.
I agree that we need a reproducible test case, but as I indicated yesterday I'm not having any success in generating one.
The only obvious curiosity that I have to work with is the following in /var/spool/sympa/moderation/
-rw-r----- 1 sympa sympa 17859 Sep 2 09:29 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
which is still there two days later. This remains the only ".distribute" file in that directory, on a busy Sympa system.
If I have some spare time later today I will trace through the code and see if I can work out what generated that specific file, which appears to have been orphaned.
FWIW, a successful moderation cycle on the same list a few hours later looks like the following:
Sep 2 11:06:28 lists-1 wwsympa[3097453]: info main::do_distribute(9420992658abbf17fc12def51d885605) [robot lists.cam.ac.uk] [session 86881337951441] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 11:06:28 lists-1 wwsympa[3097453]: info main::do_modindex() [robot lists.cam.ac.uk] [session 86881337951441] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Request::Message::__parse() Parsing: QUIET DISTRIBUTE ucam-language-sciences 9420992658abbf17fc12def51d885605
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spindle::ProcessMessage::_twist() Processing Sympa::Request <action=distribute;context=ucam-language-sciences@lists.cam.ac.uk>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spool::store() Sympa::Message <ucam-language-sciences@lists.cam.ac.uk_9420992658abbf17fc12def51d885605.distribute/shelved:arc_cv=pass;dkim_sign> is stored into Sympa::Spool::Archive as <1725271590.1725271590.678475.ucam-language-sciences@lists.cam.ac.uk,1946661,9580>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spool::store() Sympa::Message <ucam-language-sciences@lists.cam.ac.uk_9420992658abbf17fc12def51d885605.distribute/shelved:arc_cv=pass;dkim_sign> is stored into Sympa::Spool::Digest <ucam-language-sciences@lists.cam.ac.uk> as <1725271590.1725271590.693644,1946661,5903>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spindle::ToList::_send_msg() No non VERP subscribers left to distribute message to list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spindle::ToList::_send_msg() No non VERP subscribers left to distribute message to list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spool::Outgoing::store() Message Sympa::Message <ucam-language-sciences@lists.cam.ac.uk_9420992658abbf17fc12def51d885605.distribute/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect> is stored into bulk spool as <5.5.1725271590.1725271590.704931.ucam-language-sciences@lists.cam.ac.uk_z,1946661,7916>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spool::Outgoing::store() Message Sympa::Message <ucam-language-sciences@lists.cam.ac.uk_9420992658abbf17fc12def51d885605.distribute/shelved:arc_cv=pass;decorate;dkim_sign;dmarc_protect> is stored into bulk spool as <5.5.1725271590.1725271590.714288.ucam-language-sciences@lists.cam.ac.uk_0,1946661,6450>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: notice Sympa::Spindle::ToList::_send_msg() No non VERP subscribers left to distribute message to list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
Sep 2 11:06:30 lists-1 sympa_msg[1946661]: info Sympa::Spindle::ToList::_twist() Message Sympa::Message <ucam-language-sciences@lists.cam.ac.uk_9420992658abbf17fc12def51d885605.distribute/shelved:arc_cv=pass;dkim_sign> for Sympa::List <ucam-language-sciences@lists.cam.ac.uk> from XXX@cam.ac.uk accepted (0.04 seconds, 14 sessions, 561 subscribers), message ID=LO4P265MB59320B6DB9284366BCD7DF3DF7922@LO4P265MB5932.GBRP265.PROD.OUTLOOK.COM, size=18119
compared to the one that failed, which has a repeated do_distribute action, and two different errrors:
Sep 2 09:29:31 lists-1 wwsympa[3096790]: info main::do_login() [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [list ucam-language-sciences]
Sep 2 09:29:36 lists-1 wwsympa[3097526]: info main::do_distribute(e9f3e269192343c654798f6681a91f37) [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 09:29:36 lists-1 wwsympa[3097526]: info main::do_modindex() [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 09:29:40 lists-1 wwsympa[3097453]: info main::do_distribute(e9f3e269192343c654798f6681a91f37) [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 09:29:40 lists-1 sympa_msg[1946661]: notice Sympa::Request::Message::__parse() Parsing: QUIET DISTRIBUTE ucam-language-sciences e9f3e269192343c654798f6681a91f37
Sep 2 09:29:40 lists-1 sympa_msg[1946661]: notice Sympa::Spindle::ProcessMessage::_twist() Processing Sympa::Request <action=distribute;context=ucam-language-sciences@lists.cam.ac.uk>
Sep 2 09:29:40 lists-1 sympa_msg[1946661]: err main::#247 > Sympa::Spindle::spin#95 > Sympa::Spindle::DoCommand::_twist#120 > Sympa::Spindle::spin#95 > Sympa::Request::Handler::distribute::_twist#61 Unable to find message with key <e9f3e269192343c654798f6681a91f37> for list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
Sep 2 09:29:40 lists-1 wwsympa[3097453]: err main::#1562 > main::do_distribute#7740 [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences] Unable to find message with <e9f3e269192343c654798f6681a91f37> for list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
Sep 2 09:29:40 lists-1 wwsympa[3097453]: info main::do_modindex() [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
"Unable to find message with <..." is the normal message for repeated do_distribute actions, leading to the error screen that we were both expecting to see.
"Unable to find message with key <..." (note "with key") is the only instance of that error in the last 30 days of logs.
The key corresponds to the file in /var/spool/sympa/moderation/
-rw-r----- 1 sympa sympa 17859 Sep 2 09:29 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
Okay, the Sympa::Spool stuff is causing my brain to leak out of my ears,.
I do now understand that wwsympa do_distribute() renames the file in the moderation spool to add a ".distribute" suffix, which should stop repeat distribute actions on the same key. So that is where that file came from. It then injects "QUIET DISTRIBUTE ucam-language-sciences e9f3e269192343c654798f6681a91f37" into the sympa incoming mail queue to be processed by the sympa_msg daemon.
I don't understand why lib/Sympa/Request/Handler/distribute.pm
my $spindle = Sympa::Spindle::ProcessModeration->new(
distributed_by => $sender,
context => $robot,
authkey => $key,
quiet => $request->{quiet}
);
unless ($spindle and $spindle->spin) { # No message.
$log->syslog('err',
'Unable to find message with key <%s> for list %s',
$key, $list);
failed to find the following file in /var/spool/sympa/moderation:
-rw-r----- 1 sympa sympa 17859 2024-09-02 09:29:08.000000000 +0100 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
with the following error:
Sep 2 09:29:40 lists-1 sympa_msg[1946661]: err main::#247 > Sympa::Spindle::spin#95 > Sympa::Spindle::DoCommand::_twist#120 > Sympa::Spindle::spin#95 > Sympa::Request::Handler::distribute::_twist#61 Unable to find message with key <e9f3e269192343c654798f6681a91f37> for list Sympa::List <ucam-language-sciences@lists.cam.ac.uk>
given lib/Sympa/Spool/Moderation.pm
use constant _marshal_format => '%s@%s_%s%s';
use constant _marshal_keys => [qw(localpart domainpart AUTHKEY validated)];
localpart, domainpart, AUTHKEY all seem to match, and perldoc on that module tells me:
=item {validated}
Keeps a string representing extension, if message has been renamed using
remove() with option.
We also have the following, which seems to match:
use constant _marshal_regexp =>
qr{\A([^\s\@]+)\@([-.\w]+)_([\da-f]+)(.distribute)?\z};
Version
6.2.72
Installation method
My own rpm, derived from "official" rpm for RHEL 9.
Expected behavior
Moderation messages contain some some text of the form:
If someone clicks on that link twice in rapid succession, I would expect the first click to distribute the message, and the second click to generate an error page.
Actual behavior
The user reported the error page, but the message was not distributed.
I did find the following in /var/spool/sympa/moderation/
-rw-r----- 1 sympa sympa 17859 Sep 2 09:29 ucam-language-sciences@lists.cam.ac.uk_e9f3e269192343c654798f6681a91f37.distribute
I don't know what these ".distribute" files are used for, but this is the only file with that suffix in /var/spool/sympa/moderation on a busy server, and it is still there 7 hours later. So I think and hope that this might be a smoking gun for whatever went wrong.
Additional information
Here are the two incoming do_distribute requests for e9f3e269192343c654798f6681a91f37
Sep 2 09:29:36 lists-1 wwsympa[3097526]: info main::do_distribute(e9f3e269192343c654798f6681a91f37) [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
Sep 2 09:29:40 lists-1 wwsympa[3097453]: info main::do_distribute(e9f3e269192343c654798f6681a91f37) [robot lists.cam.ac.uk] [session 95252205934297] [client 10.249.206.106] [user XXX@cam.ac.uk] [list ucam-language-sciences]
sympa_msg generated the following error at 09:29:40
Sep 2 09:29:40 lists-1 sympa_msg[1946661]: err main::#247 > Sympa::Spindle::spin#95 > Sympa::Spindle::DoCommand::_twist#120 > Sympa::Spindle::spin#95 > Sympa::Request::Handler::distribute::_twist#61 Unable to find message with key for list Sympa::List ucam-language-sciences@lists.cam.ac.uk
That is the only time that "Unable to find message with key" appears in my logs in the last 28 days. (I have lots of "Unable to find message with" from wwsympa, but nothing similar from sympa_msg).
This feels very much like a race condition when wwsympa and sympa_msg are trying to work on the same moderation files at the same time. I hope that the single "Unable to find message with key" error combined with the single distribute file in /var/spool/sympa/moderation/ will be enough to narrow down a cause.
I will attach the full logs as a text attachment to this ticket
sympa.txt