Closed cceckman closed 4 weeks ago
My thoughts on "how to investigate", having done no actual investigation yet:
f(real preferences, simulated date)
, and see what pairings it spits out? What messages?A little investigation: in http://console.cloud.google.com/logs/query,
resource.type="gae_app"
protoPayload.userAgent="AppEngine-Google; (+http://code.google.com/appengine)"
makes a good starting point for finding the /match
invocation.
A red herring path I went down: timezones.
The crontab says to perform matches at 4:00. Docs say if unspecified, the timezone is UTC.
In the winter, 4:00 UTC is 11:00 EST (EST is UTC - 5:00). In the summer, 4:00 UTC is 00:00 EDT (EDT is UTC - 4:00). So in the winter the pairing bot runs "the day before", computing "tomorrow"'s matches; in the summer, it runs "the same day".
But ListPairingTomorrow does the right thing: it uses GAE's time zone, UTC, and picks up the same day. That's correct regardless of what Eastern time is doing.
OK, I see something odd in the logs, at least.
A lot of the pairings generated wind up with a BAD_REQUEST
message attached. In this case, there's two adjacent messages:
2024/07/30 04:00:02 {"result":"error","msg":"Invalid email '[redacted A]'","code":"BAD_REQUEST"}
2024/07/30 04:00:02 [redacted A] was matched with [redacted B]
"Invalid email" is one of the messages, for several e-mail addresses. There's also one report of [email] is no longer using Zulip.
It looks like this error isn't showing up in the match function; the prefix "Error when trying to send..." isn't showing up in the logs.
sendUserMessage
is logging the response body with no additional annotation; but it looks like nothing is classifying that Zulip response with an error. :(
At least for today, I think this is a sufficient explanation for why Ian didn't get a pairing. The bot sent a message to multiple recipients, but one of those recipients was invalid; the message as a whole didn't send. (I haven't check to see if this was also the problem on other days.)
I suspect this could be responsible for a few different dropped pairings.
Some changes, maybe in priority order:
[x] Configure the Golang logger to log the file name and file number of the log statement. This is Immensely Useful for debugging -- it doesn't quite turn the logs into a trace, but it makes it a lot easier to line things up.
[x] A non-2xx HTTP response doesn't cause an error; we should at least check the HTTP response code for the Zulip responses.
[x] Check the Zulip API, and if needed, perform more-complete error checking for Zulip replies. Do we expect Zulip to return a non-2xx error for these (BAD_REQUEST
)? Or do we need to check the body?
[x] Alerting! Proactive telemetry, not reactive log-looking; if the bot fails, or fails repeatedly over some amount of time, there should be something that says "hey go look at the logs pls?".
[x] Data cleanup: it looks like there are some users that Zulip considers invalid. Why? How can/should we know, so we can remove them from the pairing set?
Zulip documentation confirms that errors will return 4xx/5xx errors, as well as error information in the body.
Neither the "Invalid e-mail" nor the "...is no longer using Zulip" messages have the relevant e-mail as structured data. The API docs have a "tip" saying to start a conversation on the Zulip Zulip instance as to whether there are specific key/value pairs that would be relevant.
[x] Consider following up with Zulip: can the invalid e-mail(s) be returned as a structured error response?
This would be one way (not the only way) for us to do more graceful error handling; we could punt the user that is valid down the list and try to pair them again.
[x] An alternative: check each user before pairing them / filter down the list of pairing candidates to "current active users", based on querying Zulip.
(It is possible, but unlikely, that the user's e-mail could change between the check query and the send-message.)
[x] Do we need to use e-mails at all, or could we sent to Zulip users by-ID? The above points still apply - Zulip users could disappear - but it seems more stable.
It looks like the RC API includes the Zulip ID...
@stillgreenmoss spotted the underlying issue:
So a "real fix" here:
Current status: I think testbot is using Zulip IDs instead of e-mail for all identity purposes. Including talking to the RC API:
User: [redacted] was at RC last week: false and is at RC this week: true
The one remaining use of e-mail is for review identity. I think that's OK to leave for now; I'll file a follow-up "enhancement".
OK! #62 is in and deployed to prod Pairing Bot. I tagged the original commits in issue55 if anyone wants to see the history in more detail.
I'm going to leave this open until we've seen all the crons run and take action:
which we'll see trickle in over the next couple weeks.
Potentially related: #70
In cases where someone had chosen to skip their next pairing (by sending a skip
command to PB), the matching job would panic when "unskipping" them for the next day. That would prevent the matching job from sending out pairs.
So far, I've seen daily matching, checkin, and end-of-batch succeed (after some fixes to each), so I'll check those boxes now.
Here to confirm the last cron job run! The welcome message was sent out a few minutes ago. Logs (maintainers-only link), Zulip message
I haven't seen any more reports of pairing issues in the last week, so I think the Zulip ID and unskip fixes were all we needed. That said, I wasn't watching the logs particularly closely, so (anyone) please feel free to open a new issue if this happens again!
@iafisher reports:
A
status
message from Ian shows T/W/R/F pairing schedule.@stillgreenmoss notes that we may need to instrument for more information (i.e. log more) to effectively debug this.