Missed pairing opportunities on a weekly basis

cceckman commented 1 month ago

@iafisher reports:

I'm signed up for pairing Tuesday through Friday, but the last couple of weeks Pairing Bot has skipped Tuesday and Thursday without any message.

A status message from Ian shows T/W/R/F pairing schedule.

@stillgreenmoss notes that we may need to instrument for more information (i.e. log more) to effectively debug this.

cceckman commented 1 month ago

My thoughts on "how to investigate", having done no actual investigation yet:

Where is the intent to match Ian getting dropped?
- Is there a way to run the pairing algorithm locally, as f(real preferences, simulated date), and see what pairings it spits out? What messages?
- What happens if a message doesn't get sent? Is there a (noisy) error? What if a message does get sent -- do we have a confirmation?

A little investigation: in http://console.cloud.google.com/logs/query,

resource.type="gae_app"
protoPayload.userAgent="AppEngine-Google; (+http://code.google.com/appengine)"

makes a good starting point for finding the /match invocation.

cceckman commented 1 month ago

A red herring path I went down: timezones.

The crontab says to perform matches at 4:00. Docs say if unspecified, the timezone is UTC.

In the winter, 4:00 UTC is 11:00 EST (EST is UTC - 5:00). In the summer, 4:00 UTC is 00:00 EDT (EDT is UTC - 4:00). So in the winter the pairing bot runs "the day before", computing "tomorrow"'s matches; in the summer, it runs "the same day".

But ListPairingTomorrow does the right thing: it uses GAE's time zone, UTC, and picks up the same day. That's correct regardless of what Eastern time is doing.

cceckman commented 1 month ago

OK, I see something odd in the logs, at least.

A lot of the pairings generated wind up with a BAD_REQUEST message attached. In this case, there's two adjacent messages:

2024/07/30 04:00:02 {"result":"error","msg":"Invalid email '[redacted A]'","code":"BAD_REQUEST"}
2024/07/30 04:00:02 [redacted A] was matched with [redacted B]

"Invalid email" is one of the messages, for several e-mail addresses. There's also one report of [email] is no longer using Zulip.

It looks like this error isn't showing up in the match function; the prefix "Error when trying to send..." isn't showing up in the logs.

sendUserMessage is logging the response body with no additional annotation; but it looks like nothing is classifying that Zulip response with an error. :(

At least for today, I think this is a sufficient explanation for why Ian didn't get a pairing. The bot sent a message to multiple recipients, but one of those recipients was invalid; the message as a whole didn't send. (I haven't check to see if this was also the problem on other days.)

I suspect this could be responsible for a few different dropped pairings.

Some changes, maybe in priority order:

[x] Configure the Golang logger to log the file name and file number of the log statement. This is Immensely Useful for debugging -- it doesn't quite turn the logs into a trace, but it makes it a lot easier to line things up.
[x] A non-2xx HTTP response doesn't cause an error; we should at least check the HTTP response code for the Zulip responses.
[x] Check the Zulip API, and if needed, perform more-complete error checking for Zulip replies. Do we expect Zulip to return a non-2xx error for these (BAD_REQUEST)? Or do we need to check the body?
[x] Alerting! Proactive telemetry, not reactive log-looking; if the bot fails, or fails repeatedly over some amount of time, there should be something that says "hey go look at the logs pls?".
[x] Data cleanup: it looks like there are some users that Zulip considers invalid. Why? How can/should we know, so we can remove them from the pairing set?

cceckman commented 1 month ago

Zulip documentation confirms that errors will return 4xx/5xx errors, as well as error information in the body.

Neither the "Invalid e-mail" nor the "...is no longer using Zulip" messages have the relevant e-mail as structured data. The API docs have a "tip" saying to start a conversation on the Zulip Zulip instance as to whether there are specific key/value pairs that would be relevant.

[x] Consider following up with Zulip: can the invalid e-mail(s) be returned as a structured error response?

This would be one way (not the only way) for us to do more graceful error handling; we could punt the user that is valid down the list and try to pair them again.
[x] An alternative: check each user before pairing them / filter down the list of pairing candidates to "current active users", based on querying Zulip.

(It is possible, but unlikely, that the user's e-mail could change between the check query and the send-message.)
[x] Do we need to use e-mails at all, or could we sent to Zulip users by-ID? The above points still apply - Zulip users could disappear - but it seems more stable.

It looks like the RC API includes the Zulip ID...

cceckman commented 1 month ago

@stillgreenmoss spotted the underlying issue:

Pairing Bot uses the Zulip ID as the lookup key.
...but conducts pairing, "left batch", messages, etc. based on the e-mail address.
The e-mail address is mutable on the Zulip side without syncing to RC (?), and is only updated in the pairing DB when the user interacts.

So a "real fix" here:

[x] Use the Zulip ID consistently: as the pairing DB key, as the message key, and as the key to join those databases with the RC API.

cceckman commented 1 month ago

Current status: I think testbot is using Zulip IDs instead of e-mail for all identity purposes. Including talking to the RC API:

User: [redacted] was at RC last week: false and is at RC this week: true

The one remaining use of e-mail is for review identity. I think that's OK to leave for now; I'll file a follow-up "enhancement".

cceckman commented 1 month ago

OK! #62 is in and deployed to prod Pairing Bot. I tagged the original commits in issue55 if anyone wants to see the history in more detail.

I'm going to leave this open until we've seen all the crons run and take action:

[x] Daily match-making
[x] End-of-batch: should run this Saturday, and @jdkaplan should get a message
[x] Status report: Thursday (2024-08-08)
[x] Welcome: second Tuesday of the next batch (2024-08-20)

which we'll see trickle in over the next couple weeks.

jdkaplan commented 1 month ago

Potentially related: #70

In cases where someone had chosen to skip their next pairing (by sending a skip command to PB), the matching job would panic when "unskipping" them for the next day. That would prevent the matching job from sending out pairs.

So far, I've seen daily matching, checkin, and end-of-batch succeed (after some fixes to each), so I'll check those boxes now.

jdkaplan commented 4 weeks ago

Here to confirm the last cron job run! The welcome message was sent out a few minutes ago. Logs (maintainers-only link), Zulip message

I haven't seen any more reports of pairing issues in the last week, so I think the Zulip ID and unskip fixes were all we needed. That said, I wasn't watching the logs particularly closely, so (anyone) please feel free to open a new issue if this happens again!

recursecenter / pairing-bot

Missed pairing opportunities on a weekly basis #55