recursecenter / pairing-bot

A Zulip bot that partners people for pair programming practice
MIT License
20 stars 13 forks source link

Missed pairing opportunities on a weekly basis #55

Closed cceckman closed 4 weeks ago

cceckman commented 1 month ago

@iafisher reports:

I'm signed up for pairing Tuesday through Friday, but the last couple of weeks Pairing Bot has skipped Tuesday and Thursday without any message.

A status message from Ian shows T/W/R/F pairing schedule.

@stillgreenmoss notes that we may need to instrument for more information (i.e. log more) to effectively debug this.

cceckman commented 1 month ago

My thoughts on "how to investigate", having done no actual investigation yet:


A little investigation: in http://console.cloud.google.com/logs/query,

resource.type="gae_app"
protoPayload.userAgent="AppEngine-Google; (+http://code.google.com/appengine)"

makes a good starting point for finding the /match invocation.

cceckman commented 1 month ago

A red herring path I went down: timezones.

The crontab says to perform matches at 4:00. Docs say if unspecified, the timezone is UTC.

In the winter, 4:00 UTC is 11:00 EST (EST is UTC - 5:00). In the summer, 4:00 UTC is 00:00 EDT (EDT is UTC - 4:00). So in the winter the pairing bot runs "the day before", computing "tomorrow"'s matches; in the summer, it runs "the same day".

But ListPairingTomorrow does the right thing: it uses GAE's time zone, UTC, and picks up the same day. That's correct regardless of what Eastern time is doing.

cceckman commented 1 month ago

OK, I see something odd in the logs, at least.

A lot of the pairings generated wind up with a BAD_REQUEST message attached. In this case, there's two adjacent messages:

2024/07/30 04:00:02 {"result":"error","msg":"Invalid email '[redacted A]'","code":"BAD_REQUEST"}
2024/07/30 04:00:02 [redacted A] was matched with [redacted B]

"Invalid email" is one of the messages, for several e-mail addresses. There's also one report of [email] is no longer using Zulip.


It looks like this error isn't showing up in the match function; the prefix "Error when trying to send..." isn't showing up in the logs.

sendUserMessage is logging the response body with no additional annotation; but it looks like nothing is classifying that Zulip response with an error. :(

At least for today, I think this is a sufficient explanation for why Ian didn't get a pairing. The bot sent a message to multiple recipients, but one of those recipients was invalid; the message as a whole didn't send. (I haven't check to see if this was also the problem on other days.)


I suspect this could be responsible for a few different dropped pairings.

Some changes, maybe in priority order:

cceckman commented 1 month ago

Zulip documentation confirms that errors will return 4xx/5xx errors, as well as error information in the body.

Neither the "Invalid e-mail" nor the "...is no longer using Zulip" messages have the relevant e-mail as structured data. The API docs have a "tip" saying to start a conversation on the Zulip Zulip instance as to whether there are specific key/value pairs that would be relevant.

cceckman commented 1 month ago

@stillgreenmoss spotted the underlying issue:

So a "real fix" here:

cceckman commented 1 month ago

Current status: I think testbot is using Zulip IDs instead of e-mail for all identity purposes. Including talking to the RC API:

User: [redacted] was at RC last week: false and is at RC this week: true

The one remaining use of e-mail is for review identity. I think that's OK to leave for now; I'll file a follow-up "enhancement".

cceckman commented 1 month ago

OK! #62 is in and deployed to prod Pairing Bot. I tagged the original commits in issue55 if anyone wants to see the history in more detail.

I'm going to leave this open until we've seen all the crons run and take action:

which we'll see trickle in over the next couple weeks.

jdkaplan commented 1 month ago

Potentially related: #70

In cases where someone had chosen to skip their next pairing (by sending a skip command to PB), the matching job would panic when "unskipping" them for the next day. That would prevent the matching job from sending out pairs.


So far, I've seen daily matching, checkin, and end-of-batch succeed (after some fixes to each), so I'll check those boxes now.

jdkaplan commented 4 weeks ago

Here to confirm the last cron job run! The welcome message was sent out a few minutes ago. Logs (maintainers-only link), Zulip message

I haven't seen any more reports of pairing issues in the last week, so I think the Zulip ID and unskip fixes were all we needed. That said, I wasn't watching the logs particularly closely, so (anyone) please feel free to open a new issue if this happens again!