proofcarryingdata / zupass

Zuzalu Passport
GNU General Public License v3.0

[rollbar] Checkin sync failed #520

Closed Ner0nzz closed 1 year ago

Ner0nzz commented 1 year ago

https://app.rollbar.com/a/0xparc/fix/item/pcd-passport/398

@robknight I believe this is a feature you worked on and I'm seeing that it has experienced several errors in production.

robknight commented 1 year ago

We do get intermittent failures when accessing the Pretix API, which causes whichever sync task is running at the time to fail. Since we run several thousand sync jobs per day, the only thing we can do is to log them. I've talked to @ichub about the possibility of auto-retrying failed API requests, so that intermittent failures might be less likely to disrupt the entire sync job.
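
For illustration, the auto-retry idea mentioned above could look roughly like the sketch below. It is not taken from the Zupass codebase: `withRetry`, `backoffDelayMs`, and the option names are hypothetical, and the exponential backoff policy is an assumption rather than something agreed in this thread.

```typescript
// Hypothetical helpers; nothing here is an existing Zupass or Pretix API.

/** Delay before the retry that follows failed attempt `attempt` (1-based), capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number;
  maxDelayMs: number;
}

/** Runs `operation`, retrying on any thrown error until maxAttempts is reached. */
async function withRetry<T>(
  operation: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const attempts = Math.max(1, opts.maxAttempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (e) {
      lastError = e;
      if (attempt < attempts) {
        await sleep(backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs));
      }
    }
  }
  // Every attempt failed: surface the last error so it can still be logged.
  throw lastError;
}
```

A sync task would wrap each individual Pretix request in `withRetry` rather than the whole job, so that one transient failure does not discard the work already done. Retrying requests that change state (such as pushing a check-in) is only safe if the request is idempotent.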

ichub commented 1 year ago

@robknight - how certain are you that this error is an intermittent failure vs. one that is caused by a real bug?

another consideration that I was wondering about:

for some organizers, we have a Pretix Sync running on multiple machines - at the very least both in production and staging. Before, when the sync was one way, this was fine. Now that we have a two-way sync, are there any scenarios where this may cause problems? Eg. if I check in on staging, would that get synced up all the way to production? And then if I un-check-in on Pretix, would that get synced down properly to both staging and production?

robknight commented 1 year ago

> how certain are you that this error is an intermittent failure vs. one that is caused by a real bug?

@ichub Not at all (though I think it's likely - there are basically only two exceptions you can get here: either the request to Pretix failed at the network level, or Pretix returned a non-2xx response). To figure it out, we'd have to correlate this error with another one thrown at the same time, which contains the actual underlying error. Probably I need to tidy up what actually goes to Rollbar, because I don't seem to be able to get very much useful information out of the report. Possibly we need some kind of correlation code to tie together errors that are part of the same "transaction".
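
A minimal sketch of the correlation-code idea, assuming one ID is generated per sync run and attached to every error report produced during that run. All names here (`PretixFailure`, `ErrorReport`, `newCorrelationId`, `reportFailure`) are hypothetical; the sketch does not call Rollbar and only shows the shape of the data that would be sent.

```typescript
// Hypothetical types and helpers; not Zupass or Rollbar APIs.

// The two failure modes described above.
type PretixFailure =
  | { kind: "network"; cause: string } // the request never got a response
  | { kind: "http"; status: number }; // Pretix answered with a non-2xx status

interface ErrorReport {
  correlationId: string; // shared by all reports from one sync run
  stage: string; // where in the sync the failure was observed
  detail: string;
}

/** Generates an ID that is unique enough to group reports from one sync run. */
function newCorrelationId(): string {
  const time = Date.now().toString(36);
  const random = Math.random().toString(36).slice(2, 10);
  return `${time}-${random}`;
}

/** Builds the report payload; the caller would forward it to the error tracker. */
function reportFailure(
  correlationId: string,
  stage: string,
  failure: PretixFailure
): ErrorReport {
  const detail =
    failure.kind === "network"
      ? `network failure: ${failure.cause}`
      : `Pretix returned HTTP ${failure.status}`;
  return { correlationId, stage, detail };
}
```

With this shape, the low-level request error and the high-level "Checkin sync failed" error carry the same `correlationId`, so searching for that ID in the error tracker would show both together.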

> Eg. if I check in on staging, would that get synced up all the way to production? And then if I un-check-in on Pretix, would that get synced down properly to both staging and production?

We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.

ichub commented 1 year ago

> We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.

do you expect that to cause problems?

ichub commented 1 year ago

> Probably I need to tidy up what actually goes to Rollbar, because I don't seem to be able to get very much useful information out of the report. Possibly we need some kind of correlation code to tie together errors that are part of the same "transaction".

This would be great, I'm going to create an issue for it.

robknight commented 1 year ago

> > We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.
>
> do you expect that to cause problems?

Technically, no. When either site pulls a check-in update from Pretix, it doesn't matter if that change originated from something happening in Pretix, or something happening via an API call from another environment. Same for pushing check-ins.

However, it might be confusing for users if we do something in staging which causes a check-in/check-out to a ticket that a real production user is relying on. I can't see how we'd avoid this without either having a separate staging Pretix, or by disabling check-in state pushes in staging.
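
The second option (disabling check-in state pushes in staging) could be sketched as a per-environment flag, as below. The names `SyncConfig`, `pushCheckinsEnabled`, `defaultSyncConfig`, and `changesToPush` are hypothetical, and defaulting pushes to production only is an assumption for illustration, not a decision made in this thread.

```typescript
// Hypothetical configuration sketch; not the actual Zupass sync code.

type Environment = "production" | "staging" | "development";

interface SyncConfig {
  environment: Environment;
  pushCheckinsEnabled: boolean; // pulls from Pretix are unaffected by this flag
}

interface CheckinChange {
  ticketId: string;
  checkedIn: boolean;
}

/** Assumed default: only production is allowed to push check-in state to Pretix. */
function defaultSyncConfig(environment: Environment): SyncConfig {
  return { environment, pushCheckinsEnabled: environment === "production" };
}

/** Returns the changes that should actually be sent to Pretix under this config. */
function changesToPush(
  config: SyncConfig,
  pending: CheckinChange[]
): CheckinChange[] {
  return config.pushCheckinsEnabled ? pending : [];
}
```

Under this sketch, staging would still pull check-in state from the shared Pretix instance, but a check-in performed on staging would stay local and never reach a ticket that a production user relies on.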

ichub commented 1 year ago

It seems that the failures were caused by a database error (a connection reset). Closing, as that is not a bug in the check-in code but rather a robustness issue with our SQL library. #706 is related.