Ner0nzz closed 1 year ago
We do get intermittent failures when accessing the Pretix API, which causes whichever sync task is running at the time to fail. Since we run several thousand sync jobs per day, the only thing we can do is to log them. I've talked to @ichub about the possibility of auto-retrying failed API requests, so that intermittent failures might be less likely to disrupt the entire sync job.
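A minimal sketch of what auto-retrying a failed request could look like, assuming the request is wrapped in an async function. The names `withRetry` and `RetryOptions` are hypothetical and not taken from the codebase; the backoff policy (doubling the delay each attempt) is one possible choice, not a description of what was implemented.

```typescript
// Hypothetical helper: names and policy are illustrative assumptions.
interface RetryOptions {
  attempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt; doubles afterwards
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown = new Error("withRetry: attempts must be at least 1");
  for (let attempt = 0; attempt < options.attempts; attempt++) {
    try {
      // Success on any attempt ends the loop immediately.
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < options.attempts - 1) {
        await sleep(options.baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A sync task could then wrap each API call, e.g. `await withRetry(() => fetchOrders(), { attempts: 3, baseDelayMs: 500 })`, where `fetchOrders` stands in for whatever function performs the request. Retrying is only safe for requests that can be repeated without side effects, so pushes that change check-in state would need more care than reads.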
@robknight - how certain are you that this error is an intermittent failure vs. one that is caused by a real bug?
another consideration that I was wondering about:
for some organizers, we have a Pretix Sync running on multiple machines - at the very least both in production and staging. Before, when the sync was one way, this was fine. Now that we have a two-way sync, are there any scenarios where this may cause problems? Eg. if I check in on staging, would that get synced up all the way to production? And then if I un-check-in on Pretix, would that get synced down properly to both staging and production?
how certain are you that this error is an intermittent failure vs. one that is caused by a real bug?
@ichub Not at all (though I think it's likely - there are basically only two exceptions you can get here: either the request to Pretix failed at the network level, or Pretix returned a non-2xx response). To figure it out, we'd have to correlate this error with another one thrown at the same time, which contains the actual underlying error. Probably I need to tidy up what actually goes to Rollbar, because I don't seem to be able to get very much useful information out of the report. Possibly we need some kind of correlation code to tie together errors that are part of the same "transaction".
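One way the correlation code idea could work, as a rough sketch: each sync run gets an ID, and every error reported during that run carries it, so the low-level failure and the high-level "sync failed" report can be matched up afterwards. All names here (`ErrorReport`, `newCorrelationId`, `createReporter`) are hypothetical, and the `sink` callback is a stand-in for whatever actually sends reports to the error tracker; no real Rollbar API is shown.

```typescript
// Hypothetical shapes and names, for illustration only.
interface ErrorReport {
  correlationId: string; // shared by all reports from one sync run
  stage: string; // where in the run the error was observed
  message: string;
}

let correlationCounter = 0;

// Builds an ID that is unique within this process: timestamp plus a counter.
function newCorrelationId(): string {
  correlationCounter += 1;
  return Date.now().toString(36) + "-" + correlationCounter.toString(36);
}

// Returns a report function bound to a single correlation ID.
function createReporter(
  correlationId: string,
  sink: (report: ErrorReport) => void
): (stage: string, error: unknown) => void {
  return (stage, error) => {
    const message = error instanceof Error ? error.message : String(error);
    sink({ correlationId, stage, message });
  };
}
```

A sync run would create one reporter at the start and pass it down, so that a report from the request layer (say, stage `"pretix-request"`) and the one from the job wrapper (stage `"sync-job"`) share the same `correlationId` and can be searched for together.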
Eg. if I check in on staging, would that get synced up all the way to production? And then if I un-check-in on Pretix, would that get synced down properly to both staging and production?
We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.
We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.
do you expect that to cause problems?
Probably I need to tidy up what actually goes to Rollbar, because I don't seem to be able to get very much useful information out of the report. Possibly we need some kind of correlation code to tie together errors that are part of the same "transaction".
This would be great, I'm going to create an issue for it.
We're using the same Pretix instance for staging and production? If so, then yes, this could definitely happen.
do you expect that to cause problems?
Technically, no. When either site pulls a check-in update from Pretix, it doesn't matter if that change originated from something happening in Pretix, or something happening via an API call from another environment. Same for pushing check-ins.
However, it might be confusing for users if we do something in staging which causes a check-in/check-out to a ticket that a real production user is relying on. I can't see how we'd avoid this without either having a separate staging Pretix, or by disabling check-in state pushes in staging.
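The second option, disabling check-in state pushes in staging, could be sketched as a small guard consulted before any push. The `SyncConfig` shape, the `pushCheckinsOverride` field, and `shouldPushCheckins` are assumptions made for illustration and do not reflect the actual configuration of the project.

```typescript
// Hypothetical config shape, for illustration only.
type Environment = "production" | "staging" | "development";

interface SyncConfig {
  environment: Environment;
  // Optional explicit switch, e.g. for a staging setup with its own Pretix.
  pushCheckinsOverride?: boolean;
}

// Decides whether this environment may push check-in state to Pretix.
function shouldPushCheckins(config: SyncConfig): boolean {
  if (config.pushCheckinsOverride !== undefined) {
    return config.pushCheckinsOverride;
  }
  // Default: only production writes check-in state back.
  return config.environment === "production";
}
```

With this default, staging would still pull check-in updates but never write them, which avoids touching tickets that production users rely on. The override leaves room for a staging deployment that points at a separate Pretix instance.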
it seems that the failures were caused by a database "error: connection reset" error. Closing, as that is not a bug with the check-in code, but rather a robustness issue with our SQL library. #706 is related.
https://app.rollbar.com/a/0xparc/fix/item/pcd-passport/398
@robknight I believe this is a feature you worked on and I'm seeing that it has experienced several errors in production.