Closed: willkg closed this issue 4 years ago.
I've seen a couple of these happen in prod, but they're more like once a month. When they happen, we see it as an `InternalError` rather than an `OperationalError`. I think that's because we've updated dependencies and the exception changed.
In any case, all evidence suggests this happens in prod too, but at a much reduced frequency: prod data is more stable, so it's less likely that multiple tasks operating on the same data will hit locking issues.
We're going to push off looking into this further until after the prod migration.
These are more frequent than in production, so it appears to be related to the new stack. However, many things changed at once, so it is hard to pin down the cause.
I spent some time looking at the `BINARY` fields, like `wifi_shard_e.mac`, used as primary keys. A change in PyMySQL required an extra option, `binary_prefix`, to continue prefixing the values with `_binary` in the raw SQL. It is weird to work with binary fields in interactive sessions, but I believe we are using these correctly in the code, and there isn't an escaping bug in SQLAlchemy.
I investigated a deadlock with `show engine innodb status \G`. Here's some of the data:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2019-12-03 20:26:46 0x2b5fdf082700
*** (1) TRANSACTION:
TRANSACTION 26842249, ACTIVE 13 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 146 lock struct(s), heap size 24784, 238 row lock(s), undo log entries 240
MySQL thread id 159, OS thread handle 47690764539648, query id 17922983 172.31.46.156 location-rw updating
UPDATE wifi_shard_5 SET max_lat=<omitted>... WHERE wifi_shard_5.mac = _binary'??]???'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 231 page no 8162 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842249 lock_mode X locks rec but not gap waiting
Record lock, heap no 103 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** (2) TRANSACTION:
TRANSACTION 26842250, ACTIVE 13 sec starting index read
mysql tables in use 1, locked 1
213 lock struct(s), heap size 24784, 335 row lock(s), undo log entries 338
MySQL thread id 153, OS thread handle 47690763740928, query id 17924237 172.31.46.156 location-rw updating
UPDATE wifi_shard_5 SET max_lat=<omitted> ... WHERE wifi_shard_5.mac = _binary'?:^?-'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 231 page no 8162 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842250 lock_mode X locks rec but not gap
Record lock, heap no 103 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
Record lock, heap no 104 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 231 page no 7910 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842250 lock_mode X locks rec but not gap waiting
Record lock, heap no 7 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** WE ROLL BACK TRANSACTION (1)
These are challenging to read, but what I can read is:

- Each transaction holds a few hundred row locks, from the `INSERT` and `bulk_update_mappings` at the end of processing. This seems to be an expected number of locks for updates and insertions.
- The two transactions are updating different rows (different `WHERE` clauses), but each is waiting to get a lock on a row the other holds.

My gut says that MySQL 5.7 has changed how locks are granted, and that it is no longer a good idea to try to update several rows at a time using `bulk_update_mappings`, unless we can guarantee that only one update task runs at a time. However, I think we instead want many of these tasks to run in parallel, so we can scale out our async workers in response to the incoming data.
Instead, I think the station updating tasks should loop over the stations, running the equivalent of:
```
for each station:
    for a few attempts:
        try:
            START TRANSACTION;
            SELECT ... FOR UPDATE;
            UPDATE, INSERT, or do nothing;
            COMMIT;
        except Deadlock:
            attempt again, or raise if out of attempts
```
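A minimal Python sketch of that loop, with a stand-in `Deadlock` exception and a callable that runs one short transaction (the names here are hypothetical, not the project's actual code):

```python
import time


class Deadlock(Exception):
    """Stand-in for the driver's deadlock error (MySQL error 1213)."""


def update_station_with_retry(run_transaction, attempts=3, delay=1.0):
    # Each attempt runs one short transaction equivalent to:
    # START TRANSACTION; SELECT ... FOR UPDATE; UPDATE/INSERT; COMMIT.
    for attempt in range(attempts):
        try:
            return run_transaction()
        except Deadlock:
            if attempt == attempts - 1:
                raise  # out of luck
            time.sleep(delay)  # back off before retrying
```

Keeping each transaction short and taking the row lock up front (`SELECT ... FOR UPDATE`) means two tasks touching the same station serialize instead of deadlocking.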
Before starting to change code, I want to investigate some other things first:
I suspect the deadlocks are also the root cause of #990, but it will be easier to avoid and work around deadlocks than to determine why a multi-row update failed to update all rows.
One more thing to investigate. As mentioned in https://github.com/mozilla/ichnaea/issues/990#issuecomment-550016444, there is some code to retry on some errors. The retryable errors include deadlocks, but the constants could change between MySQL versions, and the exception details could differ between Python 2.6 and 3.7.
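For illustration, a hedged sketch of such a check (the function name is hypothetical): PyMySQL error exceptions carry `(errno, message)` in `args`, and keying the check on the numeric MySQL error code avoids depending on the exception class:

```python
# MySQL error numbers: ER_LOCK_DEADLOCK = 1213, ER_LOCK_WAIT_TIMEOUT = 1205.
RETRYABLE_MYSQL_ERRORS = {1213, 1205}


def is_retryable(exc: BaseException) -> bool:
    # PyMySQL errors (InternalError in 0.7.x, OperationalError in 0.8+)
    # carry (errno, message) in exc.args, so checking the numeric code
    # survives the exception-class change between driver versions.
    args = getattr(exc, "args", ())
    return bool(args) and args[0] in RETRYABLE_MYSQL_ERRORS
```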
This last idea was fruitful. In PyMySQL 0.7.11, a deadlock was a `pymysql.err.InternalError`. In PyMySQL 0.8.0, it changed to `pymysql.err.OperationalError`:
https://github.com/PyMySQL/PyMySQL/blob/0f4d45e5a20b47959ba7d16f130cbc0c7ce8506c/CHANGELOG#L39
My guess is that the deadlock is just as frequent in the 1.5.x stack, but a retry handled most cases.
This should be a quick fix, but I'd like to add a metric to track deadlocks caught by this code.
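A sketch of what counting caught deadlocks might look like, assuming a statsd-style client with an `incr` method (all names here are hypothetical, not the project's actual code):

```python
# Map MySQL error numbers to readable metric tags: deadlock (1213)
# and lock wait timeout (1205) should be graphable separately.
ERROR_NAMES = {1213: "deadlock", 1205: "lock_timeout"}


def record_db_error(stats, exc):
    # Pull the errno out of the driver exception's (errno, message) args
    # and emit a counter tagged with a human-readable reason.
    errno = exc.args[0] if getattr(exc, "args", ()) else 0
    reason = ERROR_NAMES.get(errno, "other")
    stats.incr("data.station.dberror", tags=["reason:%s" % reason])
```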
You rock!
Just some random thoughts from my end:
New related issue from fix in stage:
`AttributeError: 'StaleDataError' object has no attribute 'orig'`
The handler needs to be smart enough to only handle wrapped database driver issues.
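One way to make the handler defensive (a sketch; the function name is hypothetical): SQLAlchemy wraps driver exceptions in `DBAPIError` with the original exception on `.orig`, while ORM-level errors like `StaleDataError` have no such attribute:

```python
def wrapped_driver_errno(exc):
    # SQLAlchemy's DBAPIError keeps the raw driver exception on .orig.
    # ORM errors such as StaleDataError never have .orig, so return None
    # for those instead of blowing up with AttributeError.
    orig = getattr(exc, "orig", None)
    if orig is None:
        return None
    args = getattr(orig, "args", ())
    return args[0] if args else None
```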
With these changes merged, stage has gone from 500-1000 uncaught deadlock errors per day to 0 for the station updating task. The new `data.station.dberror` metric is recording 3-7 deadlocks (MySQL error 1213) per hour and 0-3 lock wait timeouts (error 1205) per hour. It's not clear why the instance count would be 6x greater with the Sentry exception recorder versus the exception handler with a 1 second sleep, but I'll take a happy mystery.
The code hasn't been deployed to production yet. It does not appear to have an effect on the `StaleDataError` rate (issue #990).
There's some additional follow-on work that we could track in new bugs, or as part of #990, such as reducing the `StaleDataError` counts. However, my gut has been wrong a few times (see issue history), while @willkg identified the root cause a month ago.

For the db connection thing, I think writing a new fixture that has the properties you need for a rollback test sounds like the right thing to do, if that's possible.
If the retry decorator we have in Antenna and Socorro isn't the right shape, it's not hard to write and test retry things, so we can roll our own here that has the shape we need. I like having retry logic as a library function rather than intertwined with the retryable code, but maybe it's not helpful to do that here.
Trying a `select ... for update` sounds fine. If we increase the batch size, does that increase contention issues? Right now, if processing hits problems, I think it dumps the data. It'd be nice if it dumped less data.
A lot of the last round of work was on issue #990, which covered the `StaleDataError` issue. Early locking with `SELECT ... FOR UPDATE` reduces the situations where a deadlock can occur.
We now have a retry decorator, using `backoff.on_exception` as the engine, that is used for station (bluetooth, wifi, and cell), datamap, and cell area updating.
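To show the shape of that pattern, here is a toy stand-in for the decorator (the real code uses the `backoff` library's `on_exception`; this self-contained version with hypothetical names is just for illustration):

```python
import functools
import time


def on_exception(exc_type, max_tries=5, base_delay=0.1):
    # Toy version of the backoff.on_exception pattern: retry the wrapped
    # function on exc_type with exponentially growing delays, re-raising
    # once max_tries attempts are exhausted.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_tries - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

Keeping the retry logic in a decorator keeps the station, datamap, and cell area updaters free of inline retry loops.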
We're tuning the sample rates in production and stage, to ensure the processing pipelines can keep up and Redis isn't under pressure to eject data.
We're using 1% for high-volume keys in production, and seeing 5-20 retried lock contentions per hour. This data is processed in production and sent to stage.
In stage, 50% is enough to saturate the pipeline from work day traffic, and to generate lots of locking contention. The pipeline can keep up at 25%, with 1-10 lock contentions per hour. At 1%, errors go away entirely.
We can now monitor the metrics `data.station.dberror` and `datamaps.dberror` to see how many lock contentions there are. If these increase to the point where tasks start failing, we can increase the lock timeout, increase the retry count, or reduce the sample rates.
I think there is more that could be done to make processing more efficient, but the processing pipeline is stable, so I'm closing the issue.
We're testing a new infrastructure, 4 years of changes, and a Python 2 -> 3 migration, all on stage right now. One of the things we did was add a geosubmit export in prod that points to stage, so stage has a steady stream of incoming reports to process.
One of the errors we're hitting is this:
https://sentry.prod.mozaws.net/operations/location-stage/issues/6582659/
The issue is coming from this line:
https://github.com/mozilla/ichnaea/blob/4b7aa181ddd708c3f0ab9fa6031c49a707d07d69/ichnaea/data/station.py#L530
It's happening between 25-30 times per hour.
We're not seeing these errors in prod. Maybe this is caused by different code, dependencies, or data context between stage and prod?
This issue covers looking into this.
This might be related to #990.