Closed: willkg closed this issue 4 years ago.
I've seen a couple of these happen in prod, but they're more like once a month. When they happen, we see it as an `InternalError` rather than an `OperationalError`. I think that's because we've updated dependencies and the exception changed.
In any case, all evidence suggests this happens in prod too, but at a much reduced frequency: prod data is more stable, so it's less likely that multiple tasks operating on the same data will hit locking issues.
We're going to push off looking into this further until after the prod migration.
These are more frequent than in production, so it appears to be related to the new stack. However, many things changed at once, so it is hard to pin down the cause.
I spent some time looking at the `BINARY` fields, like `wifi_shard_e.mac`, used as primary keys. A change in PyMySQL required an extra option, `binary_prefix`, to continue prefixing the values with `_binary` in the raw SQL. It is weird to work with binary fields in interactive sessions, but I believe we are using these correctly in the code, and there isn't an escaping bug in SQLAlchemy.
I investigated a deadlock with `show engine innodb status \G`. Here's some of the data:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2019-12-03 20:26:46 0x2b5fdf082700
*** (1) TRANSACTION:
TRANSACTION 26842249, ACTIVE 13 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 146 lock struct(s), heap size 24784, 238 row lock(s), undo log entries 240
MySQL thread id 159, OS thread handle 47690764539648, query id 17922983 172.31.46.156 location-rw updating
UPDATE wifi_shard_5 SET max_lat=<omitted>... WHERE wifi_shard_5.mac = _binary'??]???'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 231 page no 8162 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842249 lock_mode X locks rec but not gap waiting
Record lock, heap no 103 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** (2) TRANSACTION:
TRANSACTION 26842250, ACTIVE 13 sec starting index read
mysql tables in use 1, locked 1
213 lock struct(s), heap size 24784, 335 row lock(s), undo log entries 338
MySQL thread id 153, OS thread handle 47690763740928, query id 17924237 172.31.46.156 location-rw updating
UPDATE wifi_shard_5 SET max_lat=<omitted> ... WHERE wifi_shard_5.mac = _binary'?:^?-'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 231 page no 8162 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842250 lock_mode X locks rec but not gap
Record lock, heap no 103 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
Record lock, heap no 104 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 231 page no 7910 n bits 216 index PRIMARY of table `location`.`wifi_shard_5` trx id 26842250 lock_mode X locks rec but not gap waiting
Record lock, heap no 7 PHYSICAL RECORD: n_fields 20; compact format; info bits 0
...
*** WE ROLL BACK TRANSACTION (1)
These are challenging to read, but what I can read is:

- Each transaction holds a few hundred row locks, from the `INSERT` and `bulk_update_mappings` at the end of processing. This seems to be an expected number of locks for updates and insertions.
- The two transactions are updating different rows (different `WHERE` clauses), but each is waiting to get a lock on a row the other holds.

My gut says that MySQL 5.7 has changed how locks are granted, and that it is no longer a good idea to try to update several rows at a time using `bulk_update_mappings`, unless we can guarantee that only one update task runs at a time. However, I think we instead want many of these tasks to run in parallel, so we can scale out our async workers in response to the incoming data.
Instead, I think the station updating tasks should loop over the stations, running the equivalent of:
```
for each station:
    for a few attempts:
        try:
            START TRANSACTION;
            SELECT ... FOR UPDATE;
            UPDATE, INSERT, or do nothing;
            COMMIT;
        except Deadlock:
            attempt again, or raise if out of attempts
```
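A minimal Python sketch of that loop, with a stand-in `Deadlock` exception and a callable that runs one short transaction (the names here are hypothetical, not the project's actual code):

```python
import time


class Deadlock(Exception):
    """Stand-in for the driver's deadlock error (MySQL error 1213)."""


def update_station_with_retry(run_transaction, attempts=3, delay=1.0):
    # Each attempt runs one short transaction equivalent to:
    # START TRANSACTION; SELECT ... FOR UPDATE; UPDATE/INSERT; COMMIT.
    for attempt in range(attempts):
        try:
            return run_transaction()
        except Deadlock:
            if attempt == attempts - 1:
                raise  # out of luck
            time.sleep(delay)  # back off before retrying
```

Keeping each transaction short and taking the row lock up front (`SELECT ... FOR UPDATE`) means two tasks touching the same station serialize instead of deadlocking.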
Before starting to change code, I want to investigate some other things first:
I suspect the deadlocks are also the root cause of #990, but it will be easier to avoid and work around deadlocks than to determine why a multi-row update failed to update all rows.
One more thing to investigate. As mentioned in https://github.com/mozilla/ichnaea/issues/990#issuecomment-550016444, there is some code to retry on some errors. The retryable errors include deadlocks, but the constants could change between MySQL versions, and the exception details could differ between Python 2.6 and 3.7.
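For illustration, a hedged sketch of such a check (the function name is hypothetical): PyMySQL error exceptions carry `(errno, message)` in `args`, and keying the check on the numeric MySQL error code avoids depending on the exception class:

```python
# MySQL error numbers: ER_LOCK_DEADLOCK = 1213, ER_LOCK_WAIT_TIMEOUT = 1205.
RETRYABLE_MYSQL_ERRORS = {1213, 1205}


def is_retryable(exc: BaseException) -> bool:
    # PyMySQL errors (InternalError in 0.7.x, OperationalError in 0.8+)
    # carry (errno, message) in exc.args, so checking the numeric code
    # survives the exception-class change between driver versions.
    args = getattr(exc, "args", ())
    return bool(args) and args[0] in RETRYABLE_MYSQL_ERRORS
```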
This last idea was fruitful. In PyMySQL 0.7.11, a deadlock was a `pymysql.err.InternalError`. In PyMySQL 0.8.0, it changed to `pymysql.err.OperationalError`:
https://github.com/PyMySQL/PyMySQL/blob/0f4d45e5a20b47959ba7d16f130cbc0c7ce8506c/CHANGELOG#L39
My guess is that the deadlock is just as frequent in the 1.5.x stack, but a retry handled most cases.
This should be a quick fix, but I'd like to add a metric to track deadlocks caught by this code.
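A sketch of what counting caught deadlocks might look like, assuming a statsd-style client with an `incr` method (all names here are hypothetical, not the project's actual code):

```python
# Map MySQL error numbers to readable metric tags: deadlock (1213)
# and lock wait timeout (1205) should be graphable separately.
ERROR_NAMES = {1213: "deadlock", 1205: "lock_timeout"}


def record_db_error(stats, exc):
    # Pull the errno out of the driver exception's (errno, message) args
    # and emit a counter tagged with a human-readable reason.
    errno = exc.args[0] if getattr(exc, "args", ()) else 0
    reason = ERROR_NAMES.get(errno, "other")
    stats.incr("data.station.dberror", tags=["reason:%s" % reason])
```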
You rock!
Just some random thoughts from my end:
New related issue from fix in stage:
`AttributeError: 'StaleDataError' object has no attribute 'orig'`
The handler needs to be smart enough to only handle wrapped database driver issues.
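One way to make the handler defensive (a sketch; the function name is hypothetical): SQLAlchemy wraps driver exceptions in `DBAPIError` with the original exception on `.orig`, while ORM-level errors like `StaleDataError` have no such attribute:

```python
def wrapped_driver_errno(exc):
    # SQLAlchemy's DBAPIError keeps the raw driver exception on .orig.
    # ORM errors such as StaleDataError never have .orig, so return None
    # for those instead of blowing up with AttributeError.
    orig = getattr(exc, "orig", None)
    if orig is None:
        return None
    args = getattr(orig, "args", ())
    return args[0] if args else None
```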
With these changes merged, stage has gone from 500-1000 uncaught deadlock errors per day to 0 for the station updating task. The new `data.station.dberror` metric is recording 3-7 deadlocks (MySQL error 1213) per hour and 0-3 lock wait timeouts (error 1205) per hour. It's not clear why the instance count would be 6x greater with the Sentry exception recorder versus the exception handler with a 1 second sleep, but I'll take a happy mystery.
The code hasn't been deployed to production yet. It does not appear to have an effect on the `StaleDataError` rate (issue #990).
There's some additional follow-on work that we could track in new bugs, or as part of #990, such as reducing the `StaleDataError` counts. However, my gut has been wrong a few times (see issue history), while @willkg identified the root cause a month ago.

For the db connection thing, I think writing a new fixture that has the properties you need for a rollback test sounds like the right thing to do, if that's possible.
If the retry decorator we have in Antenna and Socorro isn't the right shape, it's not hard to write and test retry things, so we can roll our own here that has the shape we need. I like having retry logic as a library function rather than intertwined with the retryable code, but maybe it's not helpful to do that here.
Trying a `select ... for update` sounds fine. If we increase the batch size, does that increase contention issues? Right now, if processing hits problems, I think it dumps the data. It'd be nice if it dumped less data.
A lot of the last round of work was on issue #990, which covered the `StaleDataError` issue. Early locking with `SELECT ... FOR UPDATE` reduces the situations where a deadlock can occur.
We now have a retry decorator, using `backoff.on_exception` as the engine, that is used for station (bluetooth, wifi, and cell), datamap, and cell area updating.
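To show the shape of that pattern, here is a toy stand-in for the decorator (the real code uses the `backoff` library's `on_exception`; this self-contained version with hypothetical names is just for illustration):

```python
import functools
import time


def on_exception(exc_type, max_tries=5, base_delay=0.1):
    # Toy version of the backoff.on_exception pattern: retry the wrapped
    # function on exc_type with exponentially growing delays, re-raising
    # once max_tries attempts are exhausted.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_tries - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

Keeping the retry logic in a decorator keeps the station, datamap, and cell area updaters free of inline retry loops.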
We're tuning the sample rates in production and stage, to ensure the processing pipelines can keep up and Redis isn't under pressure to eject data.
We're using 1% for high-volume keys in production, and seeing 5-20 retried lock contentions per hour. This data is processed in production and sent to stage.
In stage, 50% is enough to saturate the pipeline from work day traffic, and to generate lots of locking contention. The pipeline can keep up at 25%, with 1-10 lock contentions per hour. At 1%, errors go away entirely.
We can now monitor the metrics `data.station.dberror` and `datamaps.dberror` to see how many lock contentions there are. If these increase to the point where tasks start failing, we can increase the lock timeout, increase the retry count, or reduce the sample rates.
I think there is more that could be done to make processing more efficient, but the processing pipeline is stable, so I'm closing the issue.
We're testing a new infrastructure, 4 years of changes, and a Python 2 -> 3 migration, all on stage right now. One of the things we did was add a geosubmit export in prod that points to stage, so stage has a steady stream of incoming reports to process.
One of the errors we're hitting is this:
https://sentry.prod.mozaws.net/operations/location-stage/issues/6582659/
The issue is coming from this line:
https://github.com/mozilla/ichnaea/blob/4b7aa181ddd708c3f0ab9fa6031c49a707d07d69/ichnaea/data/station.py#L530
It's happening between 25-30 times per hour.
We're not seeing these errors in prod. Maybe this is caused by different code, dependencies, or data context between stage and prod?
This issue covers looking into this.
This might be related to #990.