hypothesis / checkmate


OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac #60

Open sentry-io[bot] opened 3 years ago

sentry-io[bot] commented 3 years ago

Sentry Issue: CHECKMATE-6

OperationalError: SSL error: decryption failed or bad record mac

  File "sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)
  File "newrelic/hooks/database_psycopg2.py", line 65, in execute
    **kwargs)
  File "newrelic/hooks/database_dbapi2.py", line 39, in execute
    *args, **kwargs)

OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac

[SQL: INSERT INTO urlhaus_rule (id, hash, rule) VALUES (%(id_m0)s, %(hash_m0)s, %(rule_m0)s), (%(id_m1)s, %(hash_m1)s, %(rule_m1)s), (%(id_m2)s, %(hash_m2)s, %(rule_m2)s), (%(id_m3)s, %(hash_m3)s, %(rule_m3)s), (%(id_m4)s, %(hash_m4)s, %(rule_m4)s), (%(id_m5)s, %(hash_m5)s, %(rule_m5)s), (%(id_m6)s, %(hash_m6)s, %(rule_m6)s), (%(id_m7)s, %(hash_m7)s, %(rule_m7)s), (%(id_m8)s, %(hash_m8)s, %(rule_m8)s), (%(id_m9)s, %(hash_m9)s, %(...
(11 additional frame(s) were not displayed)
...
  File "sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)
  File "newrelic/hooks/database_psycopg2.py", line 65, in execute
    **kwargs)
  File "newrelic/hooks/database_dbapi2.py", line 39, in execute
    *args, **kwargs)
jon-betts commented 3 years ago

Some possible thoughts on this: https://medium.com/@philamersune/fixing-ssl-error-decryption-failed-or-bad-record-mac-d668e71a5409

This seems to be related to starting a connection in one thread and then using it in another.
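For illustration, here is a minimal sketch of the two usual mitigations for that failure mode (assuming SQLAlchemy with psycopg2, which is what the trace shows; the URL and the worker function are placeholders, not Checkmate code): either don't pool connections at all, or throw away any pooled connections a worker inherited before it runs its first query.

    from sqlalchemy import create_engine, text
    from sqlalchemy.pool import NullPool

    DB_URL = "postgresql+psycopg2://user:pass@db/checkmate"  # placeholder URL

    # Option 1: NullPool opens a fresh connection per checkout, so an SSL
    # session can never be shared between threads or across a fork.
    engine = create_engine(DB_URL, poolclass=NullPool)

    # Option 2: keep the default pool, but discard any connections inherited
    # from the parent process/thread before this worker issues a query.
    def worker():  # hypothetical worker entry point
        engine.dispose()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))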

jon-betts commented 3 years ago

Looking at the trace above, this isn't coming from our code at all, but from New Relic's introspection of our tasks. That's where the other thread comes from. So my guess is that if New Relic hits a reporting period while we're trying to update, it explodes.

We'll probably have to do a minor version of what we did in the WebSocket to turn off transaction tracing:

https://docs.newrelic.com/docs/agents/python-agent/configuration/python-agent-configuration#txn-tracer-settings
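For reference, that documented setting would look roughly like this in newrelic.ini (a sketch of the option described in the linked docs, not our actual config):

    [newrelic]
    transaction_tracer.enabled = false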

jon-betts commented 3 years ago

I've tried disabling New Relic transaction monitoring; now we'll have to wait and see whether the error comes back.

jon-betts commented 3 years ago

This is still working with the transaction monitoring disabled, but New Relic is still monkey patching Postgres, so we're going to try disabling it completely and see what happens.
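As a rough sketch of what disabling it completely could look like if the agent is started from code via newrelic.agent.initialize() (Checkmate may instead use the newrelic-admin wrapper, in which case the fix is simply dropping the wrapper from the run command); the NEW_RELIC_ENABLE flag below is a hypothetical name for illustration, not a real agent setting:

    import os

    # Only initialize the New Relic agent, and therefore its psycopg2 import
    # hooks, when explicitly requested via a hypothetical env var.
    if os.environ.get("NEW_RELIC_ENABLE") == "1":
        import newrelic.agent
        newrelic.agent.initialize("newrelic.ini")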

jon-betts commented 3 years ago

Currently deploying the PR which removes New Relic, just to rule it out as the source. Unfortunately we aren't going to know for a while, if ever, because this is very intermittent: we can sometimes go a whole week without seeing it.

jon-betts commented 3 years ago

I've deployed this now and closed the Sentry issue, but we'll need to keep an eye on it for at least a week, assuming it doesn't just show up again immediately.

seanh commented 3 years ago

@jon-betts Re-opening this because it happened again: https://sentry.io/organizations/hypothesis/issues/2086421688/ (latest event was today just after midnight, at 12:01:08).

It looks like New Relic is innocent here. Should we revert https://github.com/hypothesis/checkmate/pull/87 and https://github.com/hypothesis/checkmate/pull/83?

jon-betts commented 3 years ago

Yep