hackartisan opened this issue 3 years ago
The firewall timeout was updated this afternoon from 12 to 24 hours.
We'll pursue turning that firewall timeout off for traffic from our VMs
This is still happening: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891
Still happening as of 8/5/2022: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891.
Do we still want to do this? I believe we've moved to a new environment this past year.
Background
In response to https://github.com/pulibrary/bibdata/issues/1514 and https://github.com/pulibrary/bibdata/issues/1531 we moved the bibdata alma boxes to point to the old postgres machine. This was to buy us time to figure out why we were losing connections. We want to move back to the new postgres box.
Process
We'd like to write a script that opens a database connection, sleeps for an increasing amount of time, and then writes to the connection. We can use this to pinpoint exactly when the connection is timing out, and try it under various conditions / environments.
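A minimal sketch of such a probe in Ruby (assuming the `pg` gem; the connection string, host names, and interval schedule are illustrative, and a `SELECT 1` stands in for the write since any traffic on the idle socket exercises the timeout):

```ruby
# Probe sketch: open a connection, idle for an increasing interval, then
# send traffic to see whether the connection survived.
begin
  require "pg"
rescue LoadError
  # pg gem not installed; probe() can't run, but the interval schedule still can
end

# Doubling idle intervals, in seconds (1 minute up to 24 hours by default).
def probe_intervals(start: 60, limit: 24 * 3600)
  intervals = []
  t = start
  while t <= limit
    intervals << t
    t *= 2
  end
  intervals
end

# Hypothetical entry point; conninfo is a libpq-style connection string,
# e.g. "host=example-postgres dbname=bibdata" (placeholder values).
def probe(conninfo)
  probe_intervals.each do |seconds|
    conn = PG.connect(conninfo)
    sleep seconds
    begin
      conn.exec("SELECT 1") # any traffic on the idle socket tests the timeout
      puts "connection survived #{seconds}s idle"
    rescue PG::Error => e
      puts "connection lost after #{seconds}s idle: #{e.message}"
      break
    ensure
      conn.close unless conn.finished?
    end
  end
end
```

Running this from each box (web, worker, old and new postgres targets) would let us compare where and at what idle duration the connection first drops.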
Theory
Discussions on the ops team yielded the following theory:
Based on the error we got during indexing
For prod: The errors are coming from bibdata-alma-worker{1, 2}. bibdata-alma-worker3 was created 7/7, which is before these honeybadger errors, and it wasn't erroring, so that does support this theory.
For staging: The errors are coming from bibdata-alma-worker-staging1 which also supports this theory.
Another interesting pattern is that the error always came from the worker boxes. Weren't all the boxes running sidekiq at the time? Why would the web server boxes not have hit this error?
Note that the staging db was later moved down to lib-postgres3, but prod was not.
Based on the error we got while downloading the scsb files
We got lost-connection errors from both bibdata-alma-staging1 and bibdata-alma-worker-staging1. The database was switched between these two errors, so these errors also support the theory.