pulibrary / bibdata

Local API for retrieving bibliographic and other useful data from Alma (Ruby 3.2.0, Rails 7.1.3.4)

Troubleshoot connections on the new postgres box (pg13) #1566

Open hackartisan opened 3 years ago

hackartisan commented 3 years ago

Honeybadger

Background

In response to https://github.com/pulibrary/bibdata/issues/1514 and https://github.com/pulibrary/bibdata/issues/1531 we moved the bibdata alma boxes to point to the old postgres machine. This was to buy us time to figure out why we were losing connections. We want to move back to the new postgres box.

Process

We'd like to write a script that opens a database connection, sleeps for an increasing amount of time, and then writes to the connection. We can use this to pinpoint exactly when the connection is timing out, and try it under various conditions / environments.
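A minimal sketch of what such a script could look like, in Ruby. Everything named here is hypothetical (`probe_idle_timeout`, `pg_probe`, the `idle_probe` temp table) and none of it comes from the bibdata codebase. It assumes the `pg` gem is available, and it opens a fresh connection per interval so that one failed write does not affect the next measurement.

```ruby
# Hypothetical idle-connection probe; not part of bibdata.
# For each idle interval: open a connection, sleep, then try to write.
# Returns the first interval (in seconds) at which the write failed,
# or nil if every write succeeded.
def probe_idle_timeout(intervals:, connect:, write:,
                       sleeper: ->(s) { sleep(s) }, log: $stdout)
  intervals.each do |seconds|
    conn = connect.call
    begin
      sleeper.call(seconds)
      write.call(conn)
      log.puts "#{Time.now.utc} ok after #{seconds}s idle"
    rescue StandardError => e
      log.puts "#{Time.now.utc} FAILED after #{seconds}s idle: #{e.class}: #{e.message}"
      return seconds
    ensure
      begin
        conn.close if conn.respond_to?(:close)
      rescue StandardError
        # Ignore errors raised while closing an already-dead connection.
      end
    end
  end
  nil
end

# Wires the probe to PostgreSQL through the pg gem (assumed installed).
# Writes go to a temp table so no real data is touched.
def pg_probe(conninfo, intervals)
  require "pg"
  probe_idle_timeout(
    intervals: intervals,
    connect: -> { PG.connect(conninfo) },
    write: lambda do |conn|
      conn.exec("CREATE TEMP TABLE IF NOT EXISTS idle_probe (checked_at timestamptz)")
      conn.exec("INSERT INTO idle_probe VALUES (now())")
    end
  )
end
```

Something like `pg_probe(ENV["PROBE_DATABASE_URL"], [1, 2, 4, 8, 13].map { |h| h * 3600 })` could then be run from each box under test. The environment variable name and the intervals are placeholders to adjust for the timeout being investigated.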

Theory

Discussions on the ops team yielded the following theory:

Based on the error we got during indexing

For prod: The errors are coming from bibdata-alma-worker{1, 2}. bibdata-alma-worker3 was created 7/7, which is before these Honeybadger errors. That supports this theory, since that one wasn't erroring.

For staging: The errors are coming from bibdata-alma-worker-staging1 which also supports this theory.

Another interesting pattern is that the error always came from the worker boxes. Weren't all the boxes running sidekiq at the time? Why would the web server boxes not have hit this error?

Note that the staging db was later moved down to lib-postgres3, but prod was not.

Based on the error we got while downloading the scsb files

We got lost connections from both bibdata-alma-staging1 and bibdata-alma-worker-staging1. The database was switched between these two errors. So these errors also support the theory.

hackartisan commented 3 years ago

The firewall timeout was updated this afternoon from 12 to 24 hours.

hackartisan commented 3 years ago

We'll pursue turning that firewall timeout off for traffic from our VMs.

tpendragon commented 3 years ago

This is still happening: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891

kevinreiss commented 2 years ago

Still happening as of 8/5/2022: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891

kevinreiss commented 9 months ago

Do we still want to do this? I believe we've moved to a new environment this past year.