hackartisan opened this issue 3 years ago
The firewall timeout was updated this afternoon from 12 to 24 hours.
We'll pursue turning that firewall timeout off for traffic from our VMs
This is still happening: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891
Still happening as of 8/5/2022: https://app.honeybadger.io/projects/54497/faults/80884987#comment_15950891.
Do we still want to do this? I believe we've moved to a new environment this past year.
Background
In response to https://github.com/pulibrary/bibdata/issues/1514 and https://github.com/pulibrary/bibdata/issues/1531 we moved the bibdata alma boxes to point to the old postgres machine. This was to buy us time to figure out why we were losing connections. We want to move back to the new postgres box.
Process
We'd like to write a script that opens a database connection, sleeps for an increasing amount of time, and then writes to the connection. We can use this to pinpoint exactly when the connection is timing out, and try it under various conditions / environments.
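A minimal sketch of such a probe in Ruby (assuming the `pg` gem; the connection string, host names, and interval schedule are illustrative, and a `SELECT 1` stands in for the write since any traffic on the idle socket exercises the timeout):

```ruby
# Probe sketch: open a connection, idle for an increasing interval, then
# send traffic to see whether the connection survived.
begin
  require "pg"
rescue LoadError
  # pg gem not installed; probe() can't run, but the interval schedule still can
end

# Doubling idle intervals, in seconds (1 minute up to 24 hours by default).
def probe_intervals(start: 60, limit: 24 * 3600)
  intervals = []
  t = start
  while t <= limit
    intervals << t
    t *= 2
  end
  intervals
end

# Hypothetical entry point; conninfo is a libpq-style connection string,
# e.g. "host=example-postgres dbname=bibdata" (placeholder values).
def probe(conninfo)
  probe_intervals.each do |seconds|
    conn = PG.connect(conninfo)
    sleep seconds
    begin
      conn.exec("SELECT 1") # any traffic on the idle socket tests the timeout
      puts "connection survived #{seconds}s idle"
    rescue PG::Error => e
      puts "connection lost after #{seconds}s idle: #{e.message}"
      break
    ensure
      conn.close unless conn.finished?
    end
  end
end
```

Running this from each box (web, worker, old and new postgres targets) would let us compare where and at what idle duration the connection first drops.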
Theory
Discussions on the ops team yielded the following theory:
Based on the error we got during indexing
For prod: The errors are coming from bibdata-alma-worker{1, 2}. bibdata-alma-worker3 was created 7/7, which is before these honeybadger errors, and it wasn't erroring, so that does support this theory.
For staging: The errors are coming from bibdata-alma-worker-staging1 which also supports this theory.
Another interesting pattern is that the error always came from the worker boxes. Weren't all the boxes running sidekiq at the time? Why would the web server boxes not have hit this error?
Note that the staging db was later moved down to lib-postgres3, but prod was not.
Based on the error we got while downloading the scsb files
We got lost-connection errors from both bibdata-alma-staging1 and bibdata-alma-worker-staging1. The database was switched between these two errors, so these errors also support the theory.