Investigate PostgreSQL connection errors

jpmckinney commented 5 months ago

On ocp13, there are log entries in /var/log/postgresql/postgresql-12-main.log like:

2024-01-20 22:05:47.271 UTC [2001370] [unknown]@[unknown] LOG:  could not accept SSL connection: Success
2024-01-20 22:05:47.273 UTC [2001383] kingfisher_process@kingfisher_process FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

These coincide with errors in Sentry for kingfisher-process like:

connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: SSL error: unsupported method
connection to server at "host.docker.internal" (172.17.0.1), port 5432 failed: FATAL:  no pg_hba.conf entry for host "172.18.0.7", user "kingfisher_process", database "kingfisher_process", SSL off

The connection string for the app is postgresql://kingfisher_process:PASSWORD@host.docker.internal:5432/kingfisher_process

Does /etc/postgresql/12/main/pg_hba.conf need to be changed? Its template is https://github.com/open-contracting/deploy/blob/main/salt/postgres/files/pg_hba.conf#L5-L18

RobHooper commented 1 month ago

I have just tested psql connections from a container on ocp13 and I can connect successfully. The Docker IP range is not explicitly allowed in pg_hba.conf however postgres.public_access is enabled on this server which allows these connections. I would also expect to see all connections fail if pg_hba.conf was the cause - this sounds like an intermittent issue.

These connections are failing is because of an SSL error (SSL error: unsupported method), when this happens it then falls back to "SSL disabled" which is blocked by pg_hba.conf.

One explanation could be resource exhaustion / server load, do these errors line up with alerts in Prometheus?

jpmckinney commented 1 month ago

I checked the Prometheus graphs for 2024-01-20, and it looks like normal load.

It does seem like an intermittent issue. We had seen it a few times, but not again since I reported this issue. I'll close it for now. We can re-open if it happens again (as we'll have corresponding logs to investigate with).

jpmckinney commented 3 days ago

@RobHooper This occurred yesterday on ocp13. You can see the "OperationalError" issues in Sentry for the registry-kingfisher-process project: https://open-contracting-partnership.sentry.io/issues/?notification_uuid=45e0f203-56e9-4e7e-a9b4-6211d0174981&project=5857059&query=is%3Aunresolved&referrer=issue-list&statsPeriod=14d

jpmckinney commented 3 days ago

I can confirm that it's happening presently and all 16 cores are at 100%.

Edit: I notice I can possibly save 50% of the queries from Kingfisher Process, so will deploy that in a bit. open-contracting/kingfisher-process@122f9283bc6349227e3c510a53635eda14c60368

Edit2: By looking at the active queries, I saw another spot where the same query was being made repetitively. This was a bit trickier to fix and might leave a message stuck in the queue. open-contracting/kingfisher-process@8125a3b0c204a5f5c11fbbd047fef95ac5fa3757

jpmckinney commented 2 days ago

Okay, load isn't 100% on all CPUs now, so I'll re-close.

open-contracting / deploy

Investigate PostgreSQL connection errors #482