Closed malinah closed 6 months ago
> PGbouncer is not related. Why? Because up until today nothing even uses pgbouncer/pooler yet due to a small design mistake.
Yeah, I know. I reported to your team that the pooler POD might not be in the connection chain.
Nonetheless, when I dug deeper into this issue, I found that the cnpg-main-rw POD is running a pgbouncer PROCESS as well.
# k3s kubectl get pod -n ix-outline | grep cnpg | cut -f1 -d' ' | while read pod; do echo "\n$pod"; k3s kubectl exec pod/$pod -n ix-outline -- /bin/sh -c 'ls -ald /proc/*/exe' | grep -e /pgbouncer -e /postgres; done
outline-cnpg-main-pooler-rw-768d84474c-7vp6r
Defaulted container "pgbouncer" out of: pgbouncer, bootstrap-controller (init)
lrwxrwxrwx 1 pgbouncer pgbouncer 0 Jan 9 14:37 /proc/20/exe -> /usr/bin/pgbouncer
outline-cnpg-main-rw-6d8785fbb7-7krrh
Defaulted container "pgbouncer" out of: pgbouncer, bootstrap-controller (init)
lrwxrwxrwx 1 pgbouncer pgbouncer 0 Jan 9 14:33 /proc/20/exe -> /usr/bin/pgbouncer
outline-cnpg-main-1
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/28/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/29/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/30/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/31/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/42/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/43/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/44/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:34 /proc/45/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:49 /proc/80924/exe -> /usr/lib/postgresql/16/bin/postgres
lrwxrwxrwx 1 postgres tape 0 Jan 9 14:50 /proc/81385/exe -> /usr/lib/postgresql/16/bin/postgres
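The check above can be condensed so that each pod yields a single answer about which server binary it runs. This is only a sketch: `classify_exe_listing` is a hypothetical helper name, and it assumes the `ls -ald /proc/*/exe` output format shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any chart or of CNPG).
# Reads `ls -ald /proc/*/exe` output on stdin and prints one of:
#   pgbouncer | postgres | both | none
# depending on which binaries the exe symlinks point to.
classify_exe_listing() {
    awk '
        / -> .*\/pgbouncer$/ { bouncer = 1 }
        / -> .*\/postgres$/  { postgres = 1 }
        END {
            if (bouncer && postgres) print "both"
            else if (bouncer)        print "pgbouncer"
            else if (postgres)       print "postgres"
            else                     print "none"
        }'
}

# Possible use against a live cluster (namespace taken from the command above):
#   k3s kubectl exec "pod/$pod" -n ix-outline -- /bin/sh -c 'ls -ald /proc/*/exe' \
#       | classify_exe_listing
```

Fed the listings above, this would report `pgbouncer` for both the pooler-rw and the cnpg-main-rw pods and `postgres` for cnpg-main-1.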
Let me reiterate one point. While Outline was in the CrashLoopBackOff state due to a database connection error, I tried to connect to each of these pods with Adminer using the same credentials. This is how it went:
Pod | Process | Result
pod/outline-cnpg-main-pooler-rw-768d84474c-7vp6r | pgbouncer | server login has been failing, try again later
pod/outline-cnpg-main-rw-6d8785fbb7-7krrh | pgbouncer | timeout
pod/outline-cnpg-main-1 | postgres | OK, connected
And since the database URL outline-cnpg-main-rw.ix-outline.svc.cluster.local points to the cnpg-main-rw SERVICE which points to the cnpg-main-rw POD, as described in the first message of this issue, this makes me believe there really is a pgbouncer PROCESS in the way.
Also, the Outline log shows the message "login has been failing, try again later", which I believe is unique to pgbouncer: https://github.com/pgbouncer/pgbouncer/blob/master/src/objects.c#L815
Anyway, I'm learning about this as I go and I don't really know how to diagnose this better, so any help would be appreciated.
Thanks
We've no control over what runs or does not run in any CNPG pod. So it's not worth our time to overdiagnose issues related to them.
Read your logs. The issue is a simple case of a wrong username and password.
> read your logs. Issue is a simple case of wrong username and password.
That's a residue of my manual connection testing.
That explanation doesn't match the pgbouncer error message in the Outline logs. It doesn't explain why it's possible to connect directly with Adminer, nor why Outline would change its DB credentials at random times and then, after maybe 10 crash loops, manage to recover.
> We've no control over what runs or does not run in any CNPG pod. So it's not worth our time to overdiagnose issues related to them.
Cool.
This no longer seems to be an issue.
App Name
Outline
Operating System
TrueNAS SCALE 23.10.1
App Version
0.74.0 11.1.8
Application Events
Application Logs
Application Configuration
https://discord.com/channels/830763548678291466/1192562154121986098/1192570440850362378
Describe the bug
Outline stops working after an undetermined amount of time. Sometimes it's hours, sometimes it's days.
The Outline pod log says it cannot connect to the database, and the pod keeps crashing.
After manual examination I found that, in the crashing state, it is still possible to connect to the PostgreSQL server directly, just not through pgbouncer. Details are provided in the Additional Context section.
This happens only with the Outline chart. Other charts with CNPG work fine.
To Reproduce
Install Outline. Set up OpenID and S3. Create a document and wait.
Expected Behavior
Doesn't crash.
Screenshots
-
Additional Context
This is the state the chart got stuck in:
Chain of IPs for the recommended way to connect to the chart's db:
outline-cnpg-main-rw.ix-outline.svc.cluster.local 172.17.194.185
service/outline-cnpg-main-rw ClusterIP 172.17.194.185 <none> 5432/TCP 174m
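The last hop of this chain, from the service to the pods behind it, can be checked by asking Kubernetes which pods currently back the service. A minimal sketch, assuming the cluster still exposes the classic Endpoints resource: `service_backends` and the `KUBECTL` override are hypothetical names introduced here, while the namespace and service name are the ones from this issue.

```shell
#!/bin/sh
# KUBECTL is an assumption: it defaults to the `k3s kubectl` wrapper used on
# TrueNAS SCALE and can be overridden, e.g. KUBECTL=kubectl.
KUBECTL=${KUBECTL:-"k3s kubectl"}

# Hypothetical helper: print the names of the pods that currently back a
# service, one per line, by reading the service's Endpoints object.
#   $1 = namespace, $2 = service name
service_backends() {
    ns=$1
    svc=$2
    # $KUBECTL is intentionally unquoted so "k3s kubectl" splits into two words.
    $KUBECTL get endpoints "$svc" -n "$ns" \
        -o jsonpath='{range .subsets[*].addresses[*]}{.targetRef.name}{"\n"}{end}'
}

# Possible use:
#   service_backends ix-outline outline-cnpg-main-rw
```

If the pod name printed here is one that runs pgbouncer, traffic sent to the service address passes through pgbouncer before it reaches postgres.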
Connecting with Adminer directly to these pods gives the following results: