williamstein opened this issue 2 years ago
I saw something similar today. While migrating a database (essentially a restart with a delay), the websocket hubs were not able to reconnect properly. The log says the following; the oddest part is that it first states the DB connection is OK and then throws this exception:
2023-03-23T08:59:35.091Z: cocalc:debug:hub:healthcheck process_health_check: no DB connection problems – OK
2023-03-23T08:59:35.648Z: cocalc:error:hub BUG ****************************************************************************
2023-03-23T08:59:35.648Z: cocalc:error:hub Uncaught exception: Error: SASL: SCRAM-SERVER-FIRST-MESSAGE: client password must be a string
Error: SASL: SCRAM-SERVER-FIRST-MESSAGE: client password must be a string
at Object.continueSession (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/sasl.js:24:11)
at Client._handleAuthSASLContinue (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/client.js:257:10)
at Connection.emit (node:events:513:28)
at /cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/connection.js:114:12
at Parser.parse (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:104:9)
at Socket.<anonymous> (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/index.ts:7:48)
at Socket.emit (node:events:513:28)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at Socket.Readable.push (node:internal/streams/readable:228:10)
2023-03-23T08:59:35.655Z: cocalc:error:hub Error: SASL: SCRAM-SERVER-FIRST-MESSAGE: client password must be a string
at Object.continueSession (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/sasl.js:24:11)
at Client._handleAuthSASLContinue (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/client.js:257:10)
at Connection.emit (node:events:513:28)
at /cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/connection.js:114:12
at Parser.parse (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:104:9)
at Socket.<anonymous> (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/index.ts:7:48)
at Socket.emit (node:events:513:28)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at Socket.Readable.push (node:internal/streams/readable:228:10)
2023-03-23T08:59:35.655Z: cocalc:error:hub BUG ****************************************************************************
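For context on the traceback above: pg's SASL handshake raises "client password must be a string" whenever the `password` field in the client config is anything other than a string, e.g. `undefined` because an env var or secret was missing. A minimal sketch of failing fast on this at startup (the helper name is mine, not CoCalc's):

```javascript
// Sketch (hypothetical helper, not CoCalc code): pg rejects any
// non-string password with exactly the SASL error seen above, so an
// unset env var (undefined) or a numeric value only blows up later,
// deep inside the handshake. Checking the type up front turns that
// into an immediate, understandable startup failure instead.
function assertPasswordIsString(config) {
  if (typeof config.password !== "string") {
    throw new Error(
      `invalid PostgreSQL config: password is ${typeof config.password}, ` +
        "expected a string"
    );
  }
  return config;
}

// Example: a missing environment variable reads as `undefined`.
try {
  assertPasswordIsString({ user: "smc", password: process.env.NO_SUCH_VAR });
} catch (err) {
  console.error(err.message);
}
```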
Similarly, with a slightly different exception:
2023-03-23T08:59:25.092Z: cocalc:debug:hub:healthcheck process_health_check: no DB connection problems – OK
2023-03-23T08:59:25.747Z: cocalc:debug:primus Client(H9k5Yb4CbnDVwMk1gThhZ).push_to_client: [0 mesg_time_ms] [0 mesg_avg_ms] -- mesg.id=817c8bf6-1e74-4a3b-8cb2-35c5eac0b51d
ALL clients failed [
Error: Connection terminated unexpectedly
at Connection.<anonymous> (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:627:28)
at Connection.emit (node:events:513:28)
at Socket.<anonymous> (/cocalc/src/packages/node_modules/.pnpm/pg@8.8.0/node_modules/pg/lib/connection.js:107:12)
at Socket.emit (node:events:525:35)
at endReadableNT (node:internal/streams/readable:1358:12)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
]
Failed to connect to database -- ALL clients failed to connect
2023-03-23T08:59:29.888Z: cocalc:debug:hub:healthcheck process_health_check: concurrent 0 < NaN – OK
2023-03-23T08:59:29.888Z: cocalc:debug:hub:healthcheck uptime 1m37s – draining in 5h30m41s – terminating in 6h0m41s
2023-03-23T08:59:29.888Z: cocalc:debug:hub:healthcheck process_health_check: uptime 1m37s – draining in 5h30m41s – terminating in 6h0m41s – OK
2023-03-23T08:59:29.888Z: cocalc:debug:hub:healthcheck process_health_check: no DB connection problems – OK
Today I saw this:
ALL clients failed [
/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:369
name === 'notice' ? new NoticeMessage(length, messageValue) : new DatabaseError(messageValue, length, name)
^
error: password authentication failed for user "smc"
at Parser.parseErrorMessage (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:369:69)
at Parser.handlePacket (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:188:21)
at Parser.parse (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/parser.ts:103:30)
at Socket.<anonymous> (/cocalc/src/packages/node_modules/.pnpm/pg-protocol@1.5.0/node_modules/pg-protocol/src/index.ts:7:48)
at Socket.emit (node:events:513:28)
at addChunk (node:internal/streams/readable:315:12)
at readableAddChunk (node:internal/streams/readable:289:9)
at Socket.Readable.push (node:internal/streams/readable:228:10)
at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
length: 99,
severity: 'FATAL',
code: '28P01',
detail: undefined,
hint: undefined,
position: undefined,
internalPosition: undefined,
internalQuery: undefined,
where: undefined,
schema: undefined,
table: undefined,
column: undefined,
dataType: undefined,
constraint: undefined,
file: 'auth.c',
line: '331',
routine: 'auth_failed'
}
]
Failed to connect to database -- ALL clients failed to connect
Rather than retrying, when PostgreSQL's response is that the password is wrong, the only sane course of action is to shut down. This is not a temporary error like a network problem. The current design, which keeps the process running in this situation, leads to CoCalc servers not being properly detected as bad by Kubernetes.
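A sketch of the distinction I mean, using PostgreSQL's SQLSTATE codes (the handler and its names are hypothetical, not the actual hub code):

```javascript
// Sketch: distinguish PostgreSQL's permanent authentication failures
// (SQLSTATE 28P01 "invalid_password", as in the log above, and 28000
// "invalid_authorization_specification") from transient connection
// errors, and terminate instead of retrying on the former.
const FATAL_PG_CODES = new Set(["28P01", "28000"]);

function isFatalAuthError(err) {
  return err != null && FATAL_PG_CODES.has(err.code);
}

function handleConnectError(err) {
  if (isFatalAuthError(err)) {
    // Permanent misconfiguration: exit so the Kubernetes health checks
    // see a dead process and the pod gets restarted.
    console.error("fatal auth error, shutting down:", err.message);
    process.exit(1);
  }
  // Otherwise assume a transient problem (network, DB restart) and retry.
  return "retry";
}
```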
A deeper mystery is how this situation can come up at all, and I don't understand that. It might be that PostgreSQL has some security mechanism that denies even correct passwords in certain cases. Alternatively, maybe in the Kubernetes context our servers sometimes fall back to a bad password (e.g., a blank one) in some edge case, in which case the right solution would be to kill and restart such pods. The health checks will only accomplish that if the server terminates.