lidofinance / lido-keys-api

Api for fetching node operators keys from modules
MIT License
19 stars 15 forks source link

Reconnect on read-only SQL replica #268

Open kucharskim opened 6 months ago

kucharskim commented 6 months ago

We have active-passive PostgreSQL setup and we failover our database instances, when there is a need for it. Today we tested a failover scenario in a controlled manner and lido-keys-api needed a manual restart, because of following errors:

k logs  deployment.apps/lido-keys-api-mainnet  --since=1m
{"context":{"name":"Update validators from ValidatorsRegistry"},"level":"warn","message":"Job terminated with an error"}
{"code":"25006","file":"utility.c","length":118,"level":"error","line":"242","message":"delete from \"consensus_validator\" - cannot execute DELETE in a read-only transaction","routine":"PreventCommandIfReadOnly","severity":"ERROR","stack":["DriverException: delete from \"consensus_validator\" - cannot execute DELETE in a read-only transaction\n    at PostgreSqlExceptionConverter.convertException (/app/lido-keys-api/node_modules/@mikro-orm/core/platforms/ExceptionConverter.js:8:16)\n    at PostgreSqlExceptionConverter.convertException (/app/lido-keys-api/node_modules/@mikro-orm/postgresql/PostgreSqlExceptionConverter.js:42:22)\n    at PostgreSqlDriver.convertException (/app/lido-keys-api/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:192:54)\n    at /app/lido-keys-api/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:196:24\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async SqlEntityManager.nativeDelete (/app/lido-keys-api/node_modules/@mikro-orm/core/EntityManager.js:587:21)\n    at async StorageService.deleteValidators (/app/lido-keys-api/node_modules/@lido-nestjs/validators-registry/dist/storage/storage.service.js:33:9)\n    at async em.transactional.isolationLevel (/app/lido-keys-api/node_modules/@lido-nestjs/validators-registry/dist/validators-registry.js:64:13)\n    at async fork.getConnection.transactional (/app/lido-keys-api/node_modules/@mikro-orm/core/EntityManager.js:496:29)\n    at async PostgreSqlConnection.transactional (/app/lido-keys-api/node_modules/@mikro-orm/knex/AbstractSqlConnection.js:36:25)\n\nprevious error: delete from \"consensus_validator\" - cannot execute DELETE in a read-only transaction\n    at Parser.parseErrorMessage (/app/lido-keys-api/node_modules/pg-protocol/dist/parser.js:287:98)\n    at Parser.handlePacket (/app/lido-keys-api/node_modules/pg-protocol/dist/parser.js:126:29)\n    at Parser.parse (/app/lido-keys-api/node_modules/pg-protocol/dist/parser.js:39:38)\n    at Socket.<anonymous> (/app/lido-keys-api/node_modules/pg-protocol/dist/index.js:11:42)\n    at Socket.emit (node:events:513:28)\n    at Socket.emit (node:domain:489:12)\n    at addChunk (node:internal/streams/readable:324:12)\n    at readableAddChunk (node:internal/streams/readable:297:9)\n    at Readable.push (node:internal/streams/readable:234:10)\n    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)"]}

We would like that lido-keys-api automatically handles a flip of database instance from read-write (when it is still an active master) to a read-only replica (when previously active master, is not any more an active master), by tearing down all active SQL sessions, maybe sleeping for 5 seconds or so, and then re-establishing them from scratch.