/healthz returning OK when DB is failing

hnb2 commented 1 month ago

Bug Description

Hi there, we are monitoring the status of our n8n instance through the /healthz endpoint, but the instance went down several times and each time /healthz still returned {"status": "ok"} with a 200, while the home page (just "/") showed a stack trace with a 503 error.

Stacktrace:

{"code":503,"message":"Database is not ready!","stacktrace":"ResponseError: Database is not ready!\n    at /home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/n8n/src/AbstractServer.ts:131:34\n    at newFn (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express-async-errors/index.js:16:20)\n    at Layer.handle [as handle_request] (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/layer.js:95:5)\n    at trim_prefix (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/index.js:328:13)\n    at /home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/index.js:286:9\n    at Function.process_params (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/index.js:346:12)\n    at next (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/index.js:280:10)\n    at expressInit (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/middleware/init.js:40:5)\n    at newFn (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express-async-errors/index.js:16:20)\n    at Layer.handle [as handle_request] (/home/bas/app_0b10e6b9-6de8-4823-9a3a-e47f2c6273d4/node_modules/express/lib/router/layer.js:95:5)"}

Logs on stdout:

2024-08-01T15:39:59.227+04:00
Started with job ID: 161475 (Execution ID: 1470798)
2024-08-01T15:40:58.953+04:00
Error: Connection terminated unexpectedly
2024-08-01T15:40:58.954+04:00
Error: Connection terminated unexpectedly
2024-08-01T15:41:07.399+04:00
503 Database is not ready!
2024-08-01T15:42:07.398+04:00
503 Database is not ready!
2024-08-01T15:43:07.398+04:00
503 Database is not ready!
2024-08-01T15:44:07.399+04:00
503 Database is not ready!
2024-08-01T15:45:07.399+04:00
503 Database is not ready!
2024-08-01T15:46:07.399+04:00
503 Database is not ready!
2024-08-01T15:47:07.398+04:00
503 Database is not ready!
2024-08-01T15:48:07.398+04:00
503 Database is not ready!
2024-08-01T15:49:07.398+04:00

Immediately after the error, no jobs at all were being handled and they were all lost.

Thank you for your help, please let me know if you need any other details.

To Reproduce

I'm not sure how to intentionally reproduce this, probably by shutting down the DB temporarily? We are using Postgres 15, if it makes any difference.

Right after the incident I checked the metrics on the db and the application server and everything was fine, no overload or anything suspicious.

Expected behavior

I would expect the /healthz endpoint to return a different body and status to indicate the failure so our alerting can do its job.

Operating System

Linux

n8n Version

1.51.2

Node.js Version

18

Database

PostgreSQL

Execution mode

queue

netroy commented 1 month ago

We had to do this intentionally. When /healthz didn't return a 200 during migrations and the migrations took a long time, the orchestrator (like Kubernetes) would mark the instance as unhealthy and kill it before the migrations could finish. This led to the instance never managing to get started. We understand that this is an anti-pattern, but until we have a better way to prevent Kubernetes from killing instances during long-running migrations, we can't change this.

hnb2 commented 1 month ago

Hi @netroy Thank you for the quick answer.

Do you suggest we instead monitor the home page (/) for anything other than a 200?

I saved a curl response during the incident and we had this:

< HTTP/2 503

(keeping in mind what you are saying about migrations)
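
As a stopgap, the 503 on "/" can serve as the failure signal. Since a long migration might also make "/" fail for a while (an assumption drawn from the comment above, not something verified here), a monitor could alert only after several consecutive failed probes. A minimal sketch of that idea; `shouldAlert` and the threshold value are hypothetical and not part of n8n:

```typescript
// Hypothetical helper, not an n8n API.
// Returns true when the most recent `threshold` probes of "/" all failed.
// `statuses` is ordered oldest to newest; null means the request itself failed
// (connection refused, timeout, and so on).
function shouldAlert(statuses: Array<number | null>, threshold: number): boolean {
  if (threshold <= 0 || statuses.length < threshold) {
    return false; // not enough history to decide
  }
  const recent = statuses.slice(-threshold);
  return recent.every((status) => status !== 200);
}
```

With one probe per minute and a threshold of 3, a single failed probe stays silent, while an outage like the one in the logs above would alert after about three minutes.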

netroy commented 1 month ago

We could add a /health/db endpoint that only returns 200 when the DB is ready. Would that help?
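
A rough sketch of how such a split could behave, not n8n's actual implementation: /healthz stays a liveness check that returns 200 whenever the process is up (so an orchestrator does not kill it during long migrations), while the proposed /health/db acts as a readiness check. The `ServerState` shape, the function names, and the body returned in the failing case are assumptions made for illustration.

```typescript
// Hypothetical state tracked by the server; field names are illustrative only.
interface ServerState {
  dbConnected: boolean;
  migrationsFinished: boolean;
}

interface HealthResponse {
  status: number;
  body: { status: string };
}

// Liveness: the process is up and able to answer HTTP requests.
// It deliberately ignores the database so long migrations do not get the instance killed.
function healthz(_state: ServerState): HealthResponse {
  return { status: 200, body: { status: "ok" } };
}

// Readiness: report 200 only when the database is connected and migrated.
function healthDb(state: ServerState): HealthResponse {
  const ready = state.dbConnected && state.migrationsFinished;
  return ready
    ? { status: 200, body: { status: "ok" } }
    : { status: 503, body: { status: "error" } };
}
```

With this split, alerting would poll /health/db, while the orchestrator's liveness probe keeps using /healthz.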

hnb2 commented 1 month ago

Hi @netroy, that sounds like a good idea.

Joffcom commented 1 month ago

Just adding the internal reference for this issue, which is N8N-7547.

janober commented 1 week ago

Fix got released with n8n@1.58.0
