louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License

Looping Crash After Deleting Monitor #3797

Closed · jjschwarz closed this issue 1 year ago

jjschwarz commented 1 year ago

Description

I removed a single monitor from the system in the UI with the Delete button. After a few minutes of waiting, I started getting WebSocket timeouts. I checked and saw that the Kubernetes pods were restarting in a loop.

👟 Reproduction steps

Delete monitor and wait

👀 Expected behavior

Monitor disappears and system continues to function

😓 Actual Behavior

System crashes in repeated loop upon start

🐻 Uptime-Kuma Version

1.22.1

💻 Operating System and Arch

Docker image from docker.io/louislam/uptime-kuma

🌐 Browser

Firefox 117.0.1

🐋 Docker Version

K8S v1.26.7

🟩 NodeJS Version

16

📝 Relevant log output

==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16
2023-09-24T08:37:45Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T08:37:45Z [SERVER] INFO: Node Env: production
2023-09-24T08:37:45Z [SERVER] INFO: Importing Node libraries
2023-09-24T08:37:45Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T08:37:46Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T08:37:46Z [SERVER] INFO: Server Type: HTTP
2023-09-24T08:37:46Z [SERVER] INFO: Importing this project modules
2023-09-24T08:37:46Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T08:37:46Z [SERVER] INFO: Version: 1.22.1
2023-09-24T08:37:46Z [DB] INFO: Data Dir: ./data/
2023-09-24T08:37:46Z [SERVER] INFO: Connecting to the Database
2023-09-24T08:37:47Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T08:37:47Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T08:37:47Z [SERVER] INFO: Connected
2023-09-24T08:37:47Z [DB] INFO: Your database version: 10
2023-09-24T08:37:47Z [DB] INFO: Latest database version: 10
2023-09-24T08:37:47Z [DB] INFO: Database patch not needed
2023-09-24T08:37:47Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T08:39:44Z [SERVER] INFO: Shutdown requested
2023-09-24T08:39:44Z [SERVER] INFO: Called signal: SIGTERM
2023-09-24T08:39:44Z [SERVER] INFO: Stopping all monitors
Trace: Error [ERR_SERVER_NOT_RUNNING]: Server is not running.
    at new NodeError (node:internal/errors:387:5)
    at Server.close (node:net:1804:12)
    at Object.onceWrapper (node:events:627:28)
    at Server.emit (node:events:525:35)
    at emitCloseNT (node:net:1857:8)
    at processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_SERVER_NOT_RUNNING'
}
    at process.<anonymous> (/app/server/server.js:1825:13)
    at process.emit (node:events:513:28)
    at emit (node:internal/process/promises:140:20)
    at processPromiseRejections (node:internal/process/promises:274:27)
    at processTicksAndRejections (node:internal/process/task_queues:97:32)
If you keep encountering errors, please report to https://github.com/louislam/uptime-kuma/issues
2023-09-24T08:39:46Z [DB] INFO: Closing the database

jjschwarz commented 1 year ago

Updated to the v1.23.2 image and got essentially the same results:

==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T08:49:55Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T08:49:55Z [SERVER] INFO: Node Env: production
2023-09-24T08:49:55Z [SERVER] INFO: Inside Container: true
2023-09-24T08:49:55Z [SERVER] INFO: Importing Node libraries
2023-09-24T08:49:55Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T08:49:56Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T08:49:56Z [SERVER] INFO: Server Type: HTTP
2023-09-24T08:49:56Z [SERVER] INFO: Importing this project modules
2023-09-24T08:49:56Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T08:49:56Z [SERVER] INFO: Version: 1.23.2
2023-09-24T08:49:56Z [DB] INFO: Data Dir: ./data/
2023-09-24T08:49:56Z [SERVER] INFO: Connecting to the Database
2023-09-24T08:50:00Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T08:50:00Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T08:50:00Z [SERVER] INFO: Connected
2023-09-24T08:50:00Z [DB] INFO: Your database version: 10
2023-09-24T08:50:00Z [DB] INFO: Latest database version: 10
2023-09-24T08:50:00Z [DB] INFO: Database patch not needed
2023-09-24T08:50:00Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T08:51:54Z [SERVER] INFO: Shutdown requested
2023-09-24T08:51:54Z [SERVER] INFO: Called signal: SIGTERM
Trace: Error [ERR_SERVER_NOT_RUNNING]: Server is not running.
    at new NodeError (node:internal/errors:387:5)
    at Server.close (node:net:1804:12)
    at Object.onceWrapper (node:events:627:28)
    at Server.emit (node:events:525:35)
    at emitCloseNT (node:net:1857:8)
    at processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_SERVER_NOT_RUNNING'
}
    at process.<anonymous> (/app/server/server.js:1886:13)
    at process.emit (node:events:513:28)
    at emit (node:internal/process/promises:140:20)
    at processPromiseRejections (node:internal/process/promises:274:27)
    at processTicksAndRejections (node:internal/process/task_queues:97:32)
If you keep encountering errors, please report to https://github.com/louislam/uptime-kuma/issues

jjschwarz commented 1 year ago

I restored the database from the nightly backup and got back into a working system. You can see in the attached logs where I delete the same monitor (#25)...and then the system falls apart. After a period of time the pod fails. Below is the full log from the pod that first failed after the delete; once that happens, we are back in the restart loop.

How can I fix this so I can delete a monitor? For now I will revert to the last good backup and wait for suggestions.

==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T09:06:49Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T09:06:49Z [SERVER] INFO: Node Env: production
2023-09-24T09:06:49Z [SERVER] INFO: Inside Container: true
2023-09-24T09:06:49Z [SERVER] INFO: Importing Node libraries
2023-09-24T09:06:49Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T09:06:50Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T09:06:50Z [SERVER] INFO: Server Type: HTTP
2023-09-24T09:06:50Z [SERVER] INFO: Importing this project modules
2023-09-24T09:06:50Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T09:06:50Z [SERVER] INFO: Version: 1.23.2
2023-09-24T09:06:50Z [DB] INFO: Data Dir: ./data/
2023-09-24T09:06:50Z [SERVER] INFO: Connecting to the Database
2023-09-24T09:06:50Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T09:06:50Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T09:06:50Z [SERVER] INFO: Connected
2023-09-24T09:06:50Z [DB] INFO: Your database version: 10
2023-09-24T09:06:50Z [DB] INFO: Latest database version: 10
2023-09-24T09:06:50Z [DB] INFO: Database patch not needed
2023-09-24T09:06:50Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: Database Patched Successfully
2023-09-24T09:06:50Z [SERVER] INFO: Load JWT secret from database.
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding route
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding socket handler
2023-09-24T05:06:50-04:00 [SERVER] INFO: Init the server
2023-09-24T05:06:50-04:00 [SERVER] INFO: Listening on 3001
2023-09-24T05:06:50-04:00 [SERVICES] INFO: Starting nscd
2023-09-24T05:07:06-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:06-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [MONITOR] WARN: Monitor #25 'Expense System K8S Production': Failing: connect ECONNREFUSED 192.168.1.112:30981 | Interval: 60 seconds | Type: http | Down Count: 0 | Resend Interval: 0
2023-09-24T05:07:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:32-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:46-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:08:01-04:00 [MANAGE] INFO: Delete Monitor: 25 User ID: 1
2023-09-24T05:08:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:08:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Caught error
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
2023-09-24T05:10:32-04:00 [AUTH] ERROR: Invalid token. IP=172.16.121.65
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #73 'WebDocs K8S Production': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #74 'MobileIron Sentry Admin Portal': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #75 'MobileIron VSP': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #76 'Avaya CM1 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:34-04:00 [MONITOR] WARN: Monitor #77 'Avaya CM2 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:35-04:00 [MONITOR] WARN: Monitor #78 'Workwave API Server': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http

chakflying commented 1 year ago

Deleting a monitor can take a long time if the database has been in use for a while and/or the retention period is long, and the server may be unresponsive while the delete runs. I'm guessing your Kubernetes config has a periodic health check, and when that check failed, Kubernetes sent SIGTERM to shut down the server.

I think your options are to disable the health check, delete the monitor in the UI, and hope that the delete eventually completes,

or alternatively to shut down the server, open the SQLite database with an external tool, and delete the monitor manually from there. Remember to check that foreign_keys is turned on if you do so.
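
For illustration, a minimal sketch of that manual route, assuming the default ./data/kuma.db file, the stock monitor/heartbeat table names, and the better-sqlite3 package (run it only while the server is stopped; this is not an official tool):

const Database = require("better-sqlite3");

// Assumed default database location; adjust to your data dir.
const db = new Database("./data/kuma.db");

// Without this pragma, the schema's ON DELETE CASCADE rules will not
// fire and orphaned child rows would be left behind.
db.pragma("foreign_keys = ON");

// Clearing the large heartbeat table first keeps the final cascade cheap.
// Monitor id 25 matches the one deleted in the logs above.
db.prepare("DELETE FROM heartbeat WHERE monitor_id = ?").run(25);
db.prepare("DELETE FROM monitor WHERE id = ?").run(25);

db.close();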

jjschwarz commented 1 year ago

I will try to remove the health check and test the delete. Any suggested health check limits?


chakflying commented 1 year ago

I don't think anyone has done any serious benchmarking on how long it takes, so do report back if you succeed since it would be a great data point!

Of course it would be best to restore the original health check once you are done deleting monitors.

louislam commented 1 year ago

Duplicate of #2781

And unfortunately, I think it can only be solved by using MariaDB/MySQL in the upcoming 2.0.0.

jjschwarz commented 1 year ago

I removed the health check, but the system never seemed to recover. I waited about 15 minutes and then restarted it. This time the restart was clean, so my guess is that the delete completed but the app crashed in the meantime. I had 30-second health checks, so we expanded that to 90 seconds. Interestingly, I stopped getting errors in the logs after 2 minutes or so, but the system was still up (non-responsive).

uptime-kuma.log

chakflying commented 1 year ago

Looking at the logs made me realize that all reads are blocked while a write is in progress. This contradicts my understanding that in SQLite's WAL mode, writers should not block readers. I'm assuming it's because we are only using a single connection across the entire application.

@louislam Do you remember what the issue was with using multiple connections in the SQLite connection pool config? If it was some memory leak, is it worth investigating whether we can work around it?
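
For context, the knob in question is the knex pool config; a hypothetical sketch of what a wider pool would look like (the max value is an assumption, and whether anything above 1 is safe here is exactly the open question):

const knexConfig = {
    client: "sqlite3",
    connection: {
        filename: "./data/kuma.db",
    },
    useNullAsDefault: true,
    pool: {
        // Hypothetical values: anything above max: 1 would let reads
        // proceed on other connections while a long delete runs.
        min: 1,
        max: 4,
    },
};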

Alternatively, I think I can modify the deleteMonitor function to first delete old heartbeats in fixed-size batches until they are all cleared, and then delete the monitor itself. This should prevent a huge cascading delete from blocking for too long.
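
A minimal sketch of that batched idea, assuming redbean-node's R.exec / R.getCell helpers (the server already uses R for raw SQL) and a hypothetical batch size; an illustration, not the actual deleteMonitor implementation:

const BATCH_SIZE = 10000; // hypothetical tuning knob

async function deleteMonitorBatched(monitorID) {
    // Delete heartbeats in bounded slices so no single statement holds
    // the SQLite write lock long enough to starve the health check.
    let remaining;
    do {
        await R.exec(
            "DELETE FROM heartbeat WHERE id IN (SELECT id FROM heartbeat WHERE monitor_id = ? LIMIT ?)",
            [ monitorID, BATCH_SIZE ],
        );
        remaining = await R.getCell(
            "SELECT COUNT(1) FROM heartbeat WHERE monitor_id = ?",
            [ monitorID ],
        );
    } while (remaining > 0);

    // With the big child table already empty, the final cascade is cheap.
    await R.exec("DELETE FROM monitor WHERE id = ?", [ monitorID ]);
}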