Closed — jjschwarz closed this issue 1 year ago
Updated to v1.23.2 image and essentially the same results:
==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T08:49:55Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T08:49:55Z [SERVER] INFO: Node Env: production
2023-09-24T08:49:55Z [SERVER] INFO: Inside Container: true
2023-09-24T08:49:55Z [SERVER] INFO: Importing Node libraries
2023-09-24T08:49:55Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T08:49:56Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T08:49:56Z [SERVER] INFO: Server Type: HTTP
2023-09-24T08:49:56Z [SERVER] INFO: Importing this project modules
2023-09-24T08:49:56Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T08:49:56Z [SERVER] INFO: Version: 1.23.2
2023-09-24T08:49:56Z [DB] INFO: Data Dir: ./data/
2023-09-24T08:49:56Z [SERVER] INFO: Connecting to the Database
2023-09-24T08:50:00Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T08:50:00Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T08:50:00Z [SERVER] INFO: Connected
2023-09-24T08:50:00Z [DB] INFO: Your database version: 10
2023-09-24T08:50:00Z [DB] INFO: Latest database version: 10
2023-09-24T08:50:00Z [DB] INFO: Database patch not needed
2023-09-24T08:50:00Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T08:50:00Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T08:51:54Z [SERVER] INFO: Shutdown requested
2023-09-24T08:51:54Z [SERVER] INFO: Called signal: SIGTERM
Trace: Error [ERR_SERVER_NOT_RUNNING]: Server is not running.
at new NodeError (node:internal/errors:387:5)
at Server.close (node:net:1804:12)
at Object.onceWrapper (node:events:627:28)
at Server.emit (node:events:525:35)
at emitCloseNT (node:net:1857:8)
at processTicksAndRejections (node:internal/process/task_queues:82:21) {
code: 'ERR_SERVER_NOT_RUNNING'
}
at process.
I restored the database from the nightly backup and was back in a working system. You can see in the attached logs where I delete the same monitor (#25), and then the system falls apart; after a period of time the pod fails. Below is the full log from the pod that first fails after the delete. Once that happens we are back in the restart loop.
How can I fix this so I can delete a monitor? For now I will revert to the last good backup and wait for suggestions.
==> Performing startup jobs and maintenance tasks
==> Starting application with user 0 group 0
Welcome to Uptime Kuma
Your Node.js version: 16.20.2
2023-09-24T09:06:49Z [SERVER] INFO: Welcome to Uptime Kuma
2023-09-24T09:06:49Z [SERVER] INFO: Node Env: production
2023-09-24T09:06:49Z [SERVER] INFO: Inside Container: true
2023-09-24T09:06:49Z [SERVER] INFO: Importing Node libraries
2023-09-24T09:06:49Z [SERVER] INFO: Importing 3rd-party libraries
2023-09-24T09:06:50Z [SERVER] INFO: Creating express and socket.io instance
2023-09-24T09:06:50Z [SERVER] INFO: Server Type: HTTP
2023-09-24T09:06:50Z [SERVER] INFO: Importing this project modules
2023-09-24T09:06:50Z [NOTIFICATION] INFO: Prepare Notification Providers
2023-09-24T09:06:50Z [SERVER] INFO: Version: 1.23.2
2023-09-24T09:06:50Z [DB] INFO: Data Dir: ./data/
2023-09-24T09:06:50Z [SERVER] INFO: Connecting to the Database
2023-09-24T09:06:50Z [DB] INFO: SQLite config:
[ { journal_mode: 'wal' } ]
[ { cache_size: -12000 } ]
2023-09-24T09:06:50Z [DB] INFO: SQLite Version: 3.41.1
2023-09-24T09:06:50Z [SERVER] INFO: Connected
2023-09-24T09:06:50Z [DB] INFO: Your database version: 10
2023-09-24T09:06:50Z [DB] INFO: Latest database version: 10
2023-09-24T09:06:50Z [DB] INFO: Database patch not needed
2023-09-24T09:06:50Z [DB] INFO: Database Patch 2.0 Process
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-invert-keyword.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-json-query.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-added-kafka-producer.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-certificate-expiry-status-page.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-monitor-oauth-cc.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-timeout-monitor.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is not patched
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql is patching
2023-09-24T09:06:50Z [DB] INFO: patch-add-gamedig-given-port.sql was patched successfully
2023-09-24T09:06:50Z [DB] INFO: Database Patched Successfully
2023-09-24T09:06:50Z [SERVER] INFO: Load JWT secret from database.
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding route
2023-09-24T05:06:50-04:00 [SERVER] INFO: Adding socket handler
2023-09-24T05:06:50-04:00 [SERVER] INFO: Init the server
2023-09-24T05:06:50-04:00 [SERVER] INFO: Listening on 3001
2023-09-24T05:06:50-04:00 [SERVICES] INFO: Starting nscd
2023-09-24T05:07:06-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:06-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:06-04:00 [MONITOR] WARN: Monitor #25 'Expense System K8S Production': Failing: connect ECONNREFUSED 192.168.1.112:30981 | Interval: 60 seconds | Type: http | Down Count: 0 | Resend Interval: 0
2023-09-24T05:07:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:32-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:07:46-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:07:46-04:00 [AUTH] INFO: Successfully logged in user admin. IP=172.16.121.65
2023-09-24T05:08:01-04:00 [MANAGE] INFO: Delete Monitor: 25 User ID: 1
2023-09-24T05:08:32-04:00 [AUTH] INFO: Login by token. IP=172.16.121.65
2023-09-24T05:08:32-04:00 [AUTH] INFO: Username from JWT: admin
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Caught error
2023-09-24T05:10:04-04:00 [MONITOR] ERROR: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
2023-09-24T05:10:32-04:00 [AUTH] ERROR: Invalid token. IP=172.16.121.65
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #73 'WebDocs K8S Production': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:32-04:00 [MONITOR] WARN: Monitor #74 'MobileIron Sentry Admin Portal': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #75 'MobileIron VSP': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
2023-09-24T05:10:33-04:00 [MONITOR] WARN: Monitor #76 'Avaya CM1 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:34-04:00 [MONITOR] WARN: Monitor #77 'Avaya CM2 Management SSH': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: port
2023-09-24T05:10:35-04:00 [MONITOR] WARN: Monitor #78 'Workwave API Server': Pending: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call? | Max retries: 1 | Retry: 1 | Retry Interval: 60 seconds | Type: http
Deleting a monitor can take a long time if the database has been in use for a while and/or the retention period is long, and the server may be unresponsive for the duration. I'm guessing your K8s config has a periodic health check, and when the health check failed, K8s sent SIGTERM to shut down the server.
I think your options are to disable the health check, delete the monitor using the UI, and hope that it completes after a while,
or alternatively to shut down the server, open the SQLite database with an external tool, and delete the monitor manually from there. Remember to check that foreign_keys are turned on if you do so.
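To illustrate why that foreign_keys warning matters, here is a minimal sketch with Python's built-in `sqlite3` module over a toy schema. The `monitor`/`heartbeat` table names and the `ON DELETE CASCADE` relationship mirror the general shape of the Uptime Kuma database but are simplified for the example; check your actual schema before deleting anything from a real database.

```python
import sqlite3

# Toy schema mirroring the monitor -> heartbeat relationship
# (illustrative only; the real Uptime Kuma schema is larger).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monitor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE heartbeat (
        id INTEGER PRIMARY KEY,
        monitor_id INTEGER REFERENCES monitor(id) ON DELETE CASCADE
    );
    INSERT INTO monitor VALUES (25, 'Expense System K8S Production');
    INSERT INTO heartbeat (monitor_id) VALUES (25), (25), (25);
""")

# SQLite enforces foreign keys per connection, OFF by default:
# without this pragma, ON DELETE CASCADE would not fire and the
# heartbeat rows would be left orphaned.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("DELETE FROM monitor WHERE id = 25")
conn.commit()

orphans = conn.execute("SELECT COUNT(*) FROM heartbeat").fetchone()[0]
print(orphans)  # 0 -- the cascade removed the dependent heartbeats
```

With the pragma left at its default (off), the same `DELETE` would succeed but leave the three heartbeat rows dangling, which is exactly the kind of inconsistency the comment above warns about.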
I will try to remove the health check and test the delete. Any suggested health check limits?
I don't think anyone has done any serious benchmarking on how long it takes, so do report back if you succeed since it would be a great data point!
Of course it would be best to restore the original health check once you are done deleting monitors.
Duplicate of #2781
And unfortunately, I think it can only be solved by moving to MariaDB/MySQL in the upcoming 2.0.0.
I removed the health check, but it never seemed to recover. I waited about 15 minutes and then restarted it. This time the restart was clean, so my guess is the delete completed but the app crashed in the meantime. I had 30-second health checks, so we expanded that to 90 seconds. Interestingly, I stopped getting errors in the logs after two minutes or so, but the system, while still up, was unresponsive.
Looking at the logs made me realize that all reads are blocked while writing. This contradicts my understanding that in SQLite WAL mode, writers should not block readers. I'm assuming it's because we use only a single connection across the entire application.
@louislam Do you remember what the issue was with using multiple connections in the SQLite connection pool config? If there was some memory leak, is it worth investigating whether we can work around it?
Alternatively, I think I can modify the deleteMonitor function to delete old heartbeats first in fixed-size batches until all heartbeats are cleared, and only then delete the monitor. This should prevent one huge cascading delete from blocking for too long.
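The batching idea above could look roughly like this. This is a sketch of the approach only, in Python over a toy schema; the real deleteMonitor lives in the Uptime Kuma server code and uses Knex, and the `delete_monitor_batched` helper name is invented for illustration.

```python
import sqlite3

def delete_monitor_batched(conn, monitor_id, batch_size=1000):
    """Delete a monitor's heartbeats in small batches before removing
    the monitor row, so no single transaction holds the write lock
    for the whole cleanup. (Illustrative helper, not Uptime Kuma code.)"""
    while True:
        cur = conn.execute(
            "DELETE FROM heartbeat WHERE id IN "
            "(SELECT id FROM heartbeat WHERE monitor_id = ? LIMIT ?)",
            (monitor_id, batch_size),
        )
        conn.commit()  # release the write lock between batches
        if cur.rowcount < batch_size:
            break  # last (partial or empty) batch done
    conn.execute("DELETE FROM monitor WHERE id = ?", (monitor_id,))
    conn.commit()

# Toy schema to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monitor (id INTEGER PRIMARY KEY);
    CREATE TABLE heartbeat (id INTEGER PRIMARY KEY, monitor_id INTEGER);
    INSERT INTO monitor VALUES (25);
""")
conn.executemany("INSERT INTO heartbeat (monitor_id) VALUES (?)",
                 [(25,)] * 2500)

delete_monitor_batched(conn, 25, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM heartbeat").fetchone()[0]
print(remaining)  # 0
```

Because the write lock is released after every batch, monitor checks and UI reads can interleave between batches instead of timing out behind one multi-minute cascade.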
Description
I removed a single monitor from the system in the UI with the Delete button. After a few minutes of waiting I started getting WebSocket timeouts. I checked and saw the Kubernetes pod was restarting in a loop.
👟 Reproduction steps
Delete monitor and wait
👀 Expected behavior
Monitor disappears and system continues to function
😓 Actual Behavior
System crashes in a repeated loop on startup
🐻 Uptime-Kuma Version
1.22.1
💻 Operating System and Arch
Docker image from docker.io/louislam/uptime-kuma
🌐 Browser
Firefox 117.0.1
🐋 Docker Version
K8S v1.26.7
🟩 NodeJS Version
16
📝 Relevant log output