HTTP monitors all have a status of "canceled" after update to 1.23.4

baubie commented 11 months ago

⚠️ Please verify that this bug has NOT been raised before.

[X] I checked and didn't find similar issue

🛡️ Security Policy

[X] I agree to have read this project Security Policy

Description

HTTP monitors all showing "canceled" after update to 1.23.4 and it must be related to https://github.com/louislam/uptime-kuma/pull/3961#issuecomment-1804149144

When I checked, the timeout for these monitors were all set to 2880 seconds (I don't think I set this). I set them down to something more reasonable (a couple of seconds) and all monitors work again.

I didn't see any other posts about this and it may just affect me as I've been running Uptime Kuma for a long time now.

FIX: Update the timeout of your HTTP monitors to a new value.

👟 Reproduction steps

Have old monitors and upgrade to 1.23.4 (from 1.23.2 in my case).

👀 Expected behavior

Monitors will report being up.

😓 Actual Behavior

Monitors reported down with the status "canceled".

🐻 Uptime-Kuma Version

1.23.4

💻 Operating System and Arch

UBuntu 22.04

🌐 Browser

N/A

🐋 Docker Version

No response

🟩 NodeJS Version

No response

📝 Relevant log output

No response

RoiArthurB commented 11 months ago

Same here.

As a workaround, I noticed that enabling Ignore TLS/SSL error for HTTPS websites does fix this issue and makes the monitor works as before.

louislam commented 11 months ago

Ops, that is strange. I thought I handled it. Need to investigate.

I thought https://github.com/louislam/uptime-kuma/pull/4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

kWAYTV commented 11 months ago

Ops, that is strange. I thought I handled it. Need to investigate.

I thought #4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

Same issue here, clean first time install. Screenshot:

chakflying commented 11 months ago

I tried reproducing this by creating a new database in branch 1.21.X, adding HTTP monitors with intervals 60 and 360, then upgrading to 1.23.X, but it still works fine as expected.

Maybe you can post the config of the monitor here (masking out anything sensitive), and the node.js version used (or docker).

megosugit commented 11 months ago

Same here.

As a workaround, I noticed that enabling Ignore TLS/SSL error for HTTPS websites does fix this issue and makes the monitor works as before.

not for me :(

CommanderStorm commented 11 months ago

@chakflying noted in #4033 that

This may be a security fix in node.js v17+. Is the server running behind a corporate VPN/firewall?

(Related: https://github.com/nodejs/node/issues/45378, blog post)

Could the people affected please report:

if you are running behind a corporate firewall
what version of node/docker you are running on

chakflying commented 11 months ago

I don't think #4033 is related to this since that has a specific error message.

kokofixcomputers commented 11 months ago

Ops, that is strange. I thought I handled it. Need to investigate.

I thought #4011 was fixed issue, but clearly not.

Do you guys know how to use SQLite Expert to view data? I want to see what is inside your timeout field.

Thats a weird issue! So this only happened after upgrading to 1.23.4? and should i upgrade to 1.23.4? or should i wait?

JMyrng commented 11 months ago

After downgrading to 1.23.3 the monitors are up and running again.

megosugit commented 11 months ago

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ? @JMyrng :)

JMyrng commented 11 months ago

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ? @JMyrng :)

Yes, I've done exactly that.

megosugit commented 11 months ago

After downgrading to 1.23.3 the monitors are up and running again.

is it safe to downgrade while keeping all data ? Is it just a matter of pulling 1.23.3 and recreating the container ? @JMyrng :)

Yes, I've done exactly that.

thank you, I'm going to do the same as I keep receiving Cancelled notifications for no reason and my paused monitors get unpaused magically :/

chakflying commented 11 months ago

Monitors unpausing is another issue that should be fixed with #4032.

Again, for those who have experienced this exact issue (HTTP monitor getting "Cancelled" when it's actually up), please post your node.js version and OS (or docker) so we can help troubleshoot. 🙇🏻‍♂️

megosugit commented 11 months ago

I saw about #4032, good to know a fix is coming, thank you.

My versions: Docker 20.10.23 (API: 1.41) (running on Synology NAS) node.js 16.20.2

megosugit commented 11 months ago

I'm not sure if I had exactly the same issue regarding the Cancelled message. In my case, I started a new install with 1.23.3, then upgraded to 1.23.4. I then created a new HTTPS monitor to check a website that was currently down, and I kept receiving notifications endlessly until the website was finally up. In the GUI, it was looping between pending (orange) and down (red).

It shows repetitive Down/cancelled events:

louislam commented 11 months ago

Just release a beta version for fixing this issue:

louislam/uptime-kuma:1.23.5-beta.0

Would be appreciated if you guys could help to test it.

kWAYTV commented 11 months ago

Just release a beta version for fixing this issue:

louislam/uptime-kuma:1.23.5-beta.0

Would be appreciated if you guys could help to test it.

louislam commented 11 months ago

Just release a beta version for fixing this issue: louislam/uptime-kuma:1.23.5-beta.0 Would be appreciated if you guys could help to test it.

You did not use 1.23.5-beta.0

kWAYTV commented 11 months ago

Just release a beta version for fixing this issue: louislam/uptime-kuma:1.23.5-beta.0 Would be appreciated if you guys could help to test it.

You did not use 1.23.5-beta.0

my bad, now i am using it:

louislam commented 11 months ago

That's strange. I just released 1.23.6. Free feel to try.

If it is still not working. Please change the tag to 1.23.3.

JMyrng commented 11 months ago

That's strange. I just released 1.23.6. Free feel to try.

If it is still not working. Please change the tag to 1.23.3.

Installed 1.23.6 but issue is still there. So reverting to 1.23.3.

Could it be related to IPv6? Pushover fails to send notifications:

Error: Error: Error: connect ETIMEDOUT 2606:4700:10::6814:2bec:443 
    at Pushover.throwGeneralAxiosError (/app/server/notification-providers/notification-provider.js:38:15)
    at Pushover.send (/app/server/notification-providers/pushover.js:41:18)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Monitor.sendNotification (/app/server/model/monitor.js:1433:21)
    at async beat (/app/server/model/monitor.js:964:21)
    at async safeBeat (/app/server/model/monitor.js:1034:17)

Also there are services that are not affected by this strange "cancelled" bug. Both of them have only A Records.

Hit me up if there is anything else I could help you with!

louislam commented 11 months ago

@JMyrng Thank you so much.

I just created a new image for printing more info and +10 seconds for AbortSignal (#4053)

louislam/uptime-kuma:nightly

Now the log will output something like this:

2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: monitor id: 58
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: timeout: 16
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: interval: 20
2023-11-18T23:22:18+08:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 26000

The monitor id could be found in the browser address bar.

Would be appropriated if you could show the log here for the monitors that are having issues.

JMyrng commented 11 months ago

Would be appropriated if you could show the log here for the monitors that are having issues.

2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: monitor id: 11
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: timeout: 5
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: interval: 60
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 15000
2023-11-18T16:34:10+01:00 [MONITOR] WARN: Monitor #11 'RAE Tunnel': Pending: timeout of 5000ms exceeded | Max retries: 1 | Retry: 1 | Retry Interval: 20 seconds | Type: json-query

louislam commented 11 months ago

Would be appropriated if you could show the log here for the monitors that are having issues.

2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: monitor id: 11
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: timeout: 5
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: interval: 60
2023-11-18T16:34:05+01:00 [DEBUG_TIMEOUT] INFO: Using AbortSignal.timeout: 15000
2023-11-18T16:34:10+01:00 [MONITOR] WARN: Monitor #11 'RAE Tunnel': Pending: timeout of 5000ms exceeded | Max retries: 1 | Retry: 1 | Retry Interval: 20 seconds | Type: json-query

Wait, it seems that it is actually a real timeout that is longer than 5 seconds?

Could you show this graph here?

JMyrng commented 11 months ago

Screenshot_2023-11-18-17-18-20-70_40deb401b9ffe8e1df2f1cc5ba480b12.jpg

I'm on mobile right now, but the red/orange is the timespan while on nightly. Green is on old version.

louislam commented 11 months ago

Thanks for your help, which make me wondering, we might be looking into a wrong direction. It may be a Node.js 18 issue.

I will build two more docker images later.

louislam commented 11 months ago

Node 18 no Signal Patch: louislam/uptime-kuma:nightly-node18-no-signal
Node 16 with Signal Patch: louislam/uptime-kuma:nightly-node16-with-signal
Node 16 no Signal Patch (This should be similar to 1.23.3): louislam/uptime-kuma:nightly-node16-no-signal

Looking forward to the results. 👀

chakflying commented 11 months ago

This is another layered cake situation as I managed to reproduce the "Retries getting reset" part of the problem.

Since this is 1.23.X, it does not have the accurate interval fix (#3072). This means that any blocking timeout would count towards the beat interval.
When the beat interval is set to 60, and timeout is set to 48, part of that 48 seconds gets counted towards the beat interval.

In practice, this means that continuously timing-out would pass the threshold where checkMonitors would consider it as stuck.

[MONITOR_CHECKER] ERROR: Monitor Interval: 60 Monitor 1 lastStartBeatTime diff: 95
[MONITOR_CHECKER] ERROR: Unexpected error: Monitor 1 is struck for unknown reason

Hence the monitor is restarted, retries gets reset, and notifications get sent again.

I think we should either backport #3072, or increase the checkMonitors tolerance to prevent this.

louislam commented 11 months ago

You are right, the pattern is similar to @JMyrng's graph, assume that he did not restart the monitor manually.

kWAYTV commented 11 months ago

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

louislam commented 11 months ago

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

No, that comment was replied to @chakflying's comment.

You could try these images I mentioned in https://github.com/louislam/uptime-kuma/issues/4027#issuecomment-1817576022

kWAYTV commented 11 months ago

So i just have to restart the monitor? When i updated i completely removed the old & pulled out the new ver, i assumed that was enough, i'm using portainer tho

No, that comment was replied to @chakflying's comment.

You could try these images I mentioned in #4027 (comment)

I'm starting to think it's my container's issue cause i went into portainer container, click on console and then pinged google.com and these were the results

JMyrng commented 11 months ago

Node 18 no Signal Patch: louislam/uptime-kuma:nightly-node18-no-signal

Node 16 with Signal Patch: louislam/uptime-kuma:nightly-node16-with-signal

Node 16 no Signal Patch (This should be similar to 1.23.3): louislam/uptime-kuma:nightly-node16-no-signal

Looking forward to the results. 👀

nightly-node18-no-signal: same as here but it's not cancelled but timeout
nightly-node16-with-signal: is similar to 1.23.3, everything is up again and Pushover is working again
nightly-node16-no-signal: same as nightly-node16-with-signal

Has node 18 changed anything about the use of IPv6? My IPv4 only monitors are still not affected by this bug. I believe that IPv6 is not set up correctly in the container and the preferred variant has changed. I'll try to disable IPv6 later.

chakflying commented 11 months ago

Starting from node v17, the DNS lookup of node would preserve the IPv4/IPv6 priority returned by the resolver (nodejs/node#39987), whereas before it would always prefer IPv4. This means that compared to before, some connections will switch to using IPv6 when available.

JMyrng commented 11 months ago

Starting from node v17, the DNS lookup of node would preserve the IPv4/IPv6 priority returned by the resolver (nodejs/node#39987), whereas before it would always prefer IPv4. This means that compared to before, some connections will switch to using v6 when available.

How can I set dns.lookup verbatim option to false to test this?

JMyrng commented 11 months ago

I'll try to disable IPv6 later.

Got it disabled in the docker network, everything is working now with 'latest' tag. So in my case it was a incorrectly configured ipv6 setup.

louislam commented 11 months ago

@JMyrng Thanks for testing and finally found the problem. I will mention this in the changelog, as I unlikely downgrade back to Node.js 16.

@chakflying For https://github.com/louislam/uptime-kuma/issues/4027#issuecomment-1817579493, feel free to continue in a new issue.

ignatiusmadee commented 11 months ago

I have same issue, 1.23.6, sometimes got canceled status update from 1.23.3 to 1.23.6 node version 20.9.0 and then I check some of my web cert exp. not display

louislam / uptime-kuma