JustANewGuy opened this issue 2 years ago
Running into the same issue here; sometimes getting a Node memory-leak warning tossed in as well. It also appears to cache the same items over and over (even though they have the same UUIDs):
[13/10/2023, 11:09:48] [Alarm] Startup complete
[13/10/2023, 11:09:52] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:09:52] [Alarm] ***Connected***
[13/10/2023, 11:09:53] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:09:53] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:09:53] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:09:53] [Alarm] Startup complete
[13/10/2023, 11:09:58] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:09:58] [Alarm] ***Connected***
[13/10/2023, 11:09:58] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:09:58] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:09:58] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:09:58] [Alarm] Startup complete
(node:8686) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 AS listeners added to [Elk]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
[13/10/2023, 11:10:03] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:10:03] [Alarm] ***Connected***
[13/10/2023, 11:10:03] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:10:03] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:10:03] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:10:03] [Alarm] Startup complete
...
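That MaxListenersExceededWarning usually means event handlers are being re-registered on every retry attempt rather than once. A minimal standalone sketch of the pattern (hypothetical names, not the plugin's actual code; the real handler is quoted further down the thread):

import { EventEmitter } from 'node:events';

// Stand-in for the Elk connection object named in the warning.
const elk = new EventEmitter();

function connect(): void {
    // Bug pattern: a fresh listener is attached on every (re)connect attempt
    // and never removed, so listeners pile up across retries.
    elk.on('error', (err) => {
        console.error(`will retry after: ${err}`);
    });
}

// Eleven failed attempts are enough: Node warns when the 11th listener for
// the same event is added (the default cap is 10).
for (let i = 0; i < 11; i++) {
    connect();
}
// Fix pattern: register handlers once outside connect(), or call
// elk.removeAllListeners('error') before re-attaching.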
versions:
Node.js v18.18.0
npm v9.8.1
Homebridge v1.6.1
homebridge-elkm1 v3.0.7
After turning on debug logging, I wonder if the size of this message is the issue:
[13/10/2023, 11:51:48] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:51:48] [Alarm] ***Connected***
[13/10/2023, 11:51:48] [Alarm] ZoneStatusReport {
message: 'D6ZS33000330333330002222222222220006',
body: '330003303333300022222222222200',
type: 'ZS',
hexLength: 'D6',
checkSum: '06',
zones: [
{ id: 1, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 2, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 3, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 4, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 5, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 6, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 7, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 8, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 9, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 10, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 11, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 12, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 13, physicalStatus: 'Short', logicalState: 'Normal' },
{ id: 14, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 15, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 16, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 17, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 18, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 19, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 20, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 21, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 22, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 23, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 24, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 25, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 26, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 27, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 28, physicalStatus: 'EOL', logicalState: 'Normal' },
{ id: 29, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 30, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 31, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 32, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 33, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 34, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 35, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 36, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 37, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 38, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 39, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 40, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 41, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 42, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 43, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 44, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 45, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 46, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 47, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 48, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 49, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 50, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 51, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 52, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 53, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 54, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 55, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 56, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 57, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 58, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 59, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 60, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 61, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 62, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 63, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 64, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 65, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 66, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 67, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 68, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 69, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 70, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 71, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 72, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 73, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 74, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 75, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 76, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 77, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 78, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 79, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 80, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 81, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 82, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 83, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 84, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 85, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 86, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 87, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 88, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 89, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 90, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 91, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 92, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 93, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 94, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 95, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 96, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 97, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 98, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 99, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
{ id: 100, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
... 108 more items
]
}
[13/10/2023, 11:51:48] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:51:48] [Alarm] Requesting area description
Is it possible the `... 108 more items` is somehow part of the checksum calculation -- i.e., somebody used the trimmed debug/human-readable output rather than the raw value?
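For what it's worth, the `... 108 more items` is only console truncation: `util.inspect` (which `console.log` uses for objects) prints at most 100 array elements by default, and the panel reports 208 zones, so it shouldn't be feeding the checksum. The protocol checksum itself is the two's complement of the mod-256 sum of the ASCII codes of everything before it. A standalone sketch illustrating both (`elkChecksum` is a hypothetical helper, not the library's actual code):

import { inspect } from 'node:util';

// util.inspect elides array elements past 100 by default: 208 - 100 = 108.
const zones = Array.from({ length: 208 }, (_, i) => ({ id: i + 1 }));
console.log(inspect(zones).includes('... 108 more items')); // true

// Elk ASCII protocol checksum: the two's complement of the modulo-256 sum of
// the ASCII codes of every character preceding the checksum, rendered as two
// uppercase hex digits.
function elkChecksum(messageWithoutChecksum: string): string {
    let sum = 0;
    for (const ch of messageWithoutChecksum) {
        sum += ch.charCodeAt(0);
    }
    return ((256 - (sum % 256)) % 256).toString(16).toUpperCase().padStart(2, '0');
}

// The zone-status request from the debug log checks out: '06zs00' + '4D'.
console.log(elkChecksum('06zs00')); // '4D'
// A null byte (0x00) contributes nothing to the sum where an ASCII '0'
// (0x30) contributes 48, so nulls in place of zeros guarantee a mismatch.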
Or the lack of output after the colon here (compared to the original example above) seems suspicious:
received message: . Will retry
Seems like all those zeros are in there:
0001460 t h e r e c e i v e d m e
7420 6568 7220 6365 6965 6576 2064 656d
0001500 s s a g e : \0 \0 \0 \0 \0 \0 \0 \0 \0
7373 6761 3a65 0020 0000 0000 0000 0000
0001520 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000 0000 0000 0000 0000 0000 0000 0000
*
0001760 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 . W i l
0000 0000 0000 0000 0000 2e00 5720 6c69
0002000 l r e t r y i n 5 s 033 [ 3
OK, responding back to this thread for the next person who happens to Google this. I don't think my specific issue of a shorter-than-expected Zone Report is specific to this plugin, or even to the underlying elkmon library that does the ASCII-to-SDK translation.
In my case (and I don't know why), I had this working and then it stopped working. At some point the status reports went from being the required 208 characters to what you saw above, with a run of null bytes (0x00) instead of the ASCII representation of zeros (0x30), which caused the underlying library to error out.
A hard power-down/restart of the alarm system (and the Ethernet module) seems to have restored things.
Before:
06zs004D
D6ZS33000330333330002222222222220006 (has invisible nulls, as you can see in the following pipe to `od`)
od of raw result before:
0001460 t h e r e c e i v e d m e
7420 6568 7220 6365 6965 6576 2064 656d
0001500 s s a g e : \0 \0 \0 \0 \0 \0 \0 \0 \0
7373 6761 3a65 0020 0000 0000 0000 0000
0001520 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000 0000 0000 0000 0000 0000 0000 0000
*
0001760 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 . W i l
0000 0000 0000 0000 0000 2e00 5720 6c69
0002000 l r e t r y i n 5 s 033 [ 3
After restart:
06zs004D
D6ZS3300033033333000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005C
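As an aside, this kind of corruption is detectable even before the checksum: the first two characters of an Elk ASCII message declare, in hex, the length of everything that follows them ('06' = 6 for the short request, 'D6' = 214 for the full report). A hedged sketch (`declaredLengthMatches` is a hypothetical helper):

// Reject truncated or null-stripped reports by checking the declared length
// before attempting the checksum.
function declaredLengthMatches(msg: string): boolean {
    const declared = parseInt(msg.slice(0, 2), 16);
    return !Number.isNaN(declared) && msg.length - 2 === declared;
}

console.log(declaredLengthMatches('06zs004D')); // true: 'zs004D' is 6 chars
console.log(declaredLengthMatches('D6ZS33000330333330002222222222220006')); // false: 214 declared, only 34 present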
This is on the wire, which means the underlying Homebridge wrappers are not to blame here.
That said, I still think fixing the error handling that puts the code into a rapid restart/re-caching loop over and over is a legitimate ask here on this issue.
Cheers.
I ran into what I think is a similar issue over the weekend. Something happened in Elk land, and once the connection was lost and not immediately re-established, this plugin spun out of control, effectively DoS'ing the Elk. I have a separate automation integration that uses port 2601, and it was also unable to connect reliably due to the constant retries.
I shut down the plugin, let everything else run for 24 hours to confirm it was the source, and have now turned it back on, but I'm worried that it will happen again.
The underlying issue, I agree, is that the retry cycle must have a working delay in order to avoid this situation.
There are two problems that I can see here:
The first is in platform.ts:
this.elk.on('error', (err) => {
    this.log.error(`Error connecting to ElkM1 ${err}. Will retry in ${this.retryDelay / 1000}s`);
    setTimeout(() => {
        this.connect();
    }, this.retryDelay);
    this.retryDelay = this.retryDelay * 2;
});
When backing off on re-attempts, it doubles the delay (initially 5000 ms) on each error. Unfortunately, if the Elk is having a prolonged issue, it doesn't take long to overflow the 32-bit signed integer of milliseconds that setTimeout accepts, and you start to get these errors in the log:
(node:157) TimeoutOverflowWarning: Infinity does not fit into a 32-bit signed integer.
Timeout duration was set to 1.
[11/4/2023, 5:59:54 PM] Error connecting to ElkM1 ETIMEDOUT. Will retry in Infinitys
Note the "Timeout duration was set to 1": once the delay overflows, Node clamps it to 1 ms, so instead of backing off, the plugin retries as fast as it can.
The fix would be to cap it at about 30 seconds, which is very reasonable for polling a local device in an error state without spamming it:
this.retryDelay = Math.min(this.retryDelay * 2, 30000);
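For completeness, a sketch of the whole handler with the cap applied (same context as the snippet above; INITIAL_RETRY_DELAY and MAX_RETRY_DELAY are hypothetical names, and resetting the delay in a 'connected' handler is an assumption about the surrounding code, suggested by the plugin's ***Connected*** log line):

// Sketch only: capped exponential backoff in the platform.ts error handler.
const INITIAL_RETRY_DELAY = 5000;  // ms, the plugin's starting delay
const MAX_RETRY_DELAY = 30000;     // ms, the cap suggested above

this.elk.on('error', (err) => {
    this.log.error(`Error connecting to ElkM1 ${err}. Will retry in ${this.retryDelay / 1000}s`);
    setTimeout(() => this.connect(), this.retryDelay);
    // Double the delay, but never past the cap, so setTimeout's 32-bit
    // millisecond limit can no longer overflow into a 1 ms retry storm.
    this.retryDelay = Math.min(this.retryDelay * 2, MAX_RETRY_DELAY);
});

this.elk.on('connected', () => {
    // Assumption: reset the backoff once a connection succeeds, so a later
    // outage starts again from 5 s rather than the capped 30 s.
    this.retryDelay = INITIAL_RETRY_DELAY;
});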
The second is in the underlying elkmon package:
https://github.com/sgentry/elkmon/blob/master/src/index.ts
// error event handler
this.connection.on('error', (err) => {
    if (err.code === 'ECONNREFUSED') {
        this.emit('error', 'Connection to M1XEP failed!');
    } else if (err.code === 'ECONNRESET') {
        this.emit('error', err.code);
        // connection was reset, attempt to reconnect
        if (this.connection) {
            this.connection.destroy();
        }
        this.connect();
    } else {
        this.emit('error', err.code);
    }
});
This should be made more consistent in that package as well (either retrying or not), but unfortunately there is no way to throttle it from the Homebridge plugin alone, so a change would need to land over there too. I submitted a PR here for the local fix, and opened an issue on elkmon for the second:
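For reference, the elkmon-side change might look something like this (a sketch only, not a drop-in patch; RECONNECT_DELAY is a hypothetical constant, not part of elkmon today):

// Sketch: give elkmon's automatic ECONNRESET reconnect a delay instead of
// calling this.connect() synchronously from the error handler.
const RECONNECT_DELAY = 5000; // ms, hypothetical

this.connection.on('error', (err) => {
    if (err.code === 'ECONNREFUSED') {
        this.emit('error', 'Connection to M1XEP failed!');
    } else if (err.code === 'ECONNRESET') {
        this.emit('error', err.code);
        if (this.connection) {
            this.connection.destroy();
        }
        // Wait before reconnecting so a flapping panel isn't hammered.
        setTimeout(() => this.connect(), RECONNECT_DELAY);
    } else {
        this.emit('error', err.code);
    }
});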
I recently upgraded to the latest version (sorry, I can't remember what I was running previously, but it was quite old) and have recently struck an issue that I originally thought was caused by powering my M1XEP off a battery that was dying during the M1's daily battery test, but it must be something else that I need to track down.
The symptom is that the retry timer doesn't appear to be working: it keeps hammering connections instead of backing off, which then chews up CPU and, I think, memory. Here is a very short example of what I see, but sometimes it has gone on for hours:
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 5s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 10s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 20s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 40s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 80s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 160s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 320s
[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 640s
[3/23/2022, 5:14:17 AM] [ElkM1G] Attempting to connect to Elk M1
[3/23/2022, 5:14:17 AM] [ElkM1G] Connected
From the status page:
Node.js v16.14.0
npm v8.3.1
Homebridge v1.4.0
homebridge-elkm1 v3.0.7
Also running one other plugin: homebridge-cbus v1.1.0