paulw11 / homebridge-elkm1

Homebridge plugin for the Elk M1 alarm panel
MIT License
7 stars 6 forks source link

Retry delay not working? #35

Open JustANewGuy opened 2 years ago

JustANewGuy commented 2 years ago

I recently upgrade to the latest version, sorry cant remember what I was running previously but it was quite old and have recently struck an issue that I originally thought was caused by powering my m1xep off a battery that was dying during the M1's daily battery test but must be something else that I need to track down.

The issue it showed was it looks like the retry timer doesnt appear to be working and it keeps smashing connections instead of backing off which then chews up cpu and I think memory, here is a very short example of what I see but sometimes it has gone on for hours

[3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 5s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 10s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 20s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 40s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 80s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 160s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 320s [3/23/2022, 5:14:12 AM] [ElkM1G] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: DZS22220022222220200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000061. Will retry in 640s [3/23/2022, 5:14:17 AM] [ElkM1G] Attempting to connect to Elk M1 [3/23/2022, 5:14:17 AM] [ElkM1G] Connected

From the stauts page

Node.js Version | v16.14.0 NPM version v8.3.1 Homebridge V1.4.0 homebridge-elkm1 v3.0.7 running one other real plugin homebridge-cbus v1.1.0

bacoboy commented 1 year ago

It also appears to cache the same items over and over (even though they have the same UUIDs):

Screenshot 2023-10-13 at 10 40 44 AM

Running into same issue here. Sometimes getting a node memory leak error tossed in as well:

[13/10/2023, 11:09:48] [Alarm] Startup complete
[13/10/2023, 11:09:52] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:09:52] [Alarm] ***Connected***
[13/10/2023, 11:09:53] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:09:53] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:09:53] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:09:53] [Alarm] Startup complete
[13/10/2023, 11:09:58] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:09:58] [Alarm] ***Connected***
[13/10/2023, 11:09:58] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:09:58] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:09:58] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:09:58] [Alarm] Startup complete
(node:8686) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 AS listeners added to [Elk]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
[13/10/2023, 11:10:03] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:10:03] [Alarm] ***Connected***
[13/10/2023, 11:10:03] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:10:03] [Alarm] Restoring existing panel from cache: Area 2
[13/10/2023, 11:10:03] [Alarm] Restoring existing accessory from cache: FRONT DOOR
[13/10/2023, 11:10:03] [Alarm] Startup complete
...

versions:

Node.js Version | v18.18.0
npm@9.8.1
Homebridge V1.6.1
homebridge-elkm1@3.0.7
bacoboy commented 1 year ago

Turning on debug logging, I wonder if the size of this message is the issue:

[13/10/2023, 11:51:48] [Alarm] Attempting to connect to Elk M1
[13/10/2023, 11:51:48] [Alarm] ***Connected***
[13/10/2023, 11:51:48] [Alarm] ZoneStatusReport {
  message: 'D6ZS33000330333330002222222222220006',
  body: '330003303333300022222222222200',
  type: 'ZS',
  hexLength: 'D6',
  checkSum: '06',
  zones: [
    { id: 1, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 2, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 3, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 4, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 5, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 6, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 7, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 8, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 9, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 10, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 11, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 12, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 13, physicalStatus: 'Short', logicalState: 'Normal' },
    { id: 14, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 15, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 16, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 17, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 18, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 19, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 20, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 21, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 22, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 23, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 24, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 25, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 26, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 27, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 28, physicalStatus: 'EOL', logicalState: 'Normal' },
    { id: 29, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 30, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 31, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 32, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 33, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 34, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 35, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 36, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 37, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 38, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 39, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 40, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 41, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 42, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 43, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 44, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 45, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 46, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 47, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 48, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 49, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 50, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 51, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 52, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 53, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 54, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 55, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 56, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 57, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 58, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 59, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 60, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 61, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 62, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 63, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 64, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 65, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 66, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 67, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 68, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 69, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 70, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 71, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 72, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 73, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 74, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 75, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 76, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 77, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 78, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 79, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 80, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 81, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 82, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 83, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 84, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 85, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 86, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 87, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 88, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 89, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 90, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 91, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 92, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 93, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 94, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 95, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 96, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 97, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 98, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 99, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    { id: 100, physicalStatus: 'Unconfigured', logicalState: 'Normal' },
    ... 108 more items
  ]
}
[13/10/2023, 11:51:48] [Alarm] Error connecting to ElkM1 TypeError: The calculated checksum does not match the checksum on the received message: . Will retry in 5s
[13/10/2023, 11:51:48] [Alarm] Requesting area description

Is it possible the ... 108 more items is somehow part of the checksum calculation -- where somebody used a trimmed debug/human output rather than the raw value?

bacoboy commented 1 year ago

Or the lack of output after the colon here (compared to the original example above) seems suspicious:

received message: . Will retry

Seems like all those zeros are in there:

0001460       t   h   e       r   e   c   e   i   v   e   d       m   e
           7420    6568    7220    6365    6965    6576    2064    656d
0001500   s   s   a   g   e   :      \0  \0  \0  \0  \0  \0  \0  \0  \0
           7373    6761    3a65    0020    0000    0000    0000    0000
0001520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
           0000    0000    0000    0000    0000    0000    0000    0000
*
0001760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0   .       W   i   l
           0000    0000    0000    0000    0000    2e00    5720    6c69
0002000   l       r   e   t   r   y       i   n       5   s 033   [   3
bacoboy commented 1 year ago

OK responding back to this thread to the next person who happens to google this. I don't think my specific issue of a shorter than expected Zone Report is specific to this plugin, or even the underlying elkm1 library that does the ASCII to SDK translation.

In my case, and I don't know why this is, I had this working and then it stopped working. At some point the status reports went from being the required 208 characters to what you saw above with a bunch of null 0x00 instead of ascii representation of zeros 0x3030 which caused the underlying library to error out.

A hard powerdown/restart of the alarm system (and the ethernet module) seems to have restored things.

Before:

06zs004D
D6ZS33000330333330002222222222220006 (has invisible null as you can see in following pipe to `od`)

od of raw result before:


Or the lack of output after the colon here (compared to the original example above) seems suspicious:

received message: . Will retry
Seems like all those zeros are in there:

0001460       t   h   e       r   e   c   e   i   v   e   d       m   e
           7420    6568    7220    6365    6965    6576    2064    656d
0001500   s   s   a   g   e   :      \0  \0  \0  \0  \0  \0  \0  \0  \0
           7373    6761    3a65    0020    0000    0000    0000    0000
0001520  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
           0000    0000    0000    0000    0000    0000    0000    0000
*
0001760  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0   .       W   i   l
           0000    0000    0000    0000    0000    2e00    5720    6c69
0002000   l       r   e   t   r   y       i   n       5   s 033   [   3

After restart:

06zs004D
D6ZS3300033033333000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005C

This is on the wire which means the underlying homebridge wrappers are not to blame here.

That said, I still think the handling of the error putting the code into a rapid restart/recaching over and over is still a legit ask here on this PR.

Cheers.

tagdara commented 1 year ago

I ran into what I think is a similar issue over the weekend. Something happened in Elk land and once the connection was lost and not immediately re-established, this plugin spun out of control effectively DOS'ing the elk. I have a separate automation integration that uses port 2601 and it was also unable to connect reliably due to the constant retries.

I shut down the plug-in, let everything else run for 24 hours to ensure that was the source, and now turned it back on but I'm worried that it will happen again.

The underlying issue, I agree, is the retry cycle must have a working delay in order to avoid this situation.

tagdara commented 1 year ago

There are two problems that I can see here:

  1. Is with the doubling of the retry timer on each error eventually growing larger than a signed int can handle.

platform.ts:

        this.elk.on('error', (err) => {
            this.log.error(`Error connecting to ElkM1 ${err}. Will retry in ${this.retryDelay/1000}s`);
            setTimeout(() => {
                this.connect();
            }, this.retryDelay);
            this.retryDelay = this.retryDelay * 2;
        });

When trying to back-off on re-attempts, it increases the initial delay (5000ms) by 2x on each error. Unfortunately if the Elk is having an issue, it doesn't take long to exceed the size of an int variable and you start to get these errors in the log:

(node:157) TimeoutOverflowWarning: kube62 homebridge fit into a 32-bit signed integer. 
[11/4/2023, 5:59:54 PM] kube62 homebridge connecting to ElkM1 ETIMEDOUT. Will retry in Infinitys |  
Timeout duration kube62 homebridge 1.

The fix would be to cap it at about 30 seconds which is very reasonable for polling a local device in an error state without spamming it:

this.retryDelay = Math.min(this.retryDelay * 2, 30000)

  1. Which I think is what the OP ran into is with the underlying 'elkmon' NPM package that this plugin is based on. It also automatically retries the connection on an error - but ONLY if it was due to ECONNRESET - and does not have a delay.

https://github.com/sgentry/elkmon/blob/master/src/index.ts


 // error event handler
    this.connection.on('error', (err) => {
      if (err.code === 'ECONNREFUSED') {
        this.emit('error', 'Connection to M1XEP failed!');
      } else if (err.code === 'ECONNRESET') {
        this.emit('error', err.code);
        // connection was reset, attempt to reconnect
        if (this.connection) {
          this.connection.destroy();
        }
        this.connect();
      } else {
        this.emit('error', err.code);
      }
    });

This should be more consistent in that package (either retrying or not) as well, but unfortunately there is no way to throttle this through just the Homebridge plugin and a PR would need to be added over there as well. I submitted a PR here for the local fix, and opened an issue on elkmon for the second:

https://github.com/sgentry/elkmon/issues/11