tagyoureit / nodejs-poolController

An application to control pool equipment from various manufacturers.
GNU Affero General Public License v3.0
315 stars 94 forks source link

Probe for com status #636

Closed snagytx closed 1 year ago

snagytx commented 1 year ago

💥 Proposal

What feature you'd like to see

An API call which can be accessed and used in docker containers as health check

Motivation

Heath check monitor is more of a workarround for environment specific issue, or other "bugs" in the product,

For example from time to time I get:

poolcontroller              | [11/15/2022, 1:01:53 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:01:53 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/15/2022, 1:01:53 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:01:53 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/15/2022, 1:01:53 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:03 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:03 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/15/2022, 1:02:03 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:03 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/15/2022, 1:02:03 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:13 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:13 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/15/2022, 1:02:13 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 1:02:13 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/15/2022, 1:02:13 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899

but once I restart the container the issue goes away. I had this errors logged for 5-6 hours, overnight, and never recovered until I restarted the container.

The health check API should report if there were 2-3 reconnections and no message communication. The connection idle seems to work fine but in certain cases it doesn't catch the above mentioned case.

Pitch

in docker container I would take advantage of this healts check and trigger a container restart. This would cleanly re-initialize everything.

rstrouse commented 1 year ago

What device do you have at the other end of this? Is this an Elfin EW11? This pattern is because the other end of the socket is closing without an error. Here is another discussion that I had with someone regarding the setup of this device. It appears that the EW11 leaves half-open sockets by default and periodically closes them. See this discussion on TFP. https://www.troublefreepool.com/threads/wiring-elfin-ew11-to-pentair-easytouch.253376/

I also added an endpoint to get the status. Simply call state/rs485Port/0 for the primary port and it will return all the current information related to the port. If you want any of the aux ports use state/rs485Port/1 ...etc.

snagytx commented 1 year ago

Yes, I have an Elfin EW11, I will check the thread and get back to you.

snagytx commented 1 year ago

Yes, my configuration was the same as in that thread. I guess a njspc restart would restart or give EW11 an extra time to breath and would recover.

I'll see how I can use the new API once the new docker image is built. Thank you for adding it.

snagytx commented 1 year ago

In case somebody wants to implement something just like me in docker environment:

Create a file healthcheck.js, I mounted it under /app/data/healthcheck.js, because it was convenient.

For start I mark the healthcheck fail if there are more than 10 reconnects.

The content of the file:

const http = require('http')

const options = {
  host: 'localhost',
  port: '4200',
  timeout: 2000,
  path: '/state/rs485Port/0',
}

const request = http.request(options, (res) => {
  // console.log(`STATUS: ${res.statusCode}`)
  if (res.statusCode === 200) {
    let data = '';
    res.on('data', (chunk) => {
      data += chunk;
    });
    res.on('end', async () => {
      let reconnects = JSON.parse(data).reconnects
      console.log(`reconnects: ${reconnects}`)
      if(reconnects < 10){
        process.exit(0)
      }else{
        process.exit(1)
      }
    });
  } else {
    process.exit(1)
  }
})

request.on('error', function (err) {
  console.log('ERROR', err)
  process.exit(1)
})

request.end()

In docker-compose.yaml I added:

    healthcheck:
      test: [ "CMD", "node", "/app/data/healthcheck.js" ]
      timeout: 5s
      interval: 10s
      retries: 3

This checks every 10 seconds, with a 5 seconds time out, with 3 retires - if all retries fail the container would be marked as unhealthy.

Once the container was recreated I ran the following command to see the healthcheck.js script output:

docker inspect --format "{{json .State.Health}}" poolcontroller | jq

poolcontroller is the name of my njspc container.

To restart unhealthy containers you would need to use willfarrell/autoheal. My docker-compose.yaml file is:

version: '3.3'
services:
    autoheal:
        container_name: autoheal
        restart: always
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock
        environment:
            - AUTOHEAL_CONTAINER_LABEL=all
        image: willfarrell/autoheal

When a container is going unhealthy the following log message is seen in the autoheal:

16-11-2022 16:04:49 Container /poolcontroller (47cc1f486e2c) found to be unhealthy - Restarting container now with 10s timeout
tagyoureit commented 1 year ago

Very creative on the health script. Just curious, when the Elfin is not responding, have you tried a simple save of the RS485 port? That should close and reopen the port again and may either re-establish the socket connections. If that is true, then we should be able to monitor for the same inside the app and fix it without restarting the container. The Elfin definitely seems to be problematic but if re-establishing the connection after 10 reconnects works restarting the container seems like overkill.

snagytx commented 1 year ago

When the messaged in the first post are seen, I did try to turn on/off the lights but I got "Comms port 0 is not open. Message aborted:"

poolcontroller              | [11/15/2022, 11:44:51 AM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:01 AM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:01 AM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/15/2022, 11:45:01 AM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:01 AM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/15/2022, 11:45:01 AM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:08 AM] info: [11:45:08 AM] 10.10.10.18 PUT /state/circuit/setState {"id":5,"state":false}
poolcontroller              | [11/15/2022, 11:45:08 AM] warn: Comms port 0 is not open. Message aborted: 165,1,15,16,168,36,15,0,0,32,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,1,0,5,173
poolcontroller              | [11/15/2022, 11:45:08 AM] warn: Message aborted after 0 attempt(s): 165,1,15,16,168,36,15,0,0,32,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,1,0,5,173
poolcontroller              | [11/15/2022, 11:45:08 AM] error: Comms port 0 is not open. Message aborted 165,1,15,16,168,36,15,0,0,32,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,255,255,255,255,0,0,1,0,5,173
poolcontroller              | [11/15/2022, 11:45:11 AM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:11 AM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/15/2022, 11:45:11 AM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/15/2022, 11:45:11 AM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/15/2022, 11:45:11 AM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
tagyoureit commented 1 year ago

Yes, that is expected. I meant to go to the Gears (upper right)-> Controller -> RS485 Port and then click Save Port (with no changes). This may reset the port.

image

snagytx commented 1 year ago

OK, let me disable the healthcheck and wait for it to come back and give it a try

snagytx commented 1 year ago

It seems that the issue is not resolved if try to click on "Save Port"

poolcontroller              | [11/17/2022, 1:19:53 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:00 PM] info: [1:20:00 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 1:20:03 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:03 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 1:20:03 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:03 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 1:20:03 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:07 PM] info: [1:20:07 PM] 10.10.10.21 PUT /app/rs485Port {"enabled":true,"netConnect":true,"mockPort":false,"rs485Port":"/dev/ttyUSB0","netPort":8899,"netHost":"10.10.10.64","portId":0,"portSettings":{"baudRate":9600,"dataBits":8,"parity":"none","stopBits":1},"inactivityRetry":10}
poolcontroller              | [11/17/2022, 1:20:07 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:07 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 1:20:07 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:07 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 1:20:07 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:10 PM] info: [1:20:10 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 1:20:17 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:17 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 1:20:17 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:17 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 1:20:17 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:21 PM] info: [1:20:21 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 1:20:23 PM] info: [1:20:23 PM] 10.10.10.21 PUT /app/rs485Port {"enabled":true,"netConnect":true,"mockPort":false,"rs485Port":"/dev/ttyUSB0","netPort":8899,"netHost":"10.10.10.64","portId":0,"portSettings":{"baudRate":9600,"dataBits":8,"parity":"none","stopBits":1},"inactivityRetry":11}
poolcontroller              | [11/17/2022, 1:20:23 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:23 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 1:20:23 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:23 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 1:20:23 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:31 PM] info: [1:20:31 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 1:20:34 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:34 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 1:20:34 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 1:20:34 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 1:20:34 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899

I even tried to increase the Inactivity Timeout and then save, just to have a chance.

The strange part is that the reconnects count does not increase:

image

The Received and Successful numbers keep increasing, but none of the other numbers change.

snagytx commented 1 year ago

Reviewing the logs when the issue started I see:

poolcontroller              | [11/17/2022, 12:25:14 PM] info: [12:25:14 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:25:25 PM] info: [12:25:25 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:25:35 PM] info: [12:25:35 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:25:45 PM] info: [12:25:45 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:25:49 PM] warn: Net connect (socat) connection idle: 10.10.10.64:8899 retrying connection.
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) 0 request close: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) 0 sent FIN packet: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) destroy socket: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) 0 successfully closed: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 12:25:49 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:25:56 PM] info: [12:25:56 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:26:06 PM] info: [12:26:06 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:26:16 PM] info: [12:26:16 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:26:26 PM] info: [12:26:26 PM] 127.0.0.1 GET /state/rs485Port/0 {}
....... prev line repeated .......
poolcontroller              | [11/17/2022, 12:33:28 PM] info: [12:33:28 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:33:38 PM] info: [12:33:38 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:33:48 PM] info: [12:33:48 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:33:58 PM] info: [12:33:58 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:34:07 PM] warn: Net connect (socat) connection idle: 10.10.10.64:8899 retrying connection.
poolcontroller              | [11/17/2022, 12:34:07 PM] info: Net connect (socat) 0 request close: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:07 PM] info: Net connect (socat) 0 sent FIN packet: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:07 PM] info: Net connect (socat) destroy socket: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:07 PM] info: Net connect (socat) 0 successfully closed: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:09 PM] info: [12:34:09 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:34:17 PM] warn: Net connect (socat) connection idle: 10.10.10.64:8899 retrying connection.
poolcontroller              | [11/17/2022, 12:34:19 PM] info: [12:34:19 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:34:27 PM] warn: Net connect (socat) connection idle: 10.10.10.64:8899 retrying connection.
poolcontroller              | [11/17/2022, 12:34:29 PM] info: [12:34:29 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 12:34:37 PM] warn: Net connect (socat) connection idle: 10.10.10.64:8899 retrying connection.
poolcontroller              | [11/17/2022, 12:34:37 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:37 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 12:34:37 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:39 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:39 PM] warn: Net connect (socat) 0 recovered from lost connection.
poolcontroller              | [11/17/2022, 12:34:39 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 12:34:39 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 12:34:39 PM] info: [12:34:39 PM] 127.0.0.1 GET /state/rs485Port/0 {}
rstrouse commented 1 year ago

Ok what this is telling me is that there are a limited number of sockets on the EW11 and it does not process the FIN to destroy the socket and is probably waiting on some timeout in their code to tear it down. Depending on your equipment, the bus my go idle for longer than 11 seconds that you have it set to. In many circumstances that is an eternity but if the equipment on the bus is not chatty then it is likely the bus is silent for periods of 11 seconds or more.

Increase the inactivity timeout or set it to 0 (disabled) so that it lets the EW11 expire its socket on the timeout.

snagytx commented 1 year ago

When everything works, I see multiple messages maybe every second, in the messageManager.html.

Because it takes about 13 seconds to restart the container, I will set my timeout to 30 seconds.

time docker-compose restart poolcontroller
Restarting poolcontroller ... done

real    0m12.783s
user    0m1.171s
sys     0m0.139s

If 13 seconds is enough for connection to re-establish, when I restart the container, I think 30 seconds should be enough for njspc to reconnect.

rstrouse commented 1 year ago

You would think that to be true. One thing that could be going on here is that the wifi is switching channels on you or the connection gets interference for short periods. That could be making the EW11 orphan (half-close) the socket. Are there timeout settings on the EW11 configuration page? I wonder if decreasing the timeout on the adapter could improve the situation.

I wonder if there is a keep alive setting that does a better job at cleaning up its orphans.

snagytx commented 1 year ago

These are the settings that I can have for the TCP Server:

image

I updated it to:

image

rstrouse commented 1 year ago

Let us know how these changes work out. I assume that the docs are below average on what the numbers mean. I'll take a stab at the purpose of each of these (but I could be wrong).

  1. Keep Alive(s) is probably the number of seconds to send a PING.
  2. Timeout(s) is likely either the connection timeout or a timeout for killing dead sockets.
  3. Max Accept has to be the maximum number of connections that can be made. Do you by chance have MQTT enabled on the adapter?
snagytx commented 1 year ago

with Timeout set to 15 I get:

poolcontroller              | [11/17/2022, 7:42:37 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 7:42:37 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 7:42:37 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 7:42:42 PM] info: [7:42:42 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:42:52 PM] info: [7:42:52 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:42:53 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 7:42:53 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899

Increased the Timeout to 60 and I got:

poolcontroller              | [11/17/2022, 7:44:40 PM] info: Net connect (socat) 0 connected to: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 7:44:40 PM] info: Net connect (socat) Connection 0 connected
poolcontroller              | [11/17/2022, 7:44:40 PM] info: Net connect (socat) 0 ready and communicating: 10.10.10.64:8899
poolcontroller              | [11/17/2022, 7:44:45 PM] info: [7:44:45 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:44:55 PM] info: [7:44:55 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:45:06 PM] info: [7:45:06 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:45:16 PM] info: [7:45:16 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:45:26 PM] info: [7:45:26 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:45:36 PM] info: [7:45:36 PM] 127.0.0.1 GET /state/rs485Port/0 {}
poolcontroller              | [11/17/2022, 7:45:41 PM] info: Net connect (socat) 0 end event was fired
poolcontroller              | [11/17/2022, 7:45:41 PM] info: Net connect (socat) 0 closed : 10.10.10.64:8899

Based on this, is seems that Timeout is the max life of a connection.

I will keep Keep Alive at 15 and keep an eye on it.

I don't have MQTT setup on the adapter. All I have is this one "TCP Server"

rstrouse commented 1 year ago

Well the timeout setting is pretty much good for nothing. One would think that it would reset with communication on the socket. The good news is that it knows how to send an end event to cleanly close the socket in njsPC though.

snagytx commented 1 year ago

This is what is says about that setting:

image

Since njsPC is not sending any data, it will close the socket after timeout. Not sure if njsPC sends any keep-alive, maybe that would help, or that is not considered data?

rstrouse commented 1 year ago

Keep alives not considered data and that negotiation is done by the server. So it only looks at the send side of the socket. In this case we are connected to EasyTouch... correct?

snagytx commented 1 year ago

Yes, EW11 connects into to IntelliCenter's RS485 network, and njsPC connects to EW11 over WiFi.

tagyoureit commented 1 year ago

Have you, @snagytx, (or anyone else @rstrouse) tried the TCP server with Socat?

rstrouse commented 1 year ago

@snagytx there is the rub with this setting. njsPC must remain silent when it is not the master. It should only speak when it either has something to say or it is asked about something. So because the timeout is not related to any successful outgoing traffic or PONGs it will timeout.

@tagyoureit yes I have a socat running on RPi but this is different. This little device is just a microcontroller (probably STM or ESP) running its own socket server. The issue here is getting the right settings. Unfortunately, it half closes a socket when njsPC doesn't get any data for a while and the timeout is fired. This is because njsPC will send a FIN packet but the adapter doesn't close the socket completely. It only closes the receive half. This is the problem.

What happens here is njsPC stops listening on that socket but the adapter continues to hold onto that connection. Since there are only 3 open sockets allowed it eventually stops opening new full connections and only reattaches to the old half open one.

snagytx commented 1 year ago

I was not successful to setup socat.

I tried to have socat on rpi: OS: Raspbian GNU/Linux 11 (bullseye)

and on my docker server, Photon OS 4.0

But on both I get:

2022/11/17 15:00:03 socat[21712] E open("/dev/ttyUSB0", 02, 0666): No such file or directory

Or:

2022/11/17 15:00:35 socat[21749] E open("/dev/ttyACM1", 02, 0666): No such file or director

when I run:

socat TCP-LISTEN:9801,fork,reuseaddr FILE:/dev/ttyUSB0,b9600,raw or socat TCP-LISTEN:9801,fork,reuseaddr FILE:/dev/ttyACM1,b9600,raw

snagytx commented 1 year ago

my intention is to push the messages from EW11 to my socat server, which would put the messages on the serial port to be read by njsPC, but not sure if socat knows how to create a serial port.

rstrouse commented 1 year ago

Why would you add socat in the middle of this communication? All socat does in this instance is create a TCP socket server from the raw RS485 port. That is what the EW11 does so you gain nothing doing that.

The problem here is that when njsPC resets the connection, the EW11 only resets half of it. This can occur when the network loses connection and when on WiFi it can be often especially when that connection is the furthest reaches of the radio network.

njsPC expects to hear from the RS485 feed whether that is over a TCP socket or a hardware port at regular intervals. When it doesn't hear from the port within the Inactivity Timeout interval it will close the connection and recreate it. In this instance the EW11 doesn't fully close the connection even though we send out a FIN packet asking the server to close it. This results in what is referred to as a half-open socket where one side of the socket thinks the connection is closed and the other side thinks it is still open.

So what is occurring in this case is that the TCP connection between the EW11 and njsPC is interrupted for the length of time identified by the Inactivity Timeout but the EW11 never closes the original socket. I am looking at some other potential solutions to this goofy processing.

tagyoureit commented 1 year ago

1.4.3. Multiple Socket The device supports max 5 Socket channels, each socket can work individually as TCP/UDP. Multi Socket simultaneous communication of data stream is illustrated as following

From the manual it looks like it supports up to 5, not just 3. Doubt it will make a difference but can't hurt to bump it up.

rstrouse commented 1 year ago

I think I may have a solution that will force the server to not allow half open sockets. I am extending the options so that they can be tweaked from the client. This should allow us to try out some options when the socket is opened.

rstrouse commented 1 year ago

Pull njsPC and dashPanel. There are now options to not allow half open connections (at least on the njsPC side) and send keep alives to the EW11. This may allow the keep alive options to work on the EW11 so that it does not arbitrarily close the port when set.

snagytx commented 1 year ago

I will do that, once the docker images get build.

rstrouse commented 1 year ago

Don’t forget to hard refresh your browser so the options show up.

snagytx commented 1 year ago

Updated the docker containers and I can see the new options. I will keep an eye on the environment. Overnight I had no issues, with the old docker image and updated configuration for the adapter

image

image

rstrouse commented 1 year ago

Sounds like we are narrowing in on the right settings.

snagytx commented 1 year ago

With the settings from my previous post, I ended up with the issue again.

I tried to use the logging capture feature but I had some issues. Not sure how relevant these logs are.

But I tried to enable/disable the "Primary Port" configuration, I tried to restart the TCP Server on the EW11 adapter, it didn't help. For the TCP Server I increased the Max Accept to 5 and clicked submit, this is how I resstarted the TCP server on EW11.

As mentioned before, I notice that the status of the port was displayed as "closed" but I was able to see the counts for received and successful starts increasing, but once I started to capture the logs, they stopped.

consoleLog(2022-11-18_11-50-15).log config.json.log packetLog(2022-11-18_11-50-15).log

rstrouse commented 1 year ago

Check the Allow Half Open checkbox and try again. We are getting an END event from the EW11. Also increase your Inactivity Retry to 20 seconds. This is greater than the keep alive time.

snagytx commented 1 year ago

Updated to:

image

and keeping:

image

rstrouse commented 1 year ago

I think when it loses wifi connection or changes channels it simply half closes the connection on the server side. This would be fine since njsPC will see that and disconnect the old dead connection then reconnect. Sadly, it doesn't reconnect with a full connection and returns the old half open connection.

Just so you know a half open connection is one where it can receive but not send. Which is why it shows as closed but still receives bytes from the adapter.

Are you by chance running a mesh network?

snagytx commented 1 year ago

Actually, yes I am running a mesh network, using RBR750 and two RBS750.

rstrouse commented 1 year ago

I wonder if Orbi is switching channels. Is any one node close to the pad? There are others running the EW11s and short of the initial config, they have been running fine. I have a pi3 running socat in the garage that also has been running fine for years as well. I too run a mesh network but it is AiMesh.

snagytx commented 1 year ago

I don't think I have a way to check that. Yes, EW11 connects to one of the nodes.

I do realize that most likely there is something specific to my env. I more than happy spend time to test combinations and provide you feedback, but I opened this enhancement, just so that I can work around the issue caused by my environment without really making you spend too much time on it.

rstrouse commented 1 year ago

Let me know how it behaves with the current options we have selected. At some point I may need to get one of these things just to see how many screws are in it.

tagyoureit commented 1 year ago

What about printing out debug messages from the socket? I wonder if that would enlighten us to anything.

rstrouse commented 1 year ago

I’m not sure what we would get out of debug messages. I do know that we are getting a socket end if we try to disable half open sockets (which is expected) but it doesn’t close the socket if Wi-Fi is interrupted. When njsPC tries to reconnect it runs out of available sockets on the device so it reconnects the old half open socket. Maybe if the timeout is enabled on the device with keep alive and allow half open in njsPC it may drop the orphaned connections.