srcfl / srcful-gateway

Srcful energy gateway to connect your solar inverter and mine src-token
https://srcful.io
MIT License
5 stars 2 forks source link

Monitor gw service problems #81

Open h0bb3 opened 8 months ago

h0bb3 commented 8 months ago

In some cases the services do not seem to start (or suddenly stop). I.e. there is no failures in the logs or anything like that.

Service restart takes care of the problem.

One hypothesis is that this is related to power cycling. Erik had a power incident and the gw booted fine but the services did not seem to start as per above. PM has frequent power problems and the services do not seem to start always after that.

Let's keep track of this and be noteful if this happens in development.

The symptom is simply logging that stops after a certain time, i.e. services are not running.

Maybe we should have a log heartbeat signal that outputs something to the log. Maybe it is possible to check the status of the services via balena cli.

davmoz commented 7 months ago

Observation: Linus (5513384) has a Huawei and his GW suddenly stopped reading. Upon SSHing in, all containers seemed to be running.

Last output:

[Logs]    [2024-01-24T18:06:58.031Z] [web] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): testnet.srcful.dev:443
[Logs]    [2024-01-24T18:06:58.320Z] [web] DEBUG:urllib3.connectionpool:https://testnet.srcful.dev:443 "POST /gw/data/ HTTP/1.1" 200 0
[Logs]    [2024-01-24T18:06:58.752Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:06:58.918Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:06:58.920Z] [web] DEBUG:server.tasks.srcfulAPICallTask:Thead is finished: calling _on200
[Logs]    [2024-01-24T18:06:58.921Z] [web] Response: <Response [200]>
[Logs]    [2024-01-24T18:07:00.485Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:01.331Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:01.331Z] [web] task is in the past adjusting time
[Logs]    [2024-01-24T18:07:01.710Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:01.879Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:02.806Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:03.190Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:03.823Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:04.430Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:04.811Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:04.979Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:06.123Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:06.491Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:06.492Z] [web] task is in the past adjusting time
[Logs]    [2024-01-24T18:07:06.871Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:07.040Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:07.723Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
[Logs]    [2024-01-24T18:07:08.096Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 32064-31
[Logs]    [2024-01-24T18:07:08.726Z] [web] DEBUG:server.inverters.InverterTCP:OK - Reading Holding: 30000-35
davmoz commented 7 months ago

use dis https://github.com/kmaork/madbg

davmoz commented 7 months ago

Observation: Linus (5513384) has a Huawei and his GW suddenly stopped reading. Upon SSHing in, all containers seemed to be running.

While ssh in the device and in venv, we could not get a response from the inverter. resp was of type ModbusIOException.

Would be interesting to see if the server would respond to http requests while in this state.

h0bb3 commented 7 months ago

I think #31 is a part of this.

h0bb3 commented 5 months ago

Only wierd thing that happened during this sprint was the test eGW that had very poor wifi connection and did not manage to update properly and semed to get stuck... also the HOST_IP was reported as NULL for some reason.

Poor connectivity is something to look out for and could cause a lot of problems, especially when updating

h0bb3 commented 5 months ago

One eGW (01239d34ba0bb9a601) that just stopped restart fixed it. Nothing strange in the logs. Should have tried to get a response but forgot and just restarted.

[Logs]    [2024-03-28T10:55:42.003Z] [web] INFO:server.tasks.harvest:Response: <Response [200]>
[Logs]    [2024-03-28T10:55:44.697Z] [web] DEBUG:server.tasks.harvest:Harvest took 1164 ms
[Logs]    [2024-03-28T10:55:48.016Z] [web] DEBUG:server.tasks.harvest:Harvest took 1117 ms
[Logs]    [2024-03-28T10:55:51.338Z] [web] DEBUG:server.tasks.harvest:Harvest took 1409 ms
[Logs]    [2024-03-28T10:55:53.869Z] [web] DEBUG:server.tasks.harvest:Harvest took 1122 ms
[Logs]    [2024-03-28T10:55:56.403Z] [web] DEBUG:server.tasks.harvest:Harvest took 1119 ms
[Logs]    [2024-03-28T10:56:03.416Z] [web] DEBUG:server.tasks.harvest:Harvest took 5850 ms
[Logs]    [2024-03-28T10:56:10.381Z] [web] DEBUG:server.tasks.harvest:Harvest took 1115 ms
[Logs]    [2024-03-28T10:56:21.165Z] [web] DEBUG:server.tasks.harvest:Harvest took 1366 ms
[Logs]    [2024-03-28T10:56:30.442Z] [web] DEBUG:server.tasks.harvest:Harvest took 1168 ms
[Logs]    [2024-03-28T10:56:31.852Z] [web] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): testnet.srcful.dev:443
[Logs]    [2024-03-28T10:56:32.367Z] [web] DEBUG:urllib3.connectionpool:https://testnet.srcful.dev:443 "POST /gw/data/ HTTP/1.1" 200 0
[Logs]    [2024-03-28T10:56:32.711Z] [web] INFO:server.tasks.harvest:Response: <Response [200]>
[Logs]    [2024-03-28T10:56:39.012Z] [web] DEBUG:server.tasks.harvest:Harvest took 1208 ms
[Logs]    [2024-03-28T10:56:46.602Z] [web] DEBUG:server.tasks.harvest:Harvest took 1120 ms
[Logs]    [2024-03-28T10:56:53.511Z] [web] DEBUG:server.tasks.harvest:Harvest took 1120 ms
[Logs]    [2024-03-28T10:56:59.844Z] [web] DEBUG:server.tasks.harvest:Harvest took 1235 ms
[Logs]    [2024-03-28T10:57:05.283Z] [web] DEBUG:server.tasks.harvest:Harvest took 1079 ms
[Logs]    [2024-03-28T10:57:10.568Z] [web] DEBUG:server.tasks.harvest:Harvest took 1327 ms
[Logs]    [2024-03-28T10:57:14.784Z] [web] DEBUG:server.tasks.harvest:Harvest took 1009 ms
[Logs]    [2024-03-28T10:57:18.988Z] [web] DEBUG:server.tasks.harvest:Harvest took 1134 ms
[Logs]    [2024-03-28T10:57:23.832Z] [web] DEBUG:server.tasks.harvest:Harvest took 2305 ms
[Logs]    [2024-03-28T10:57:27.184Z] [web] DEBUG:server.tasks.harvest:Harvest took 1047 ms
[Logs]    [2024-03-28T10:57:28.472Z] [web] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): testnet.srcful.dev:443
[Logs]    [2024-03-28T10:57:28.985Z] [web] DEBUG:urllib3.connectionpool:https://testnet.srcful.dev:443 "POST /gw/data/ HTTP/1.1" 200 0
[Logs]    [2024-03-28T10:57:29.332Z] [web] INFO:server.tasks.harvest:Response: <Response [200]>
[Logs]    [2024-03-28T10:57:31.422Z] [web] DEBUG:server.tasks.harvest:Harvest took 1137 ms
[Logs]    [2024-03-28T10:57:34.871Z] [web] DEBUG:server.tasks.harvest:Harvest took 850 ms
[Logs]    [2024-03-28T10:57:38.928Z] [web] DEBUG:server.tasks.harvest:Harvest took 1548 ms
[Logs]    [2024-03-28T10:57:41.795Z] [web] DEBUG:server.tasks.harvest:Harvest took 1318 ms
[Logs]    [2024-03-28T10:57:45.202Z] [web] DEBUG:server.tasks.harvest:Harvest took 1938 ms
h0bb3 commented 4 months ago

The main issue atm seems to be the crypto chip that some times stops responding giving an invalid signature and is refused in the backend.

h0bb3 commented 3 months ago

we had one eGW that could not connect to its inverter due to it changing ip. This could be handled with the scan endpoint.