mycontroller-org / mycontroller-v1-legacy

The Open Source Controller
http://www.MyController.org
Apache License 2.0
148 stars 90 forks source link

Firmware update keeps restarting #476

Open seant100 opened 5 years ago

seant100 commented 5 years ago

The firmware update is restarting what seems to be every 1 minute. From monitoring the resource logs, it seems to all be working fine then suddenly after what seems like 1 minute (every time its around 1 minute), the node sends a Firmware Request about 5 times for the same block. MyController does not send the packet and then the node restarts with a Firmware Config Request.

seant100 commented 5 years ago

@jkandasa I've updated to latest snapshot release and also updated my gateway to latest mysensors library . Now the firmware updates do seem to be working but they are incredibly slow. It seems each block takes 3 seconds as if there is a deliberate delay added.

Screenshot of resource log

image

jkandasa commented 5 years ago

@seant100 it is all about node request. If you use MQTT gateway, you may feel a bit slow. I request you to disable Stream ack enabled in your gateway settings, if enabled

seant100 commented 5 years ago

@jkandasa I am using MQTT and Firmware updates used to work fine - fast enough. Now its incredibly slow and does not update at all and it seems to restart the Firmware update after a few seconds. I.e. I see in the resource logs "Firmware Config Request" / "Firmware Config Response" then a few packets of "Firmware Request" / "Firmware Response" and then it restarts with a "Firmware Config Request/Response"

The "Ack" enabled checkbox and the "Stream Ack" checkbox does not work at all. The logs fill up with errors about maximum retries. So I have to keep those disabled.

One thing that seems to help was to downgrade the MySensors library to 2.2.0 as 2.3.0. This at least gets the packets transmitted faster again and far less restarts of firmware updates.

I am not sure then if this is a MyController issue, a MySensors 2.3.0 issue or perhaps a MySBootloader issue

jkandasa commented 5 years ago

@seant100 At this moment I'm having only dual optiboot bootloader with RFM69HW and MySensors 2.3.0, I do not see any issue with OTA.

seant100 commented 5 years ago

@jkandasa I am using MySBootloader with RF24 radios and updated gateway and a node to MySensors 2.3.0 and this combination seems to not work at all for firmware updates. Downgrading to MySensors 2.2.0 seems to work but I am still testing.

jkandasa commented 5 years ago

@seant100 I guess MySBootloader does not dependent with MySensors library. It is a separate library, My guess MySensors version will not affect MySBootloader. And this library will be active when the node boots/reboots. after a successful boot, control handover to MySensors library.

https://github.com/mysensors/MySensorsBootloaderRF24

seant100 commented 5 years ago

@jkandasa I have enabled debug logging of gateway messages and see that MyController fails to send a packet and hence the firmware update is restarted. See logged messages here (I added some notes between the messages that identify where the issue is)

2018-10-30 11:38:02,448 Rx: mygateway1-out/162/255/4/0/2 [02004C00F701]
2018-10-30 11:38:02,461 Tx: mygateway1-in/162/255/4/0/3 [02004c00f701F280FF2009F48DC0EC01808181114EC0]
2018-10-30 11:38:02,518 Rx: mygateway1-out/162/255/4/0/2 [02004C00F601]
2018-10-30 11:38:02,533 Tx: mygateway1-in/162/255/4/0/3 [02004c00f601DF92EF92FF920F931F93CF93DF93FC01]

NODE REQUESTS PACKET F501 HERE 3 TIMES

2018-10-30 11:38:05,665 Rx: mygateway1-out/162/255/4/0/2 [02004C00F501]
2018-10-30 11:38:08,664 Rx: mygateway1-out/162/255/4/0/2 [02004C00F501]
2018-10-30 11:38:11,665 Rx: mygateway1-out/162/255/4/0/2 [02004C00F501]

MYCONTROLLER NOT SENDING PACKET F501 

2018-10-30 11:38:14,972 Rx: mygateway1-out/162/255/4/0/0 [02004C00C802C3C60103]
2018-10-30 11:38:17,781 Rx: mygateway1-out/162/255/4/0/0 [02004C00C802C3C60103]

FIRMWARE CONFIG REQUEST TO RESTART FIRMWARE 

2018-10-30 11:38:20,025 Tx: mygateway1-in/162/255/4/0/3 [02004c00f5015F926F927F928F929F92AF92BF92CF92]
2018-10-30 11:38:20,784 Rx: mygateway1-out/162/255/4/0/0 [02004C00C802C3C60103]
2018-10-30 11:38:20,904 Tx: mygateway1-in/162/255/4/0/3 [02004c00f5015F926F927F928F929F92AF92BF92CF92]
2018-10-30 11:38:21,013 Tx: mygateway1-in/162/255/4/0/3 [02004c00f5015F926F927F928F929F92AF92BF92CF92]

FIRMWARE RESTARTED

2018-10-30 11:38:21,069 Tx: mygateway1-in/162/255/4/0/1 [02004C00C802AE40]
2018-10-30 11:38:21,160 Tx: mygateway1-in/162/255/4/0/1 [02004C00C802AE40]
2018-10-30 11:38:21,254 Rx: mygateway1-out/162/255/4/0/2 [02004C00C702]
2018-10-30 11:38:21,291 Tx: mygateway1-in/162/255/4/0/1 [02004C00C802AE40]
2018-10-30 11:38:21,397 Tx: mygateway1-in/162/255/4/0/3 [02004c00c702FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
2018-10-30 11:38:24,556 Rx: mygateway1-out/162/255/4/0/2 [02004C00C602]
2018-10-30 11:38:24,577 Tx: mygateway1-in/162/255/4/0/3 [02004c00c602FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
seant100 commented 5 years ago

@jkandasa CPU is periodically hitting 100% on raspberrypi. It does that for a 10+ seconds then drops down again to around 20/30% and stays in that range for a few minutes then back to 100% for probably 10+ seconds. If the time the CPU is 100% is too long then the firmware response is failing to be sent/processed in time. I suspected some sort of metric aggregation kicking in and reduced the "Raw data" retention to 1 minute. That seemed to help a lot but not 100% solve the problem. So now as a work-around I have unplugged the 2 nodes that push data every second (energy monitor node and weather node) and now the firmware update mostly working ... sometimes it is still restarting though.

The end result of my findings here is that something mycontroller is doing is pushing the "java" process to hit 100% cpu usage and if that remains high too long (say more than about 10 seconds) then the firmware updater in MySBootloader times out and firmware update restarts.

jkandasa commented 5 years ago

@seant100 nice finding! Looks like some performance hit on aggregation job or somewhere. Are you running scripts? Do you see any hit from mycontroller.log?

seant100 commented 5 years ago

@jkandasa I did not see anything in mycontroller.log to identify the issue.

I am running some scripts - they should not be doing anything major enough to cause this though. There are 4 scripts. 3 of them run when a distance measurement comes in from a water tank ultrasound sensor and calculates and sets 2 virtual sensors for percentage and litres. The other script runs every 30 minutes - checks some sensor values and determines whether or not to send a turn on command to a pump. So nothing here does any heavy processing justifying a 100% CPU and they don't run often.

I suspected some java worker / background task - possibly data retention / aggregation. Hence trying with busy nodes turned off to reduce the time needed to aggregate any data that came in. This did seem to help reduce the time the CPU hits 100%. And firmware eventually did update after a few attempts with some restarts. When those nodes are turned on, the issue happens quicker in the firmware update progress and firmware is never updated.

I only have the 2 busy nodes. The other nodes only report once every 2 minutes. So that should not generate enough data to justify 10 seconds of CPU at 100% doing the aggregation.

seant100 commented 5 years ago

@jkandasa Perhaps one option is to not do any aggregation jobs when a firmware update is in progress !

jkandasa commented 5 years ago

@seant100 disabling aggregation job might be super expensive when we enable it back. I will find out other ways. Is it possible to send your database backup file? I want to check the available index.

seant100 commented 5 years ago

@jkandasa I have emailed the database backup to you