libremesh / lime-packages

LibreMesh packages configuring OpenWrt for wireless mesh networking
https://libremesh.org/
GNU Affero General Public License v3.0

Lime mesh upgrade get_node_status crashes after schedule upgrade #1125

Open selankon opened 1 month ago

selankon commented 1 month ago

After following the LiMe app mesh-wide upgrade wizard, clicking the "schedule upgrade" step does set the scheduled image upgrade, but the lime-mesh-upgrade get_node_status call then crashes:

root@blu:~# echo '{}' | /usr/libexec/rpcd/lime-mesh-upgrade call get_node_status
lua: /usr/lib/lua/lime-mesh-upgrade.lua:488: attempt to compare number with nil
stack traceback:
    /usr/lib/lua/lime-mesh-upgrade.lua:488: in function 'get_node_status'
    /usr/libexec/rpcd/lime-mesh-upgrade:31: in function 'get_node_status'
    /usr/libexec/rpcd/lime-mesh-upgrade:85: in main chunk
    [C]: ?
root@blu:~# ubus call lime-mesh-upgrade get_node_status '{}' 
Command failed: No response

At least one get_node_status call seems to have succeeded, because the "upgrade scheduled" status is shown.

The shared-state info also seems to have stopped updating: the TTL goes down, but nothing else changes:

root@blu:~# shared-state-async dump mesh_wide_upgrade
D 1710401297.287 std::task<int> SharedState::merge(const std::string&, const std::map<std::__cxx11::basic_string<char>, StateEntry>&, const sockaddr_storage&, std::error_condition*) mesh_wide_upgrade got 2 significative changes out of 2 input slice size: 2 state size: 2
[
    {
        "key": "blu",
        "value": {
            "mAuthor": "blu",
            "mTtl": {
                "xint64": 1980,
                "xstr64": "1980"
            },
            "mData": {
                "repo_url": "http://10.13.197.30/lros/",
                "candidate_fw": "LibreRouterOs Test",
                "safeupgrade_start_remining": -1,
                "retry_count": 0,
                "upgrade_state": "READY_FOR_UPGRADE",
                "current_fw": "LiMe napoli-network development (napoli-network rev. 06a0edcc 20240514_0801)",
                "main_node": "MAIN_NODE",
                "node_ip": "10.13.197.30",
                "board_name": "tplink,tl-wdr3600-v1",
                "su_start_time_out": 0,
                "timestamp": 1710400813,
                "eupgradestate": "downloaded",
                "safeupgrade_start_mark": 0
            }
        }
    },
    {
        "key": "node11s",
        "value": {
            "mAuthor": "node11s",
            "mTtl": {
                "xint64": 1986,
                "xstr64": "1986"
            },
            "mData": {
                "repo_url": "http://10.13.197.30/lros/",
                "candidate_fw": "LibreRouterOs Test",
                "safeupgrade_start_remining": -1,
                "retry_count": 0,
                "upgrade_state": "READY_FOR_UPGRADE",
                "current_fw": "LiMe napoli-network development (napoli-network rev. 06a0edcc 20240514_0801)",
                "main_node": "NO",
                "node_ip": "10.13.105.28",
                "board_name": "tplink,tl-wdr3600-v1",
                "su_start_time_out": 0,
                "timestamp": 1710400813,
                "eupgradestate": "downloaded",
                "safeupgrade_start_mark": 0
            }
        }
    }
]

javierbrk commented 1 month ago

OK! This is about safe-upgrade. Maybe the router does not have safe-upgrade installed? At this point only LibreRouter v1 has safe-upgrade running.

488 if (tonumber(utils.unsafe_shell("safe-upgrade confirm-remaining")) > 1) then

We should check for safe-upgrade before we continue. I'll work on that.
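
A minimal sketch of such a check, not the actual fix: it assumes that `safe-upgrade confirm-remaining` prints a single integer when safe-upgrade is present, and that the command's output is empty or non-numeric otherwise. The helper names `safe_upgrade_remaining` and `is_confirm_pending` are hypothetical, and `run` stands in for `utils.unsafe_shell` so the snippet can be tried outside a router.

```lua
-- Hypothetical helper: returns the remaining time as a number, or nil when
-- the command fails or prints anything other than a single integer.
-- `run` is a stand-in for utils.unsafe_shell (assumption for this sketch).
local function safe_upgrade_remaining(run)
    local ok, output = pcall(run, "safe-upgrade confirm-remaining")
    if not ok or type(output) ~= "string" then
        return nil
    end
    local digits = output:match("^%s*(%-?%d+)%s*$")
    if digits == nil then
        return nil
    end
    return tonumber(digits)
end

-- Same comparison as line 488, but a missing value no longer reaches `>`.
local function is_confirm_pending(run)
    local remaining = safe_upgrade_remaining(run)
    return remaining ~= nil and remaining > 1
end

-- Try it with fake shell functions instead of a real router.
print(is_confirm_pending(function() return "600\n" end))
print(is_confirm_pending(function() return "" end))
```

With this shape, a node without safe-upgrade reports "no confirmation pending" instead of raising the "attempt to compare number with nil" error from the traceback above.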

selankon commented 1 month ago

Mesh upgrade should check if safe upgrade is working, and if not, return an error with the error message on any of the mesh upgrade endpoints.

Also, before downloading the firmware, it should check that the firmware is compatible (which is actually implemented via eupgrade) and that safe-upgrade is enabled.
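
One way this could look, as a rough sketch only: wrap every endpoint handler so that a missing safe-upgrade, or any Lua error inside the handler, comes back as an error table instead of crashing the rpcd call. The names `guard_endpoint` and `safe_upgrade_available`, the `error` field, and the message text are all hypothetical and not taken from lime-mesh-upgrade.

```lua
-- Hypothetical wrapper: `handler` is an endpoint function taking the request
-- table; `safe_upgrade_available` is a function returning true or false.
local function guard_endpoint(handler, safe_upgrade_available)
    return function(msg)
        if not safe_upgrade_available() then
            -- Field name and message are assumptions for illustration.
            return { error = "safe-upgrade is not available on this node" }
        end
        -- pcall turns a runtime error into a value the caller can report.
        local ok, result = pcall(handler, msg)
        if not ok then
            return { error = tostring(result) }
        end
        return result
    end
end

-- Usage with stand-in handlers.
local healthy = guard_endpoint(
    function() return { upgrade_state = "READY_FOR_UPGRADE" } end,
    function() return true end
)
print(healthy({}).upgrade_state)

local no_safe_upgrade = guard_endpoint(
    function() return { upgrade_state = "READY_FOR_UPGRADE" } end,
    function() return false end
)
print(no_safe_upgrade({}).error)
```

The point of the wrapper is that `ubus call` would always receive a response carrying the error message, rather than "Command failed: No response".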