prplfoundation / prplMesh

This repository moved to https://gitlab.com/prpl-foundation/prplmesh/prplMesh
Other
65 stars 31 forks source link

[BUG] Certification: slave reset soon after being operational #1438

Closed rmelotte closed 4 years ago

rmelotte commented 4 years ago

A lot of certification tests failed in the recent nightly runs. By looking at the sheet that holds the nightly tests results, the bug was not yet present in the tests from 6/10/2020 9:33:08 (https://gitlab.com/prpl-foundation/prplMesh/-/jobs/588529913).

It was present in the tests from 6/11/2020 8:51:54 (https://gitlab.com/prpl-foundation/prplMesh/-/jobs/590350508).

The logs always show multiple slave_reset soon after the slave is operational. From agent_wlan0.log:

DEBUG 06:38:38:946 <2002205584> son_slave_thread.cpp[2137] --> send ACTION_BACKHAUL_CLIENT_ASSOCIATED_NOTIFICATION for client 00:50:43:22:56:5f
DEBUG 06:38:38:948 <2002205584> son_slave_thread.cpp[2160] --> non-prlMesh, not adding ClientAssociationEvent VS TLV
DEBUG 06:38:39:231 <2002205584> son_slave_thread.cpp[219] --> monitor socket disconnected - slave_reset()
DEBUG 06:38:39:231 <2002205584> son_slave_thread.cpp[138] --> slave_reset() #1 - start
DEBUG 06:38:39:232 <2002205584> son_slave_thread.cpp[3654] --> removing backhaul_manager_socket
DEBUG 06:38:39:232 <2002205584> son_slave_thread.cpp[3665] --> removing platform_manager_socket
DEBUG 06:38:39:240 <2002205584> son_slave_thread.cpp[3672] --> hostap_services_off() - done
INFO 06:38:39:241 <2002205584> son_slave_thread.cpp[3683] --> fronthaul stop
DEBUG 06:38:39:242 <2002205584> son_slave_thread.cpp[163] --> goto STATE_INIT
DEBUG 06:38:39:244 <2002205584> son_slave_thread.cpp[168] --> slave_reset() #1 - done
rmelotte commented 4 years ago

For reference, the following commits were merged between the 2 jobs:

5b7f439bec4b28ced61883d16c79f738128762da - test_flows: add test for wireless onboarding through CAPI
4fd1e134de606e7e89862b22198f742951b9e9d6 - bwl: dummy: sta_wlan_hal: implement update_status
1faec437661809381f3ade30db4b82776c14d6bd - bwl: dummy: sta_wlan_hal: remove unused fields
d7cb4b6976fb4f242ef8d248d4a2c6f7bb2b60be - bwl: dummy: sta_wlan_hal: properly initialize base_wlan_hal
0bf47542b18a208e04790b370d40b1f56b637bd8 - agent: ucc_listener: add support for wireless backhaul
a82eb354b6dc8c5063752492baa4c068e4fa66d6 - bpl: linux: add config for station interface
ccc4589e8dae16e9e4c77f4a74b39c8147af1745 - gitlab-ci: add jobs for wireless backhaul tests on rax40
eb8a35148438165b947360a44a017bde0bcf836e - test_flows: add dev_reset_default test
c8ff30162285ee91ebe2e0319dee2623cb79ac5e - test_flows: add check_no_cmdu
b3a62b451cd389e6a4843bb8b3f09b9ba09aa6af - tests: move capture filtering to sniffer
9e5edba2a6e05f989a0c3cdda0bcd1f7d9ff07d4 - bwl: update dwpal_bw_to_beerocks_bw
1c4caa227374f0bdada3aec583b450aa9a54c5db - common: Change print condition
bc0fc7963b84390ab360d5adf41137a442cedefd - easylogging: Syslog adaptations
9fd7de891a245d27f0a3e3f0d59c22ae1bfaabf5 - common: Add an option to set thread name on beerocks logging
d36dda29f05ead7ee01d5f2cf207fb07dd575cd9 - cppcheck: Update exist issues file
5d985b1db718bc30fa31090137805cd138d84ed5 - tools: Add cppcheck suppressions list
61b4124263ffcfd27448ff00aae2cd076c1f78f0 - tests: environment.py: support multiple log files
d422e7acda934144d1f3c1af1d89205e4bf47c21 - agent: Attach the Monitor thread ID to the Monitor logger
4a915a5abd4bd1158b4e70c4661c88eff5c22b4d - common: Wrap option to attach logger ID to thread ID on beerocks logging
54493830ddcc4b0b748d2bc9ef67b0541da0c47c - easylogging: Add option to attach thread ID to logger ID
98123b5a679098d17bb84d0472a9790a4b6866d5 - agent: Activate custom logging on monitor files
3c5a4bd9aae5f1086534819f3b311d7a84d52549 - common: Add custom logger ID helper file
8e7299364a305669ed77ee4c30240f9c36dee60f - agent: Move sending of ACTION_MONITOR_JOINED_NOTIFICATION
c2193327513332cdf5c7a39df82207d279ca120d - common: Change initialization of socket thread m_select_wake_up_time
0ebdbd50ad9b23b5223fcdacfb8f27ec9b88f663 - tlvf: Update auto-generated files
fabaf05ca191865d79737b87aea0d23c7352c4c1 - tlvf: Remove APMANAGER_INIT_DONE_NOTIFICATION
6cabc74d2cd5a174cbcfb9c375e2211128fcde67 - agent: Remove ap_manager and monitor stop and start functions
e9a95ea5945a5f0c7e78870971964b09a8b9d53a - agent: Change son_slave FSM to new ap_manager init sequence
ff328a056de8ce2daf5eb581fcb24f359d84cdb8 - agent: Add stop and start function to fronthaul on son_slave
3e93d470950681d400eb18b9faac89426e890680 - agent: Fill ap_manager ATTACHED, ATTACHED and OPERATIONAL states
2867da3165cc252f531e3030ca9bfc7fe28c7ccc - agent: Add ACTION_APMANAGER_CONFIGURE message handler
954471ae06b398dbcf620daa0acd1bad3c9cb297 - agent: Add handler to ACTION_APMANAGER_UP_NOTIFICATION on son_slave
a2178d967be35257a8801381bdeff89feef13fdb - agent: Fill INIT and WAIT_FOR_CONFIGURATION on ap_manager FSM
bdc7008a31ebf022c8a9931a1b3998180fcd4f66 - agent: Add connect_to_agent() to ap_manager
bc96bd7b359c9a46cf0b07a9eacc1035a5acf78b - agent: Add new messages to ap_manager
d406402bbc4946a8aca6c175dcf0878b2778b0af - agent: Bring up ap_manager on fronthaul main file
5d2b3d7e7a43eb24ca5e80c9006a6b41c9e29fa3 - agent: Add FSM skeleton to ap_manager
db27bdc377cfb1eec38feb7ee6119eec373fb538 - tlvf: Remove messages related to old 4addr connection flow
788976273ad0c7cb9f521481ef8eb87d20a404a5 - agent: Remove 4addr pending station mechanism
a67ba1bc6b415532a5731c3ea08b24cceac52173 - agent: Modify ap_manager constructor
33c7153245345677dc0332c4b9c4bf1fa26c7950 - agent: Initialize ap_manager and monitor logger instances
c80b178f3c41c4cec1996046440b71ece845f40e - common: Add support to multiple logger IDs on beerocks_logging
11b46be8153d2a08d1181ac2c6e1969225cdffe8 - agent: Rename "monitor" to "fronthaul" on fronthaul main file
1d5968e117ccb1563a20538a31c66db973348d86 - agent: Make the Agent compile successfully
531185360d5ec0be87fd89ca67ddb05055b6bfca - common: Change executable name on prplmesh_utils.sh
c8196ca42fbd33f09884f6b62a2d335e9830b16a - cmake: Modify monitor CMakeList file
d4aebbcc9cb0dcf80fa54c3aba7db7591f8bfe3c - agnet: Move monitors cmakelist and main.cpp file to upper directory
5db0933cd442e15f867a95ac698768c71094ca0b - cmake: Change add_subdirectory call to "fronthaul_manager"
e44741ce0d612b23237239ffda498ffbbafe717e - agent: Initial commit
rmelotte commented 4 years ago

It corresponds to the following 3 PRs: https://github.com/prplfoundation/prplMesh/pull/1271 https://github.com/prplfoundation/prplMesh/pull/1411 https://github.com/prplfoundation/prplMesh/pull/1401

rmelotte commented 4 years ago

It looks like it already happens with #1271 alone: https://gitlab.com/prpl-foundation/prplMesh/-/jobs/587619917

rmelotte commented 4 years ago

There is at least this problem whenever a station connect (the fronthaul segfaults):

[  859.269678] do_page_fault(): sending SIGSEGV to beerocks_fronth for invalid write access to 76fc1ee8
[  859.277410] epc = 77b98fe8 in libc.so[77b2d000+98000]
[  859.282382] ra  = 77b9acf4 in libc.so[77b2d000+98000]
arnout commented 4 years ago

It doesn't only happen when a station connects, cfr. this simple "is prplmesh up" test which crashes beerocks_fronthaul in a similar way. In fact, fronthaul is restarted two times before we give up:

[57766.511394] do_page_fault(): sending SIGSEGV to beerocks_fronth for invalid write access to 76e34ee8
[57766.519253] epc = 77a0bfe8 in libc.so[779a0000+98000]
[57766.524093] ra  = 77a0dcf4 in libc.so[779a0000+98000]
[57779.647619] do_page_fault(): sending SIGSEGV to beerocks_fronth for invalid write access to 76c98ee8
[57779.655367] epc = 7786ffe8 in libc.so[77804000+98000]
[57779.660322] ra  = 77871cf4 in libc.so[77804000+98000]
[57791.399615] do_page_fault(): sending SIGSEGV to beerocks_cli for invalid read access from 0000001b
[57791.407160] epc = 76f07a48 in libbml.so.1.4.0[76ecd000+b6000]
[57791.412839] ra  = 76f07a48 in libbml.so.1.4.0[76ecd000+b6000]

Oh, sorry, the last one is a crash of beerocks_cli, ignore it.

arnout commented 4 years ago

Silly me, it actually is a station that connects:

DEBUG 15:56:15:156 <monitor> mon_wlan_hal_dwpal.cpp[1036] --> cmd:STA-FIRST wlan0.1, reply: EMPTY
TRACE 15:56:15:156 <monitor> monitor_thread.cpp[330] --> mon_stats.start()
TRACE 15:56:15:157 <monitor> monitor_thread.cpp[338] --> mon_rssi.start()
DEBUG 15:56:15:157 <monitor> monitor_thread.cpp[360] --> Monitor attach process finished successfully!
TRACE 15:56:15:158 <monitor> monitor_thread.cpp[1950] --> Received STA_Connected event
INFO 15:56:15:159 <monitor> monitor_thread.cpp[1956] --> STA_Connected: mac=4c:7c:5f:24:77:21 vap_id=

(last lines of the fronthaul log)