golioth / golioth-zephyr-sdk

Golioth SDK For Zephyr
https://www.golioth.io
Apache License 2.0
66 stars 19 forks source link

samples/dfu: clear dfu update_ctx before observing #336

Closed ncmiller closed 1 year ago

ncmiller commented 1 year ago

This fixes an issue where DFU cannot proceed due to:

golioth_dfu: Ignoring new desired firmware, as downloading already started

This issue can occur if DFU gets interrupted by a disconnect. When DFU starts, the downloading_started field of struct dfu_ctx is set to true, but it is never cleared after that point. Upon reconnection, golioth_fw_observe_desired() will be called again, but since the downloading_started flag has already been set, DFU cannot proceed.

Clearing the update_ctx before calling golioth_fw_observe_desired() will ensure the flag is cleared, and will result in the DFU process starting from the beginning if a client disconnects and reconnects.

Signed-off-by: Nick Miller nick@golioth.io

github-actions[bot] commented 1 year ago

Visit the preview URL for this PR (updated for commit ce5684b):

https://golioth-zephyr-sdk-doxygen-dev--pr336-nick-dfu-ctx-cle-p0wnuw1n.web.app

(expires Fri, 10 Feb 2023 00:55:36 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: a389eefadf4b4b68a539327b3459dd66c142cf49

ncmiller commented 1 year ago

Good points.

IMO we should either implement retry logic for every step (e.g. by state machine) or leave the sample as is. Trying to solve one case (especially by introducing undefined behavior because of race condition) should not be an option, hence blocking this PR for now.

Yeah, I don't want to make things more complicated, so I'd sooner leave the sample as is.

For more context, this was a workaround that fixed an issue @szczys encountered on an nRF91 DK (cellular), where he was unable to execute a DFU:

uart:~$ *** Booting Zephyr OS build v3.1.99-ncs1  ***
[00:00:00.501,861] <inf> golioth_system: Initializing
[00:00:00.508,666] <inf> fs_nvs: 2 Sectors of 4096 bytes
[00:00:00.508,666] <inf> fs_nvs: alloc wra: 0, f98
[00:00:00.508,666] <inf> fs_nvs: data wra: 0, d0
[00:00:07.477,905] <dbg> golioth_dfu: main: Start DFU sample
[00:00:07.477,966] <inf> golioth_samples: Waiting for interface to be up
[00:00:07.477,996] <inf> golioth_system: Starting connect
[00:00:08.712,432] <inf> golioth_system: Client connected!
[00:00:09.128,479] <dbg> golioth_dfu: golioth_desired_update: Desired
                                      a3 01 1a 63 dc 36 17 02  78 40 37 65 30 30 30 63 |...c.6.. x@7e000c
                                      34 38 37 61 36 31 36 65  31 32 64 34 64 31 63 32 |487a616e 12d4d1c2
                                      38 62 31 62 66 36 64 35  61 32 37 37 39 35 39 38 |8b1bf6d5 a2779598
                                      62 61 66 35 32 33 61 37  36 35 36 39 66 62 35 33 |baf523a7 6569fb53
                                      61 37 66 39 36 37 35 61  64 39 03 81 a6 01 64 6d |a7f9675a d9....dm
                                      61 69 6e 02 65 31 2e 32  2e 33 03 78 40 35 37 66 |ain.e1.2 .3.x@57f
                                      66 37 62 62 34 37 39 63  30 38 66 30 35 38 31 65 |f7bb479c 08f0581e
                                      63 30 62 30 64 65 37 62  35 63 61 37 63 35 63 34 |c0b0de7b 5ca7c5c4
                                      64 30 64 63 62 35 64 30  31 61 65 39 62 37 34 36 |d0dcb5d0 1ae9b746
                                      64 65 33 32 62 39 34 33  64 65 34 61 31 04 1a 00 |de32b943 de4a1...
                                      04 13 78 05 70 2f 2e 75  2f 63 2f 6d 61 69 6e 40 |..x.p/.u /c/main@
                                      31 2e 32 2e 33 06 67 6d  63 75 62 6f 6f 74       |1.2.3.gm cuboot  
[00:00:09.128,814] <inf> golioth: Manifest sequence-number: 1675376151
[00:00:09.133,941] <inf> golioth_system: Reconnect per request
[00:00:09.133,941] <err> golioth_dfu: Error while receiving desired FW update: -110
[00:00:09.133,972] <err> golioth_dfu: Error while receiving FW data: -110
[00:00:09.134,002] <wrn> golioth: req_sync finished with error -110
[00:00:09.134,033] <err> golioth_dfu: Failed to update to 'downloading' state: -110
[00:00:09.138,366] <inf> golioth_system: Starting connect
[00:00:10.152,832] <inf> golioth_system: Client connected!
[00:00:10.527,496] <dbg> golioth_dfu: golioth_desired_update: Desired
                                      a3 01 1a 63 dc 36 17 02  78 40 37 65 30 30 30 63 |...c.6.. x@7e000c
                                      34 38 37 61 36 31 36 65  31 32 64 34 64 31 63 32 |487a616e 12d4d1c2
                                      38 62 31 62 66 36 64 35  61 32 37 37 39 35 39 38 |8b1bf6d5 a2779598
                                      62 61 66 35 32 33 61 37  36 35 36 39 66 62 35 33 |baf523a7 6569fb53
                                      61 37 66 39 36 37 35 61  64 39 03 81 a6 01 64 6d |a7f9675a d9....dm
                                      61 69 6e 02 65 31 2e 32  2e 33 03 78 40 35 37 66 |ain.e1.2 .3.x@57f
                                      66 37 62 62 34 37 39 63  30 38 66 30 35 38 31 65 |f7bb479c 08f0581e
                                      63 30 62 30 64 65 37 62  35 63 61 37 63 35 63 34 |c0b0de7b 5ca7c5c4
                                      64 30 64 63 62 35 64 30  31 61 65 39 62 37 34 36 |d0dcb5d0 1ae9b746
                                      64 65 33 32 62 39 34 33  64 65 34 61 31 04 1a 00 |de32b943 de4a1...
                                      04 13 78 05 70 2f 2e 75  2f 63 2f 6d 61 69 6e 40 |..x.p/.u /c/main@
                                      31 2e 32 2e 33 06 67 6d  63 75 62 6f 6f 74       |1.2.3.gm cuboot  
[00:00:10.527,526] <wrn> golioth_dfu: Ignoring new desired firmware, as downloading already started

By memseting the update_ctx, he was able to complete the DFU. It's not addressing the root cause of the issue. It's just a workaround that worked for his application, and I thought it might make sense to put that workaround into our dfu sample, since this code is often copy/pasted into user applications.

Another thing is to solve the root cause, which is resulting in reconnections in the early phase of the sample. Do we know why is that happening and how reproducible it is?

The root cause is that Reconnect per request happens immediately after the initial connection (reproducible 100% of time on nRF91 DK). The reconnect happens because when credentials are loaded via golioth_settings_set(), there is an unconditional call to client_request_reconnect(), which causes a reconnect on the first timeout of the main loop in the system_client thread (golioth_system_client_main()). So if a DFU happens to start before the reconnect happens (race condition), then you will run into this issue where the DFU gets interrupted.

I'm okay closing this PR and opening a separate GitHub issue for the Reconnect per request.

Does that sound okay?

mniestroj commented 1 year ago

I have reproduced this issue and already see what is the root cause. @ncmiller Thanks for providing all the information here. I'll have a PR ready tomorrow, so let's wait a bit with any further actions for now.

ncmiller commented 1 year ago

Given that https://github.com/golioth/golioth-zephyr-sdk/pull/338 fixes the root cause of the issue, this PR is no longer needed. Closing.