OTA question - Githubissues

diogoribeiro23 commented 4 years ago

Hello @tgsong ! sorry for creating this issue but I would like to ask you a couple questions about the amazon-freertos OTA (more specifically your branch ota/refactorJobParsing) if you didn't mind answering, and this was the only way I found to get in contact directly with you. Can we exchange emails? or you prefer to speak here? (if you actually are available to enlighten me :) ) Thanks in advance!

tgsong commented 4 years ago

Sure, we can discuss here if you don't mind being public. But what brought you here instead of the official repo, https://github.com/aws/amazon-freertos? And you're talking about this branch? https://github.com/aws/amazon-freertos/tree/ota/refactorJobParsing

diogoribeiro23 commented 4 years ago

No problem for me. I wanted to speak with you just to see if you could give me some advice about a problem I am having with the OTA and did not want to create an issue in the official repo for that. And yes, I am talking about that branch. I saw that branch and it triggered my curiosity on the purpose of the branch as I am having problems canceling a OTA job and I tracked the issue down to something resembling the things you are doing but I am not sure. So the first question is: are you solving any problem with this branch regarding canceling OTA? My issue is this: when I am in the middle of an OTA and I send the cancel job event, it is in the WaitingForFileBlockstate and receives an Unexpected event which is eOTA_AgentEvent_ReceivedJobDocument which leads to the OTA Agent into the prvHandleUnexpectedEvents thus not stopping the OTA although it is considered canceled in the AWS IoT Core. I was just hoping you could shine some light on this. I am using the master branch but maybe I should use the last release as the aws_iot_ota_agent is completly different. Thanks for hearing me! :)

tgsong commented 4 years ago

No, I'm not fixing any issue or implementing features on that branch, it's merely refactoring to reduce the code complexity and make it easier to maintain.

To cancel the OTA, typically you just need to call the public API OTA_SetImageState with eOTA_ImageState_Aborted , OTA agent will close the file handle and update the job status on the service side. After this, it will go to the "waiting for job" state. So how are you sending the cancel event? Are you doing it in the demo/application code or modifying some of the agent code?

diogoribeiro23 commented 4 years ago

But that is if you know that you want to cancel the OTA right? what do you check to cancel or not the OTA? I am canceling the event in the AWS IoT Core -> Manage -> Jobs which should update the OTA info and automatically cancel it but it doesn't. I have my own application and i'm using the amazon-freertos as a sub-module so i did not change any code on that. I know for sure that this works in the older version (any of the releases, where the OTA agent code is completly different) as it checks if the event received is job related before anything else. How would you cancel the OTA (without previously knowing, so you couldn't just call OTA_SetImageState directly) ? And I am actually using the release branch and I was refering to the v.1.4.9 tag when I said that code works for my issue.

tgsong commented 4 years ago

So you're saying the job is canceled on the service side? Actually as far as I can remember, our OTA agent implementation doesn't check job status in the middle of a download, whether before 1.4.9 or after that. But we do have a plan to implement it in the upcoming months. Right now the device is expected to timeout and go back to waiting for job state. I can double check the current version and 1.4.9 to confirm and get back to you.

diogoribeiro23 commented 4 years ago

So I was able to test my application with the OTA agent v1.4.9 and it actually does not stop the OTA :\ I thought it did as I had an older firmware version that used that amazon-freertos v1.4.9 version but I don't work on that version anymore for some months now so I did not remembered 100%, my bad! Also I was pretty sure as I found this issue regarding a similar scenario. I thought the if condition in the line 1399 followed by the if conditon in the line 1410 of the aws_ota_agent.c did check for that (in version 1.4.9). I would greatly appreciate if you could test/confirm it as I needed to mix the MQTT v2 with the OTA agent v1.4.9 for my app to work (as I am using the MQTT v2 in the release branch) and in the issue I mentioned they use the MQTT v1. What do you exactly mean by the device is expected to timeout and go back to waiting for job state ?

tgsong commented 4 years ago

That if does not check the job status, it is merely checking if the MQTT message was coming from the job notification topic, which we expect to be the job document.

Regarding to the behavior when canceling the job from service side, here's what I found (note the difference between job and job execution),

When the job execution is not started, canceling the job will cancel the job execution associated with it, device will receive a notification at $aws/things/<thing_name>/jobs/notify-next. We don't need to handle this case on device side since nothing is started. (An OTA job can have multiple job executions, each job execution corresponds to an AWS IoT thing, when the device is updating the job status, it is actually updating this "job execution" status)
If job execution status is already changed to IN_PROGRESS, canceling the outer job will not affect it, device is still able to finish the download, update the job execution status to "SUCCEED" or "FAILED".
If job execution status is already changed to IN_PROGRESS, and you force cancel the job execution (a regular cancel will give you an error, see this link, check the description under --force), device will then not able to update the job execution status, it will receive a response from $aws/things/<thing_name>/jobs/<job_id>/update/rejected. We are not handling this case now. So under this case, the current behavior is that device will keep failing to update the status until it times out (after certain retries), then it will go back to normal state waiting for next job.

diogoribeiro23 commented 4 years ago

In the release version, when I cancel the job in the AWS IoT Core the OTA agent gets an event with a job document which is ignored as there is no handler for this case. Tomorrow I'll be able to confirm the cancel behaviour I'm having with the 1.4.9 (also making sure to try the steps you referenced in the 3rd point above) and I'll let you know.

The behaviour you described is for both release and 1.4.9 versions? Or when you say "We are not handling this case now" you mean you are not handling that case on the new (release) version but you did in the old one (1.4.9)?

What do you have to say regarding the issue I referenced?

Regarding your explanation in the 3rd point, you mean the current OTA agent behaviour is to stop receiving data blocks and time out? Or only times out after the full download?

Again, thank you so much for helping me out with this!

diogoribeiro23 commented 4 years ago

I was now able to test the 1.4.9 version and as I suspected initially, the force cancel on the AWS IoT Core stops the OTA update. This happens on the prvSetImageStateWithReason and prvOTA_Close in the lines 1408-1409 inside the if's I mentioned above. There's also a prvUpdateJobStatus that runs after every data block received in the line 1505.

So if you are saying this does not work on the new (release) version (as I think it doesn't), my only doubt is which should be the expected behaviour exactly when you force cancel a job execution.

tgsong commented 4 years ago

Actually I believe that's not the intent, the issue #77 you mentioned above is to handle the scenario where a job is canceled and a new update is created, basically that if statement is saying, "if we receive a job document and but already have an active job, cancel it". What we actually want is, if we receive a notification telling us that the job is canceled on service side or we want to update the job execution status but rejected by the service because job is already canceled, then abort the current update. It happened to work only because if you force cancel the job on the service side, the notification is sent to the same MQTT topic that we're listening for job document.

diogoribeiro23 commented 4 years ago

Ok, I get what you are saying regarding the issue, I just referenced it to let you know that other people were able to cancel the OTA as well.

This "It happened to work only because if you force cancel the job on the service side, the notification is sent to the same MQTT topic that we're listening for job document." is really close to the first case in bold, no? So all this in the version 1.4.9 is just an "happy" coincidence?

tgsong commented 4 years ago

I believe so, and I also don't think the behavior is correct because with 1.4.9 if you're not canceling the job but instead just create a new job, the device should finish the current one and then pick the new job. But from what I see in code it will cancel the current one and starts the new one (I didn't actually run it though).

Anyway, it's something on our radar, I don't have date but we do plan to implement it soon, together with more feature supported on the service side.

diogoribeiro23 commented 4 years ago

That's great to hear!

So, last question just to get it clear: What should happen if I force cancel a job execution on the current version (release)?

tgsong commented 4 years ago

the download will finish (whether using the streaming service via MQTT or through S3 via HTTP), however, device will not able to update the job execution status, so it won't complete the final step, after certain amount of retries, it will go back to normal waiting for job state.

diogoribeiro23 commented 4 years ago

Ok! I never actually tried to finish a canceled job so I'll try that tomorrow and get back at you. If it works like that, it is fine by me :)

diogoribeiro23 commented 4 years ago

So I tried what you said and it did not timeout on updating the job execution status, actually it updated just fine (weirdly). Here is the log:

4940 909408 [OTA Agent Task] [prvIngestDataBlock] File receive complete and signature is valid.
4941 909409 [OTA Agent Task] [prvStopRequestTimer] Stopping request timer.
4942 909410 [OTA Agent Task] [prvUpdateJobStatus_Mqtt] Msg: {"status":"IN_PROGRESS","statusDetails":{"self_test":"ready","updatedBy":"0x2010002"}}
4943 909411 [OTA Agent Task] [INFO ][MQTT][909411] (MQTT connection 0x2000d998) MQTT PUBLISH operation queued.
4944 909412 [OTA Agent Task] [INFO ][MQTT][909412] (MQTT connection 0x2000d998, PUBLISH operation 0x20014540) Waiting for operation completion.
4947 910414 [OTA Agent Task] [INFO ][MQTT][910414] (MQTT connection 0x2000d998, PUBLISH operation 0x20014540) Wait complete with result SUCCESS.
4948 910415 [OTA Agent Task] [prvUpdateJobStatus_Mqtt] 'IN_PROGRESS' to $aws/things/ef2584f7-4463-59b6-b8df-110f76414466/jobs/AFR_OTA-primary_m0_v2_1_8_bin-DDM_Hardware_v4-20200602_141003/update
4949 910416 [OTA Agent Task] Received eOTA_JobEvent_Activate callback from OTA Agent.

I think this is the "update the job execution status" you were referring to, right?

What is happening is that on processing the last data block inside the prvProcessDataHandler it then goes inside the if's in the respective order: line 998 -> 1007 and then calls the xCompleteCallback in the line 1037 which activates the image. I think it keeps this process even if the update job execution status fails.

Am I doing anything wrong?

tgsong commented 4 years ago

Sorry my bad, I wasn't explaining the whole picture. If you force cancel the job while it's already in process, the device will finish the download, reboot into self test mode (activate the new image), then rollback to previous image because it cannot update the job execution status. This is the case as of release 202002.00.

However on master branch we just added OTA suspend and resume feature, which is going into next release. So now OTA agent will suspend itself if it cannot reach to IoT core (typically because of device disconnected from network, but also could be job is canceled). Then device will try to reconnect to IoT core, and resume the download. In this case, if device did not finish the download before it detects it cannot reach to IoT core (via MQTT ping every certain amount of time), it will go to suspend state, then reconnect, and try to get the job document again, but obviously it couldn't since the job is already canceled. Then it will abort and go back to listening state.

Overall, the behavior is quite complicated right now. But when we add the support to detect job execution status change, it should be much more simpler. Device should simply abort current update right after it receives the notification.

diogoribeiro23 commented 4 years ago

Ok now that makes more sense!

I think I'll just wait until those features reach the release branch, but at least I understand a lot better the overall behavior which is nice :)

Looking forward to those changes! Thank you so much for helping me out!

tgsong / amazon-freertos-examples

OTA question #4