thingsboard / thingsboard-client-sdk

Client SDK to connect with ThingsBoard IoT Platform from IoT devices (Arduino, Espressif, etc.)
MIT License

OTA fails #132

Closed vijayflashbulb closed 1 year ago

vijayflashbulb commented 1 year ago

Hi, we are using version 0.9.7 of the Arduino library on an ESP32-S3. With the example code, the device downloads the OTA update just fine when it is connected to high-speed WiFi, but on a slow connection it fails partway through the download and prints "unable to download firmware" on the serial monitor. Can this be solved by making some changes in the code? Thanks!

MathewHDYT commented 1 year ago

The download of each chunk is retried 5 times. It can help to increase this number when using an inconsistent or weak connection.

Example Line that needs to be edited.

Increase it to the max value of 256 for now. Furthermore, decreasing the Packet Size might help as well. I would try 1024 for now.

vijayflashbulb commented 1 year ago

Thanks for the quick reply, I will try increasing the number of retries.

vijayflashbulb commented 1 year ago

it is still unable to download the OTA even after increasing the number of retries.

MathewHDYT commented 1 year ago

How big is the binary file you are trying to download, and can you try decreasing the packet size to 1024 instead of 4096?

Furthermore, are you using the OTA example from this library as-is, or did you modify it? If you did, can you attach the code as well?

Additionally, can you attach the console output from when the OTA update seems to fail? And does the device really not contain the new firmware after restarting?

It could be that the problem is simply that the cloud state stays on downloading. The updated state is expected to be sent only after the device has restarted and successfully connected to the internet and ThingsBoard, because that state is the confirmation that the update was completely successful.


Additionally, a stupid mistake on my part: you need to change FIRMWARE_FAILURE_RETRIES to 255, not 256. 255 is the maximum value of a uint8_t, because the range includes 0 (0 - 255). If you insert 256, it will wrap back to 0 instead.
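To illustrate the wrap-around, here is a minimal sketch, assuming only that the retry count is stored in a uint8_t as described in this comment. The helper names wrapRetries and clampRetries are hypothetical and not part of the library.

```cpp
#include <cstdint>

// Hypothetical helper: shows what an unchecked narrowing conversion does.
// Converting to uint8_t keeps only the low 8 bits, so 256 becomes 0.
constexpr std::uint8_t wrapRetries(unsigned int requested) {
    return static_cast<std::uint8_t>(requested);
}

// Hypothetical helper: clamps the requested value to the uint8_t maximum
// (255) instead of letting it wrap around.
constexpr std::uint8_t clampRetries(unsigned int requested) {
    return static_cast<std::uint8_t>(requested > UINT8_MAX ? UINT8_MAX : requested);
}

// Checked at compile time.
static_assert(wrapRetries(256U) == 0U, "256 wraps to 0 in a uint8_t");
static_assert(clampRetries(256U) == 255U, "256 is clamped to 255");
```

With a retry count that has wrapped to 0, the setting would do the opposite of what was intended, which is why 255 is the value to use.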

vijayflashbulb commented 1 year ago

(screenshot: serial)

There are no modifications to the example code. The binary file is 1.6 MB. I changed FIRMWARE_FAILURE_RETRIES to 255.

MathewHDYT commented 1 year ago

The expected checksum, i.e. the one received from ThingsBoard, seems invalid. That might be an error as well, but the download not even finishing is definitely the more pressing issue for now.


Both the download log and the error you received make sense. This error message occurs if the device does not get a response from the server within the predefined timeout.

That timeout is 3000 milliseconds (3 seconds). Try increasing it as well; for now let's try 60000 milliseconds, a whole minute, just for testing purposes.

To do that simply replace this line in your example:

const OTA_Update_Callback callback(&progressCallback, &updatedCallback, CURRENT_FIRMWARE_TITLE, CURRENT_FIRMWARE_VERSION, FIRMWARE_FAILURE_RETRIES, FIRMWARE_PACKET_SIZE);

with this line

const OTA_Update_Callback callback(&progressCallback, &updatedCallback, CURRENT_FIRMWARE_TITLE, CURRENT_FIRMWARE_VERSION, FIRMWARE_FAILURE_RETRIES, FIRMWARE_PACKET_SIZE, 60000);

I would also like to know: did the increased failure retries and decreased packet size help you get further into the download, or does it still fail around the same point as before?

vijayflashbulb commented 1 year ago

I replaced the bin file with another one of the same size, just to make sure there is no issue with the file.

(screenshot: serial2)

But the issue is the same even after increasing the timeout to 60000, and increasing the number of retries doesn't help.

(screenshot: serial1)

I also got another issue where the size of the received chunk is 0.

MathewHDYT commented 1 year ago

That is rather confusing, I'm not really sure what the issue could be exactly.

The slower internet does seem to cause problems, but I'm still confused by your checksum and by the fact that it seems to fail around the 30% mark.

Which binary are you using exactly? For testing purposes, would it be possible to upload the binary of the example you are testing the OTA update with to ThingsBoard and try to download that over OTA?

So instead of using your current binary file, upload the one from the build folder of the project containing the OTA example code and download that instead.


If this still doesn't work, we'll probably have to add some logs and try to discern the exact moment and reason why it fails.

vijayflashbulb commented 1 year ago

The bin file I used was created with "export compiled binary" from the sketch menu.

But I don't think the bin file has any issue, because it downloads successfully when the ESP is connected to a high-speed connection.

MathewHDYT commented 1 year ago

By high-speed connection, do you simply mean how far the device is from the access point, i.e. how stable the connection is, or do you actually mean the WiFi speed (2.4 GHz vs. 5 GHz)?


The only explanation I can imagine for the complete failure at around 30% is that you more or less lose the WiFi connection completely and do not reconnect before the timeout occurs, because 1 minute would be a pretty long grace period for downloading 1 KB of data.

Can you add this code to the loop method?

void loop() {
    delay(100);

    Serial.println(WiFi.RSSI());

    delay(900);

    // Other code goes here
}

This will help debug the actual connection strength. There should be a significant drop around the time the OTA update fails.

vijayflashbulb commented 1 year ago

(screenshot: serial3)

The download stopped at 38%. By slow WiFi I mean a mobile hotspot.

MathewHDYT commented 1 year ago

The image isn't optimal, because it starts after the SHA-256 comparison failed. Can you add the text from before the comparison fails?

In general, though, an RSSI of -47 would be a very good, if not great, connection, so it is weird that it causes issues.

vijayflashbulb commented 1 year ago

//connected to some other wifi

Progress 55.78%
[TB] Callback onMQTTMessage from topic: (v2/fw/response/0/chunk/921)
[TB] Receive chunk (921), with size (1024) bytes
Progress 55.84%
[TB] Unable to request firmware chunk
Done, Reboot now
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x4209bed6
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x43c
load:0x403c9700,len:0xbec
load:0x403cc700,len:0x2a3c
SHA-256 comparison failed:
Calculated: 74cb8a0835ff948a23b1fa30f5641bd3d3fc50d3e848150763aab2d791fb9d34
Expected: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Attempting to boot anyway...
entry 0x403c98d8
Connecting to AP ...
.Connected to AP
WiFi RSSI = -66
Connecting to: (dashboard.livair.io) with token (tJyjRl7RczDySiUA4qam)
Firwmare Update...
[TB] Requesting shared attributes transformed from (fw_checksum,fw_checksum_algorithm,fw_size,fw_title,fw_version) into json ({"sharedKeys":"fw_checksum,fw_checksum_algorithm,fw_size,fw_title,fw_version"})
WiFi RSSI = -66
WiFi RSSI = -66
WiFi RSSI = -66
WiFi RSSI = -66
WiFi RSSI = -66
WiFi RSSI = -66
WiFi RSSI = -65
WiFi RSSI = -66
[TB] Callback onMQTTMessage from topic: (v1/devices/me/attributes/response/1)
[TB] Received shared attribute request
[TB] Calling subscribed callback for response id (1)
[TB] {"fw_checksum":"deabff3c2f930e1b597bdcb16b1d9a21","fw_size":1689904,"fw_title":"LIVAIR","fw_checksum_algorithm":"MD5","fw_version":"1.1.1"}
[TB] =================================
[TB] A new Firmware is available:
[TB] (1.0.0) => (1.1.1)
[TB] Attempting to download over MQTT...
Progress 0.06%
[TB] Callback onMQTTMessage from topic: (v2/fw/response/0/chunk/1)
[TB] Receive chunk (1), with size (1024) bytes
[TB] Error during Update.write
[TB] Callback onMQTTMessage from topic: (v2/fw/response/0/chunk/0)
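The numbers in this log can be cross-checked with a little arithmetic. The sketch below uses the fw_size (1689904 bytes) and chunk size (1024 bytes) from the log. It assumes the progress value is simply the current chunk divided by the total number of chunks; that formula is an assumption, not taken from the library source, and the function names are hypothetical.

```cpp
#include <cstddef>

// Number of chunks needed to transfer fwSize bytes, rounded up so that a
// final partial chunk is counted as well.
constexpr std::size_t chunkCount(std::size_t fwSize, std::size_t packetSize) {
    return (fwSize + packetSize - 1U) / packetSize;
}

// Progress in percent after currentChunk of totalChunks chunks
// (assumed formula, see the note above).
constexpr double progressPercent(std::size_t currentChunk, std::size_t totalChunks) {
    return static_cast<double>(currentChunk) * 100.0 / static_cast<double>(totalChunks);
}

// 1689904 bytes in 1024-byte packets: 1650 full chunks plus one partial chunk.
static_assert(chunkCount(1689904U, 1024U) == 1651U, "expected 1651 chunks");
```

Under that assumption, chunk 921 of 1651 is about 55.78% and chunk 922 is about 55.84%, which lines up with the two "Progress" lines printed just before the failure.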

MathewHDYT commented 1 year ago

Weird, I would expect the RSSI to be printed from time to time before the update fails as well. For now, can you move the log message you added in the loop into the OnProgressCallback method and print it there?
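A minimal sketch of what that could look like. The helper formatProgress is hypothetical, and the signal strength is passed in as a parameter so the helper can be tried off-device. The parameter names and the percentage formula are assumptions, not taken from the library.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: builds one log line containing both the download
// progress and the WiFi signal strength.
// rssi is supplied by the caller; on the ESP32 that would be WiFi.RSSI().
std::string formatProgress(std::size_t currentChunk, std::size_t totalChunks, int rssi) {
    // Guard against a division by zero if the total is not known yet.
    const double percent = (totalChunks == 0U)
        ? 0.0
        : static_cast<double>(currentChunk) * 100.0 / static_cast<double>(totalChunks);

    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "Progress %.2f%% WiFi RSSI = %d", percent, rssi);
    return std::string(buffer);
}
```

Inside the progress callback of the sketch, the line could then be printed with something like Serial.println(formatProgress(currentChunk, totalChunks, WiFi.RSSI()).c_str()); so that every progress message carries the RSSI at that moment.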

pablo18393 commented 1 year ago

Hi @vijayflashbulb, I also had that issue. I am performing OTA on an ESP32 over 2G (very slow, around 10 minutes of download). While debugging the code I realized that after approximately 5 minutes the device is disconnected from ThingsBoard (I don't know why). I tweaked the code a little bit to reconnect in the middle of the firmware update. Here are the modifications in ThingsBoard.h:

1- First of all, I declared 2 global variables (TB_host and TB_token) for host and token and stored the values in them when connecting:

const char* TB_host;
const char* TB_token;

// Connects to the specified ThingsBoard server and port.
// Access token is used to authenticate a client.
// Returns true on success, false otherwise.
inline const bool connect(const char *host, const char *access_token = PROV_ACCESS_TOKEN, int port = 1883, const char *client_id = DEFAULT_CLIENT_ID, const char *password = NULL) {
  if (!host) {
    return false;
  }
  // Remember host and token so the connection can be re-established later.
  TB_host = host;
  TB_token = access_token;
  // ... (rest of connect() not shown in the original post)
}

2- I modified the "loop" so it returns false when it is disconnected:

bool loop() {
  return (m_client.loop());
}

3- (In function Firmware_Shared_Attribute_Received): last of all, reconnect during the firmware update if it is disconnected from the server:

    // Download the firmware
    do {
      Serial.println("Requesting chunk " + String(currChunk) + " of " + String(numberOfChunk));
      char topic[detect_size(FIRMWARE_REQUEST_TOPIC, currChunk)];  // Size adjusts dynamically to the current length of the currChunk number to ensure we don't cut it out of the topic string.
      snprintf_P(topic, sizeof(topic), FIRMWARE_REQUEST_TOPIC, currChunk);
      snprintf_P(size, sizeof(size), NUMBER_PRINTF, chunkSize);
      m_client.publish(topic, size, m_qos);
      const uint64_t timeout = millis() + 10000U;  // Amount of time we wait until we declare the download as failed in milliseconds.
      do {
        delay(5);
        if(!loop()){
          Logger::log("Disconnected from server, reconnecting...");        
          connect(TB_host, TB_token);
        }

I am not a code expert, so I assume it can be optimized. Hope it helps, my friend.
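The idea behind step 3, reduced to a self-contained sketch that can be run off-device. FakeClient is a hypothetical stand-in for the MQTT client and waitForChunk is a hypothetical helper; neither is the SDK's real API, and the real code additionally uses a time-based timeout instead of a poll counter.

```cpp
#include <cstdint>

// Hypothetical stand-in for the MQTT client; not the SDK's real class.
struct FakeClient {
    int failingLoops;    // number of loop() calls that report "disconnected"
    int reconnects;      // how often connect() has been called
    bool chunkReceived;  // set once a healthy loop() has delivered the chunk

    explicit FakeClient(int failures)
        : failingLoops(failures), reconnects(0), chunkReceived(false) {}

    // Returns false while the simulated connection is down.
    bool loop() {
        if (failingLoops > 0) {
            --failingLoops;
            return false;
        }
        chunkReceived = true;
        return true;
    }

    // Simulated reconnect that always succeeds.
    bool connect() {
        ++reconnects;
        return true;
    }
};

// Polls the client until the chunk arrives or maxPolls is exhausted,
// reconnecting whenever loop() reports a lost connection.
template <typename Client>
bool waitForChunk(Client& client, std::uint32_t maxPolls) {
    for (std::uint32_t poll = 0U; poll < maxPolls; ++poll) {
        if (!client.loop()) {
            client.connect();  // connection dropped mid-download: reconnect
            continue;
        }
        if (client.chunkReceived) {
            return true;
        }
    }
    return false;
}
```

With a client that drops twice, waitForChunk reconnects twice and then succeeds; if the connection stays down for longer than the poll budget, it gives up and returns false, which corresponds to the download being declared failed.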

MathewHDYT commented 1 year ago

@pablo18393 Thanks a lot for your help. I will include your changes in a commit and try them myself to see if the update works well. I will mention this issue once a Pull Request has been created.