nkolban / esp32-snippets

Sample ESP32 snippets and code fragments
https://leanpub.com/kolban-ESP32
Apache License 2.0
2.37k stars 710 forks source link

Bluetooth stack "dies" during disonnect/reconnect #209

Closed Sepp62 closed 6 years ago

Sepp62 commented 7 years ago

I adopted the ble_uart sample in one of my programs and noticed that repeated disconnect/reconnect cycles of my client result in malfunction of the bluetooth connection.

I tried to reproduce the behaviour with the pure BLE_uart-sample delivered with ESP_BLE-library version 0.4.6 and succeeded. The only thing I changed was to change delay(1000) to delay(10) in the loop() and a debug output in "connect" and "disconnect" callbacks.

The only chance to recover from this state is a reboot.

As a client I use "Serial bluetooth Terminal 1.11" for Android.

I attached the output of Serial-channel. It shows "unkown connection id" messages, even before the "disconnect" event occurs. --> ble_uart_output.txt

chegewara commented 7 years ago

There is not much info in this file except 2 errors at the end. E (98214) BT: Unknown connection ID: 3 fail sending notification As you can see this error code is from esp-idf stack and not from esp32-snippets.

Where did you get from library version 0.4.6?

Sepp62 commented 7 years ago

I installed it 2 days ago with ESP32-Arduino environment. I also managed to reproduce the behaviour with another Android client. My board is a "Adafruit ESP32 feather".

Theses are the library properties.

_name=ESP32 BLE Arduino version=0.4.6 author=Neil Kolban kolban1@kolban.com maintainer=Neil Kolban kolban1@kolban.com sentence=BLE functions for ESP32 paragraph=This library provides an implementation Bluetooth Low Energy support for the ESP32 using the Arduino platform. category=Communication url=https://github.com/nkolban/ESP32_BLE_Arduino architectures=esp32 includes=BLE.h, BLEUtils.h, BLEScan.h, BLEAdvertisedDevice.h_

chegewara commented 7 years ago

Ok. Can you try this https://github.com/nkolban/esp32-snippets/blob/master/cpp_utils/ArduinoBLE.md#replacing-the-version-that-comes-with-arduino-esp32

Sepp62 commented 7 years ago

I read this article before but I had difficulties to build the ESP32_BLE.zip file. I do not have the IDF toolchain installed and did not find a way to get an existing version of ESP32_BLE.zip.

chegewara commented 7 years ago

Here is latest version: https://dsm.chege.eu.org/fbsharing/tfMCSlLH

Sepp62 commented 7 years ago

Thanks for the quick delivery !

Looks better, but after approx. 10-15 disconnect/reconnect cycles the connection setup failed again (reboot required). It seems as if probabilty of occurrence is related to the "character rate".

Sepp62 commented 7 years ago

Huh, it's strange. I cannot enable logging. I increased the log level in sdkconfig, tried to put a #define with a new log level into my sketch and even called the function "esp_log_level_set", but nothing changes in my output.

I also tried to find fragments from this error message "E (26372) BT: Unknown connection ID: 3 fail sending notification" in the source code under "Arduino\libraries\ESP32_BLE" and "Arduino\hardware" but nothing of it appears in any source file.

Where does this message come from ?

chegewara commented 7 years ago

Its from esp-idf stack BT: Unknown connection ID

Sepp62 commented 7 years ago

yes, comes from here: \esp-idf\components\bt\bluedroid\bta\gatt\bta_gatts_act.c void bta_gatts_indicate_handle (tBTA_GATTS_CB p_cb, tBTA_GATTS_DATA p_msg)

Sepp62 commented 7 years ago

Test results so far:

I will try to understand the mechanism for "notify" and hopefully can find a workaround, but it seems to be rather complex.

chegewara commented 7 years ago

Before that i would try different approach. But it might be not so easy since you are playing with arduino ide and without installed esp-idf. But lets try. You will have to edit sdkconfig file, but before make a backup. This file can be found in arduino/espressif/esp32/tools/sdk folder. You will have to setup like this:

Now you will have to make few experiments. Try to change this options to pin BT stack and bluedroid stack to the same or different cores:

Of course all experiments you have to do with short delay time. You can also try to make this stack little bit higher: BTC_TASK_STACK_SIZE It is set by default to 3072 but you can raise it a bit.

Sepp62 commented 7 years ago

Thanks for advice.

In between I installed ESP-IDF, but I am not yet familiar with build procecss and toolchain.

I will try your proposals asap, but this will probably take some time.

Sepp62 commented 7 years ago

OK, this is my configuration:

changed existing: CONFIG_FREERTOS_UNICORE=y --> CONFIG_FREERTOS_UNICORE=n removed: CONFIG_BTDM_CONTROLLER_RUN_CPU=0

new:

CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_CHOICE=y
CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_0=n
CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_1=y

exists: CONFIG_BLUEDROID_ENABLED=y new:

CONFIG_BLUEDROID_PINNED_TO_CORE_CHOICE=y
CONFIG_BLUEDROID_PINNED_TO_CORE_0=n
CONFIG_BLUEDROID_PINNED_TO_CORE_1=y

exists: CONFIG_BTC_TASK_STACK_SIZE=8192

Things unfortunately did not improve.

I played with different cores for BTDM and BLUEDROID, but this did not change anything.

The changes in sdkconfig were not reflected in tools/sdk/config/sdkconfig.h. Is that correct ?

chegewara commented 7 years ago

Im not very familiar with arduino, but i think it should have change skdconfig.h. Maybe ask on forum on iirc about this.

nkolban commented 7 years ago

One of the things I'd like to suggest is switching on debugging using Tools > Core Debug Level. At runtime, this should produce a world of good diagnostics that we can use to determine what is going on. If you can run with those enabled we can take a look at the results. Post those results to a pastebin and paste the link to the result here.

image

Sepp62 commented 7 years ago

Sorry, my log was not successful. I had to delete it from here. I must do it more thouroughly.

Sepp62 commented 7 years ago

Well, now we have the situation, we did not want to have: The problem does not occur with detailed log information ("debug" and higher) and with "info" level nothing can be seen from the log.

I shortened delay to 1 ms. With debug log on, there may be (rare) cases where the first reconnect fails. But the second was successful in all my trials. With log on "info" level, reconnect is not possible after 1 or 2 trials and the ESP has be rebooted.

May be one thing is suspicious in the debug log: Some of the error message "unknown connection id" seem to come from a different thread and are not synchronized with the other log outputs. Probably this is normal... I attached a disconnect sequence (from which the program could recover), since I do not have anything better. ble_uart.log

nkolban commented 7 years ago

Thank you kind sir ... studying now. We have a "rash" of these problems ... things fail, we switch on diagnostics to see what is going on internally and they work. Very quantum mechanics. We keep looking for a pattern that results in such but are having a hard time seeing the cause. From the trace you sent me, I was surprised to find "missing" chunks of trace ... I examined the source code and found some errors relating to NOT logging trace in an Arduino environment. These have been fixed in #210. None of these would have caused failures ... just missing trace. If you were to rebuild the BLE C++ classes now we would have more trace ...

There is a way to find out what is going on in your code ... but it isn't pretty. If we imagine that switching on trace causes something to work then we take a copy of our the complete source. Now we look at the trace that is generated. We then go to the corresponding source line and comment out that trace statement, compile and run it again. At some point, we will find that commenting out one of the trace lines causes it to fail again ... and THERE is where the issue will be narrowed to. Looking at your trace, we see it generates hundreds of trace lines ... however there is only about a dozen distinct trace line occurrences that are called over and over again.

Sepp62 commented 7 years ago

Thanks your your fast response and advice.

Just for my understanding: You write: "There is a way to find out what is going on in your code...". I used the standard sample from the ESP-BLE repository (ble_uart) and changed just one (significant) line (delay(1000) --> delay(1)). So everybody on this planet is supposed to have the same effect, when he will change this line.

Or do you think, that there is something wrong with my enviroment ?

I got a current ESP_BLE.zip from cheguevara yesterday and the #ifdef ARDUINO_ARCH_ESP32 is already present in some sources.

Let me know, how I can help to finde the problem.

Sepp62 commented 7 years ago

@chegewara To complete some testing:

I read some articles about how sdkconfig.h is created and finally inserted these switches directly to sdkconfig.h. I tested all permutations of "CORE_0" and "CORE_1" but the problem persists:

define CONFIG_BLUEDROID_PINNED_TO_CORE_CHOICE 1

define CONFIG_BLUEDROID_PINNED_TO_CORE_0 0/1

define CONFIG_BLUEDROID_PINNED_TO_CORE_1 1/0

define CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_CHOICE 1

define CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_0 1/0

define CONFIG_BTDM_CONTROLLER_PINNED_TO_CORE_1 0/1

chegewara commented 7 years ago

I thought it can be fixed with #211 and it is but it is not. I have some success. After first disconnect/reconnect it works but on the second disconnect is not starting advertising. Obviously its not snippets library fault, it is issue with esp-idf BT stack and should be reported on their repo issues.

[D][BLECharacteristic.cpp:636] setValue(): << setValue
[D][BLECharacteristic.cpp:490] notify(): >> notify: length: 1
D[D][BLECharacteristic.cpp:531] notify(): << notify
6 3 fail sendi**ng*  nSoentit fVicalatueio: n
 9E * *(3*
          9470) [D][BLECharacteristic.cpp:629] setValue(): >> setValue: length=1, data=45, characteristic UUID=6e400003-b5a3-f393-e0a9-e50e24dcca9e
[D][BLECharacteristic.cpp:636] setValue(): << setValue
[D][BLECharacteristic.cpp:490] notify(): >> notify: length: 1
B[D][BLECharacteristic.cpp:531] notify(): << notify
T: Unknown c*on**n ecSetniot n VaIDl: ue3: f 7a0il * s**en
                                                          ding [D][BLECharacteristic.cpp:629] setValue(): >> setValue: length=1, data=46, characteristic UUID=6e400003-b5a3-f393-e0a9-e50e24dcca9e
[D][BLECharacteristic.cpp:636] setValue(): << setValue
[D][BLECharacteristic.cpp:490] notify(): >> notify: length: 1
n[D][BLECharacteristic.cpp:531] notify(): << notify
otification
*** Sent Value: 71 ***
                      [D][BLEDevice.cpp:80] gattServerEventHandler(): gattServerEventHandler [esp_gatt_if: 3] ... ESP_GATTS_DISCONNECT_EVT
[D][BLECharacteristic.cpp:629] setValue(): >> setValue: length=1, data=47, characteristic UUID=6e400003-b5a3-f393-e0a9-e50e24dcca9e
[D][BLEUtils.cpp:1611] dumpGattServerEvent(): GATT ServerEvent: ESP_GATTS_DISCONNECT_EVT
[D][BLECharacteristic.cpp:636] setValue(): << setValue
[D][BLEUtils.cpp:1687] dumpGattServerEvent(): [conn_id: 0, remote_bda: 78:1f:db:c0:c7:f3]
[D][BLECharacteristic.cpp:490] notify(): >> notify: length: 1
[D][BLEServer.cpp:175] handleGATTServerEvent(): >> handleGATTServerEvent: ESP_GATTS_DISCONNECT_EVT
[D][BLECharacteristic.cpp:531] notify(): << notify
[D][BLEServer.cpp:326] startAdvertising(): >> startAdvertising
[D][BLEAdvertising.cpp:86] start(): >> start
[D][BLEAdvertising.cpp:105] start(): - no services advertised
[D][BLEAdvertising.cpp:137] start(): << start
[D][BLEServer.cpp:328] startAdvertising(): << startAdvertising
[D][BLEServer.cpp:287] handleGATTServerEvent(): << handleGATTServerEvent
E (39625) BT: Unknown connection ID: 3 fail sending notification
E (39631) BT: Unknown connection ID: 3 fail sending notification
E (39637) BT: Unknown connection ID: 3 fail sending notification
E (39642) BT: Unknown connection ID: 3 fail sending notification
E (39648) BT: Unknown connection ID: 3 fail sending notification
E (39654) BT: Unknown connection ID: 3 fail sending notification
E (39660) BT: Unknown connection ID: 3 fail sending notification
E (39665) BT: Unknown connection ID: 3 fail sending notification
E (39671) BT: Unknown connection ID: 3 fail sending notification
E (39677) BT: Unknown connection ID: 3 fail sending notification
E (39683) BT: Unknown connection ID: 3 fail sending notification
E (39688) BT: Unknown connection ID: 3 fail sending notification
E (39694) BT: Unknown connection ID: 3 fail sending notification
E (39700) BT: Unknown connection ID: 3 fail sending notification
E (39705) BT: Unknown connection ID: 3 fail sending notification
E (39711) BT: Unknown connection ID: 3 fail sending notification
E (39717) BT: Unknown connection ID: 3 fail sending notification
E (39723) BT: Unknown connection ID: 3 fail sending notification
E (39728) BT: Unknown connection ID: 3 fail sending notification
E (39734) BT: Unknown connection ID: 3 fail sending notification
E (39740) BT: Unknown connection ID: 3 fail sending notification
E (39746) BT: Unknown connection ID: 3 fail sending notification
E (39751) BT: Unknown connection ID: 3 fail sending notification
E (39757) BT: Unknown connection ID: 3 fail sending notification
E (39763) BT: Unknown connection ID: 3 fail sending notification
E (39768) BT: Unknown connection ID: 3 fail sending notification
E (39774) BT: Unknown connection ID: 3 fail sending notification
E (39780) BT: Unknown connection ID: 3 fail sending notification
E (39786) BT: Unknown connection ID: 3 fail sending notification
E (39791) BT: Unknown connection ID: 3 fail sending notification
E (39797) BT: Unknown connection ID: 3 fail sending notification
E (39803) BT: Unknown connection ID: 3 fail sending notification
E (39809) BT: Unknown connection ID: 3 fail sending notification
E (39814) BT: Unknown connection ID: 3 fail sending notification
E (39820) BT: Unknown connection ID: 3 fail sending notification
E (39826) BT: Unknown connection ID: 3 fail sending notification
E (39832) BT: Unknown connection ID: 3 fail sending notification
E (39837) BT: Unknown connection ID: 3 fail sending notification
[D][BLEUtils.cpp:1067] dumpGapEvent(): Received a GAP event: ESP_GAP_BLE_ADV_DATA_SET_COMPLETE_EVT
[D][BLEUtils.cpp:1075] dumpGapEvent(): [status: 0]
[D][BLEServer.cpp:134] handleGAPEvent(): BLEServer ... handling GAP event!
[D][BLEUtils.cpp:1067] dumpGapEvent(): Received a GAP event: ESP_GAP_BLE_SCAN_RSP_DATA_SET_COMPLETE_EVT
[D][BLEUtils.cpp:1261] dumpGapEvent(): [status: 0]
[D][BLEServer.cpp:134] handleGAPEvent(): BLEServer ... handling GAP event!
[D][BLEUtils.cpp:1067] dumpGapEvent(): Received a GAP event: ESP_GAP_BLE_ADV_START_COMPLETE_EVT
[D][BLEUtils.cpp:1099] dumpGapEvent(): [status: 0]
[D][BLEServer.cpp:134] handleGAPEvent(): BLEServer ... handling GAP event!
Sepp62 commented 7 years ago

I found some kind of "workround", which might probably help to find a real solution:

I protected the notify-method in the same way as the indicate-method:

  m_semaphoreConfEvt.take( "notify" );

esp_err_t errRc = ::esp_ble_gatts_send_indicate(
        getService()->getServer()->getGattsIf(),
        getService()->getServer()->getConnId(),
        getHandle(), length, (uint8_t*)m_value.getValue().data(), false); // The need_confirm = false makes this a notify.
if (errRc != ESP_OK) {
    ESP_LOGE(LOG_TAG, "<< esp_ble_gatts_send_indicate: rc=%d %s", errRc, GeneralUtils::errorToString(errRc));
    return;
}
    m_semaphoreConfEvt.wait( "notify" );

Then I made the semaphore accessible from the main program:

 class BLECharacteristic {
 public:
  ...
   void giveConfEvtSema() { m_semaphoreConfEvt.give(); }
  ...
  }

The callback looks like this now:

class MyServerCallbacks: public BLEServerCallbacks {
void onConnect(BLEServer* pServer)
{
  Serial.println( "connected" ); 
  deviceConnected = true;
  pCharacteristic->giveConfEvtSema();
};
void onDisconnect(BLEServer* pServer) {
  Serial.println( "disconnected" ); 
  deviceConnected = false;
  pCharacteristic->giveConfEvtSema();
}

};

Of course, the notify-method is now "blocking" in the same way as the indicate-mehod. This means data throughput might be lower than with the current code. On the other hand, there is no queue to buffer the characters, so I ask myself how the current code avoids overflows. Probably it has no other chance as to throw away characters which cannot be sent.

I removed the delay command completey and switched off all logging. I could not observe any hangups anymore. To improve data throughput, I send 20 characters at a time.

nkolban commented 7 years ago

You are GOOD!!!

We need to look at the esp_ble_gatts_send_indicate() method now and see what its semantics are. For many of the underlying ESP-IDF APIs we get a callback event to indicate that the function has completed. This allows us to serialize ESP-IDF API calls so that we don't try and make two calls simultaneously.

The story is that the ESP-IDF provides a low level API for working with BLE. These C++ classes attempt to build a higher level story upon that . The ESP-IDF is itself built on a 3rd party library called "bluedroid" that implements the BLE protocol at a lower level and "bluedroid" is itself mapped onto the Bluetooth hardware again provided by ESP32. This is quite a deep stack. The (somewhat) documented API and the API used by these C++ classes is the ESP-IDF API. In that API there is currently silence on whether or not multiple BLE APIs can be executed simultaneously. Ideally they should be and would then be queued for execution internally. I wasn't 100% sure if that was allowed so I implemented "serialization" as best I could ... but Ive always wondered if it was even necessary. In the example you have, I'm not sure that there is an implicit trigger that says the notification/indication has completed ... and that's where we'll look now.

Sepp62 commented 7 years ago

Thanks for your detailed explanation. In between, I installed esp-if and can compile the sources for this project as well as for ESP_BLE. I also followed the discussion about multiple BLE instances.

I assume, that the notify-method must be "secured" by a semaphore also in a more sophisticated solution as mine.

As a "user" of the BLE interface in an Arduino environment, I would like to have a function which can check, whether the API is "ready to get data" without a blocking wait. Arduino is not multi-threaded and though it is possible to create tasks with FreeRTOS, "normal" arduino coders are not used to it and are also not able to handle it.

Unfortunately the FreeRTOS-wrapper of ESP_BLE does not deliver a return code for "Semaphore::take": void FreeRTOS::Semaphore::take(std::string owner)

This means, that one cannot poll the semaphore to check, whether the API is "ready". The original FreeRTOS function is able to be polled (xTicksToWait is set to "0"): http://www.freertos.org/a00122.html

Excerpt: xTicksToWait The time in ticks to wait for the semaphore to become available. The macro portTICK_PERIOD_MS can be used to convert this to a real time. A block time of zero can be used to poll the semaphore.

Returns:pdTRUE if the semaphore was obtained. pdFALSE if xTicksToWait expired without the semaphore becoming available.

So my proposal ist to change the FreeRTOS wrapper in the following way: bool FreeRTOS::Semaphore::take(std::string owner)

I am aware, that this might not work for the "pthread_mutex_lock"-mode, so it probably make sense to make a separate method like this (log output omitted):

/**
 * @brief Check, if a semphore is taken by another thread
 * Try to take a semaphore w/o waiting and return true, if it has been already taken by another thread.
 * @return true - if sempahore has been taken by another thread.
 */
bool FreeRTOS::Semaphore::isTaken() {
  bool bTaken = true;
    ESP_LOGV(LOG_TAG, "Semaphore taken check" );

    if (m_usePthreads) {
        assert(false);
      ESP_LOGV(LOG_TAG, "Semaphore::IsTaken illegal mode usePthreads");
    } else {
        if( ( bTaken = xSemaphoreTake(m_semaphore, 0) ) )
      xSemaphoreGive( m_semaphore );
      ESP_LOGV(LOG_TAG, "Semaphore taken:  %d", !bTaken);
    }
    return !bTaken;
} // Semaphore::take

As a consequence, the BLECharacteristic class must expose a method to check the semaphore state (and probably with the connnect/disconnect state) --> i.e. bool BLECharacteristic::isReadyForData()

Another proposal is, that connect/disconnect will be routed through the BLECharacterstics class, so that the semaphore needs not to be exposed by the class interface. But this is minor important.

Sepp62 commented 7 years ago

I made all my proposed changes into the attached file.

Though I also cloned the Github repository I dare not to provide my files via Github, since I do not have any experience with it. I dont want to destroy anything (I must admit, that I am a TFS user)

ESP_BLE_Issue209.zip

Sepp62 commented 7 years ago

I proceeded with testing my changes, but the result is not yet 100% satisfying.

With the synchronization of notify and confirmation, the transmissions seems generally to to stable.

But data rate is far away from the theoretical maximum (see https://punchthrough.com/blog/posts/maximizing-ble-throughput-on-ios-and-android).

I manage to transmit ~7 packets per second with a payload of 20 bytes. Playing with the packet length could improve the situation but my "guts" say, there is still something improveable in the interaction between application and BLE-stack.

Additionally there is still something wrong with my synchronisation, so that the main loop is blocked as long as the client is disconnected.

Sepp62 commented 7 years ago

I tried to figure out, how a notification call takes its way to the stack. I think, I found it out for the "outgoing" direction. There are two queues involved. The way back for the "confirmation event" is not yet clear for me. "Confirmation" in this context does not mean that it comes from the client (since notification do not have one), but there is some confirmation for the notify-call which is received in BLECharacteristic::handleGATTServerEvent.

It starts from the notify-call:

…\Arduino\libraries\ESP32_BLE\src\BLECharacteristic.cpp void BLECharacteristic::notify()

Call …\esp-idf\components\bt\bluedroid\api\esp_gatts_api.c esp_err_t esp_ble_gatts_send_indicate(...)

Pack into queue: btc_transfer_context(…) C:\esp-idf\components\bt\bluedroid\btc\core\btc_task.c btc_task_post(btc_msg_t *msg, task_post_t timeout)

Queue receiver: C:\esp-idf\components\bt\bluedroid\btc\profile\std\gatt\btc_gatts.c btc_gatts_call_handler(btc_msg_t *msg) case BTC_GATTS_ACT_SEND_INDICATE: BTA_GATTS_HandleValueIndication(…)

C:\esp-idf\components\bt\bluedroid\bta\gatt\bta_gatts_api.c void BTA_GATTS_HandleValueIndication (…)

Pack into queue with param BTA_GATTS_API_INDICATION_EVT: bta_sys_sendmsg(p_buf); Posts to queue: if (btu_task_post(SIG_BTU_BTA_MSG, p_msg, TASK_POST_BLOCKING) != TASK_POST_SUCCESS)

Queue receiver: C:\esp-idf\components\bt\bluedroid\bta\gatt\bta_gatts_main.c bta_gatts_hdl_event(BT_HDR *p_msg) case BTA_GATTS_API_INDICATION_EVT: bta_gatts_indicate_handle(p_cb, (tBTA_GATTS_DATA *) p_msg);

void bta_gatts_indicate_handle (tBTA_GATTS_CB *p_cb, tBTA_GATTS_DATA *p_msg)

call with mode „notification“: status = GATTS_HandleValueNotification (…) C:\esp-idf\components\bt\bluedroid\stack\gatt\gatt_api.c

Call: attp_send_sr_msg (p_tcb, p_buf); with parameter GATT_HANDLE_VALUE_NOTIF

Call of Layer 2-function: C:\esp-idf\components\bt\bluedroid\stack\gatt\att_protocol.c attp_send_msg_to_l2cap (p_tcb, p_msg);

Sepp62 commented 7 years ago

Now, I traced the way of the information flow for the confirmation event. In our problematic case, all starts here:

…\esp-idf\components\bt\bluedroid\bta\gatt\bta_gatts_act.c void bta_gatts_indicate_handle (tBTA_GATTS_CB *p_cb, tBTA_GATTS_DATA *p_msg)

This is, where the last seen error log is issued: APPL_TRACE_ERROR("Unknown connection ID: %d fail sending notification", p_msg->api_indicate.hdr.layer_specific);

In case of error or when no data confirmation is needed, the confirmation callback is called: (*p_rcb->p_cback)(BTA_GATTS_CONF_EVT, &cb_data);

Callback is routed through the event handlers: …\esp-idf\components\bt\bluedroid\btc\profile\std\gatt\btc_gatts.c void btc_gatts_cb_handler(btc_msg_t *msg) case BTA_GATTS_CONF_EVT: gatts_if = BTC_GATT_GET_GATT_IF(p_data->req_data.conn_id); btc_gatts_cb_to_app(ESP_GATTS_CONF_EVT, gatts_if, &param); and is supposed to land in our C++ code (I lost the traces here).

But this never happens. My trace looks like this: [E][BLECharacteristic.cpp:412] handleGATTServerEvent(): ----- ESP_GATTS_CONF_EVT ------ [E][BLECharacteristic.cpp:412] handleGATTServerEvent(): ----- ESP_GATTS_CONF_EVT ------ [E][BLECharacteristic.cpp:412] handleGATTServerEvent(): ----- ESP_GATTS_CONF_EVT ------ E (42742) BT: Unknown connection ID: 3 fail sending notification [E][BLECharacteristic.cpp:424] handleGATTServerEvent(): ----- ESP_GATTS_DISCONNECT_EVT ------ [E][BLECharacteristic.cpp:424] handleGATTServerEvent(): ----- ESP_GATTS_DISCONNECT_EVT ------ E (42765) BT: Unknown connection ID: 3 fail sending notification

After the first "fail sending notification", there still is a confirmation event. Obviously, there is still something in the queue and there are more notifications, which again case an error "fail sending notification".

But in this last case, there is no confirmation event received anymore. The result is, that the confirmation event semaphore is not freed and my code hangs.

Unfortunately this does not happen, when log level is "debug".

So the big question is, why the confirmation callback is not received in the "disconnected" case.

chegewara commented 7 years ago

Unfortunately this does not happen, when log level is "debug".

If you will find somehow reason why it works with particulat debug level please let us know. Its a mystery. For example, ive got an issue and ive found that i can "fix it" by adding this line in one and only one place in code: ESP_LOGE(LOG_TAG, "").

It has to be log error, and can be with null log message.

PS log level does not matter in this case

Sepp62 commented 7 years ago

I tried to compare logs in "bad case" and "good case". In found, that in "good case" the "fail sending notification error" occurs only once and not twice as in the bad case.

I attached the comparison in a file. ble_uart_good_bad.txt

Unfortunately, up to now, I cannot give you any resonable feedback to the logging issue.

BTW: I found a sleep-command in the semaphore code, which probably makes "effects" more opaque.

/**
 * @brief Give a semaphore.
 * The Semaphore is given.
 */
void FreeRTOS::Semaphore::give() {
    ESP_LOGV(LOG_TAG, "Semaphore giving: %s", toString().c_str());
    if (m_usePthreads) {
        pthread_mutex_unlock(&m_pthread_mutex);
    } else {
        xSemaphoreGive(m_semaphore);
    }
#ifdef ARDUINO_ARCH_ESP32
    FreeRTOS::sleep(10);
#endif
    m_owner = std::string("<N/A>");
} // Semaphore::give
Sepp62 commented 7 years ago

I finally got my application working !

here's what made the breakthrough: 1) I removed "StartAdvertising()" from the callback of the disconnect event in BLEServer. Instead, I synchronized the disconnect state to my main thread and called StartAdvertising() at the transition from connect to disconnect. 2) I removed the sleep(10) call in the semaphore class (see one post above)

Omitting the sleep caused disappearing the "fail sending notification", but transmission now seems to be "rough" which means that strings in my terminal program seem to come somehow erratic.

I am still not at the end...

I am not (yet) familiar enough with the ESP-arduino code to give definite advice. But probably you now have sufficient information to make a robust solution. If my code can be helpful, I will put it here (unless not already done).

Sepp62 commented 7 years ago

After removing the sleep command and firing packets "full throttle", I get the system into the "congestion" state. This happens every 20-30 packets (roughly estimated) and in spite of my notify/confirmation synchronization.

This is reported from Bluetooth L2CAP and I assume, that a queue is running over. The situation is not depending on the number of bytes, but of the number of packets.

Placing a delay(10) into my loop() and reducing the packet rate by this way, the effect is gone. The data rate is not affected, I still have ~7 packets per seconds.

This means, a controlling program should limit the packet rate by making a break of at least 10 ms between two packets. Additionally sync'ing notify/confirmation is necessary.

I tried to handle the suggestion state by reducing packet rate during the congestion, but this did not help. It is easier to take care for a break of 10ms between packets.

Sepp62 commented 7 years ago

Changed sources: ESP32_BLE_Issue209.zip

Sepp62 commented 7 years ago

After one more evening with testing, I got the following result:

So, this sequence worked for me:

  1. Notify all instances about disconnect
  2. Wait for 500ms
  3. StartAdvertising()

I will post another message with my conclusions. My activities in this matter have been terminated for the time being.

Sepp62 commented 7 years ago

In summary, these are my conclusions on the problem:

  1. Synchronize the notify() command with the confirmation event in the BLECharacterstics class. Free the event semaphore also on "connect" and "disconnect".
  2. Remove the Sleep(10) command from the FreeRTOS::Semaphore::give() method. This not only costs precious CPU time, it also has unwanted effects on the message flow behaviour.
  3. Give the coder a chance to check, whether the API is ready to transmit data without blocking the calling thread (=loop()). I called this method "isReadyForData".
  4. Don't call StartAdverstising() too early after a disconnect.
  5. For the API user: Dont call notify too often. Even with all the synchronisation, I got the bluetooth communication layer into congestion situations. A delay of 10ms after a notify call (or some other code with does useful thongs) prevents this situation.

Hopefully, these results are helpful to improve the code.

Last not least many thanks to nKolban, chegewara and all the others for their precious work on the ESP arduino project !

Sepp62 commented 7 years ago

I have my changes ready to provide them as "pull request". My questions to @nkolban/@chegewara:

nkolban commented 7 years ago

Howdy my friend ... to make permanent changes to the repository, you fork the code base found here:

https://github.com/nkolban/esp32-snippets

this would give you a repository of your own ... perhaps called

https://github.com/sepp62/esp32-snippets

You then make the desired code changes in YOUR copy of the esp32-snippets which would be a copy of the one your forked. When your code changes are made, you then go back to:

https://github.com/sepp62/esp32-snippets

and issue a "Pull Request".

That then indicates to the repository :

https://github.com/nkolban/esp32-snippets

That you wish your code merged into the master. We then execute a "Merge" against the repository:

https://github.com/nkolban/esp32-snippets

and all the changes you made get merged into the master.

For the master esp32-snippets ... there is only one branch "master" ... there could be unlimited forks... that simply means that folks have chosen to fork the code either for local tinkering or for eventual merging but history shows that a Github fork doesn't mean that it will be eventually merged back in.

The changes that you make to your copy of the code have to be able to be merged ... you will be warned if your changes aren't mergable and it is your responsibility to ensure that they merge cleanly. The only other rules are ensure that if you change code, it compiles 100% cleanly and works to the best of your knowledge. Don't reformat the source file (eg tabs to spaces or indentation changes), try and add as much commentary as possible in the code and in the pull request.

My assurance is that when you issue a pull request, I will turn it around as fast as possible. Too many times folks issue pull requests against repositories and they just lie there for ever ... that makes a contributor sad as his/her work goes un-noticed for the longest time.

The descriptions above are 100% normal/classic Github usage. If any of the above makes you nervous, then STOP ... go and have a good read at Github ... its boring reading and complex ... but code maintenance and distributed merging really is boring and complex :-)

If after studying Github you still don't feel comfortable, feel free to post back and I for one will be willing to work with you directly to hand merge your changes. It would probably take us about 20-30 minutes and I'd want to do it live with you on a screen share. I'd take a zip of your source code that you would send me outside of Github, I'd then run some tools over it to determine what has changed ... and then I would manually merge it in with your review and discussion. The last thing I want to do is turn you away. Contributions from everyone are VERY welcome.

nkolban commented 6 years ago

Howdy @Sepp62 ... can you walk me through the intent of the function:

BLECharacteristic#isReadyForData()

Reading the code, it seems to query the semaphore called "m_semaphoreConfEvt" and returns true if it is open and false if it is taken.

Ok ... so now I'm thinking ... what does isReadyForData mean? I'm not sure I like the name of the function ... what does isReadyForData mean to a reader? If I am understanding its intent it is really saying "Has a previous indicate() or notify() event completed?" Is that your understanding? If yes, then I'm not sure isReadyForData is a good name for the function ... it is too vague/loose.

Further, as I read the code, if I am understanding correctly, the m_semaphoreConfEvt is taken before an indicate/notify and released when an indicate/notify completes. Lets think about this for a moment. In the implementation of the BLE C++ classes, nothing is asynchronous. Meaning that when we perform a BLE API call, we explicitly block waiting for that call to return. If we look at the code of indicate()/notify() we see ...

  1. We take the m_semaphoreConfEvt
  2. We perform indicate/call
  3. We wait for the corresponding semaphore to be unlocked
  4. We return from the indicate()/notify()

What this means is that there is never a time where within a single thread, isReadyForData() will return false because we always give() the semaphore before returning from the indicate()/notify() logic.

See also #259

Sepp62 commented 6 years ago

Hi @nkolban, you are completely right with all your assumptions above. It must have been middle of the night, when I made this...

As a consequence, some more code is obsolete. I suggest, that I clean it up, correct the "notify"-examples and test the whole thing anew. Then I make another pull request.

Since I am a little bit busy with other things right now, give me one week. Otherwise, if you want to clean it up by yourself, do not hesitate...

Sorry for inconvenience!

But there is another thing. I would like to discuss at this occasion:

nkolban commented 6 years ago

Howdy @Sepp62 ... there is no great rush. You tell me how you would like to work. I'm keen to make everyone feel welcome in this project and the last thing I want to do is to appear to "step all over" the work of my new friends and colleagues. If you code something and ask for my help ... I'll be there. If you code something and it breaks ... then I'll be chat with you and see how it can be fixed. If you can't fix it or ask for help ... then I'll jump in and fix it (or comment it out). With github, code is never lost and we can always "step back in time" to undo changes made in the past.

However, by the same token, if you make code changes and you "enjoy" making them, I don't want you to lose the feeling of pleasure or ownership in having made them so you won't find me stepping on your code for no reason.

The story I try and follow is "service to the end customer" and those are the users of the BLE classes. Whether I code something or you code something, if its is broken for an end user, that is bad and needs fixed as soon as possible ... ideally properly and not a "hack-job" ...

So ... take your time. If you won't make changes ... let me know and I'll make them. If you want to make changes but it will be some time in the future ... then work with the person suffering with the problem and ask them if that will be ok. If it isn't, make that known and someone else in our community can have a go.

As for the other issues ... we'll track those one at a time .. ideally in an issue that is named after the puzzle being tracked.

Sepp62 commented 6 years ago

@nkolban: issue #266 opened for "threading discussion"

NimaaZx commented 5 years ago

I think it was for synchronization between client side and notify. For Fixing Stuck onDisconnect after many reboots in BLE_Client Example before BLEDevice::init("") add "delay(2000)".