h2zero / NimBLE-Arduino

A fork of the NimBLE library structured for compilation with Arduino, for use with ESP32, nRF5x.
https://h2zero.github.io/NimBLE-Arduino/
Apache License 2.0
670 stars 138 forks source link

NimBLE_Scan_Continuous.ino possible heap memory leak? #591

Open nravanelli opened 9 months ago

nravanelli commented 9 months ago

I have been noticing a potential memory leak in the continuous scan example. I have only added the following to the end of void loop():

Serial.print("HEAP: ");
Serial.println(ESP.getFreeHeap());
delay(2000);

And I see heap memory use reduce continuously:

14:15:03.247 -> HEAP: 190336
14:15:05.233 -> HEAP: 190208
14:15:07.242 -> HEAP: 190208
14:15:09.245 -> HEAP: 190208
14:15:11.230 -> HEAP: 190208
14:15:13.242 -> HEAP: 190208
14:15:15.221 -> HEAP: 190080
14:15:17.219 -> HEAP: 190080
14:15:19.245 -> HEAP: 190080
14:15:21.244 -> HEAP: 189952
14:15:23.245 -> HEAP: 189952
14:15:25.247 -> HEAP: 189952
14:15:27.215 -> HEAP: 189952
14:15:29.229 -> HEAP: 189952
14:15:31.248 -> HEAP: 189952
14:15:33.254 -> HEAP: 189952
14:15:35.086 -> Advertised Device: Name: , Address: 34:cc:2b:5e:5b:86, manufacturer data: 4c000906036a0a640627 
14:15:35.211 -> HEAP: 189952
14:15:37.245 -> HEAP: 189952
14:15:38.702 -> Advertised Device: Name: , Address: e8:fc:60:39:21:0b, manufacturer data: 4c00121900e46fc9dc373dd41da5665fc8d8293373c06494dc5b4d0100 
14:15:39.217 -> HEAP: 189824
14:15:41.251 -> HEAP: 189824
14:15:43.235 -> HEAP: 189824

Is there something that is not being cleared properly in the example?

JustTryingToGetSomeWorkDone commented 8 months ago

Sorry, I'm not very experienced with BLE and have problems of my own, but here's what I've noticed: It looks like it always decreases by 72 bytes every 16~18 seconds or so whether or not a new advertised device is found. It's a pretty simple example program in my opinion. The only call I'm not familiar with is; NimBLEDevice::setScanDuplicateCacheSize(200);

I've never used this function, but it seems like it would be used to save the previously discovered mac addresses to compare against newly discovered mac addresses. Looking at the code backs that up. I see nothing obvious (which is pretty much all I would notice given my programming skills), but I'm guessing it has to do with the duplicate entries, which are not being reported due to being duplicates, are somehow still taking up heap space.

Sorry I can't help more than that. If you're up to it, here's the part of the code where the function is located. Maybe you can track it down if you really need: Starts on line 503. https://github.com/h2zero/NimBLE-Arduino/blob/release/1.4/src/NimBLEDevice.cpp

Before I go, I just noticed the comment above that function... here's the whole block from the above link:

/**
 * @brief Set the duplicate filter cache size for filtering scanned devices.
 * @param [in] cacheSize The number of advertisements filtered before the cache is reset.\n
 * Range is 10-1000, a larger value will reduce how often the same devices are reported.
 * @details Must only be called before calling NimBLEDevice::init.
 */
/*STATIC*/
void NimBLEDevice::setScanDuplicateCacheSize(uint16_t cacheSize) {
    if(initialized) {
        NIMBLE_LOGE(LOG_TAG, "Cannot change scan cache size while initialized");
        return;
    } else if(cacheSize > 1000 || cacheSize <10) {
        NIMBLE_LOGE(LOG_TAG, "Invalid scan cache size; min=10 max=1000");
        return;
    }

    m_scanDuplicateSize = cacheSize;
}

"* @param [in] cacheSize The number of advertisements filtered before the cache is reset.\n" The way this is worded makes me think each advertisement, whether it is the first or tenth time seeing it, gets added to the cache. Then after 200 duplicates, the cache gets dumped and the merry-go-round starts again. Each duplicate is recorded, so you could be seeing 100 advertisements from one nearby device taking up cache/heap but not being reported because it's not new.

If you try a smaller cache size or watch the heap longer, you might see the cache get dumped and restart. Then the heap would suddenly get smaller before looking like it is leaking again. At least that's the impression I get from that comment, and it backs up my initial guess. If it is recording duplicate advertisements to the cache instead of deleting duplicates, that would explain this behavior. Maybe a better method would involve deleting duplicates instead of just adding them to the cache if this is indeed the case. Sorry I can't help more than this, but it may be the intended functionality. We'll have to wait for word from h2zero.

h2zero commented 8 months ago

Thanks, I can't say i've noticed but I will test and get back

scripter-co commented 8 months ago

I've also noticed something similar. I recently migrated 5 devices to use NimBLE instead of the "native" Arduino BLE and all devices are exhibiting the same behaviour. I have put together a sample of the heap size over the last ~12 hours here.

It's worth noting that I know it's related to the NimBLE refactor because it's the only thing that was changed between these releases.

The relevant code is:

void connectToThermometer()
{
  NimBLEDevice::init("");
  NimBLEDevice::setPower(ESP_PWR_LVL_P9);

  pBLEScan = NimBLEDevice::getScan();
  pBLEScan->setAdvertisedDeviceCallbacks(new MyAdvertisedDeviceCallbacks(), false);
  pBLEScan->setInterval(97);
  pBLEScan->setWindow(37);
  pBLEScan->setMaxResults(0);
  pBLEScan->setActiveScan(true);
  startBLE();
}

void startBLE()
{
  pBLEScan->start(0, nullptr);
}

bool isBLEStopped()
{
  return pBLEScan->isScanning() == false;
}

inside the loop:

...
  if (isBLEStopped())
  {
    Serial.println('scanning failed, restarting ble');
    startBLE();
    delay(2000);
  }
...
scripter-co commented 7 months ago

looking into this a bit further, it seems that m_advertisedDevicesVector inside of NimBLEScan.cpp slowly is increasing over time.

I've added logs into the file:

                 NIMBLE_LOGI(LOG_TAG, "New advertiser: %s", advertisedAddress.toString().c_str());
+                NIMBLE_LOGI(LOG_TAG, "Number of vectors: %d", pScan->m_scanResults.m_advertisedDevicesVector.size());
             } else if (advertisedDevice != nullptr) {
                 NIMBLE_LOGI(LOG_TAG, "Updated advertiser: %s", advertisedAddress.toString().c_str());

and i can see that over a while, this number slowly increases (using the code shown above).

I added the following:

diff --git a/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp b/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
index d1c4879..bce89a1 100644
--- a/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
+++ b/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
@@ -23,7 +23,7 @@
 #include <climits>

 static const char* LOG_TAG = "NimBLEScan";
-
+static const size_t maxAdvertisedDevices = 2;

 /**
  * @brief Scan constuctor.
@@ -110,6 +110,11 @@ NimBLEScan::~NimBLEScan() {
                     return 0;
                 }

+                if (pScan->m_scanResults.m_advertisedDevicesVector.size() >= maxAdvertisedDevices) {
+                    delete pScan->m_scanResults.m_advertisedDevicesVector.front();
+                    pScan->m_scanResults.m_advertisedDevicesVector.erase(pScan->m_scanResults.m_advertisedDevicesVector.begin());
+                }
+
                 advertisedDevice = new NimBLEAdvertisedDevice();
                 advertisedDevice->setAddress(advertisedAddress);
                 advertisedDevice->setAdvType(event_type, isLegacyAdv);

to test the theory that this was the issue and it did indeed make the memory to stabilise, so everything seems to be pointing to this vector being the problem (and old entries not being deleted + erased correctly). The patch above is obviously not a permanent fix but should help to implement a long term solution.

h2zero commented 7 months ago

I have looked into this and the cause is due to active scanning and how it is handled. When a device is advertising with a scan response available the device is added to the devices vector and awaits the scan response data before calling the callback and is not deleted from the vector until the scan response is received. The problem occurs when the scan response is never received, which happens quite regularly. To work around this (for now) I suggest stopping and restarting the scan occasionally to clear out the vector, or do not enable active scanning. I will contemplate options for the future to resolve this internally.

nravanelli commented 7 months ago

Thanks @h2zero for giving more clarity here. I need to maintain active scanning for various reasons, so would the temporary solution be to simply call pBLEScan->stop(); after each active scan? or should we restart the BLE stack completely?

In the documentation, the final pBLEScan->start() object states that when false, it clears the previous scan... is that not correct or does it simply not reuse the previous scan results?

h2zero commented 7 months ago

I would suggest using a timer that triggers say every 15 or 20 mins and in the callback stop the scan with pBLEScan->stop();.

In the documentation, the final pBLEScan->start() object states that when false, it clears the previous scan... is that not correct or does it simply not reuse the previous scan results?

The documentation is correct, when the is_continue parameter is false the vector will be cleared before the scan is started.

I would also add that you will need to delay for a couple milliseconds after stopping the scan before restarting for the stack to cleanup.

scripter-co commented 7 months ago

I have looked into this and the cause is due to active scanning and how it is handled. When a device is advertising with a scan response available the device is added to the devices vector and awaits the scan response data before calling the callback and is not deleted from the vector until the scan response is received. The problem occurs when the scan response is never received, which happens quite regularly. To work around this (for now) I suggest stopping and restarting the scan occasionally to clear out the vector, or do not enable active scanning. I will contemplate options for the future to resolve this internally.

That was the conclusion I came to as well. Would an option to clear out items from the vector that haven't advertised in N seconds?

I would suggest using a timer that triggers say every 15 or 20 mins and in the callback stop the scan with pBLEScan->stop();.

The reason I migrated to NimBLE is because i found that starting and stopping BLE on the ESP32 eventually causes a OWDT, restart (and using HTTP + BLE at the same time wasn't possible without using NimBLE so had to stop BLE before making a HTTP request). I'd half narrowed it down to the continuous starting/stopping of BLE (although I was doing this every 1 minute instead of 15/20 as suggested in this thread).

For me, I can disable active scanning for now, see how the devices behave and report back. Obviously this doesn't help @nravanelli.

h2zero commented 7 months ago

I'm thinking about implementing a scan response timer that will invoke the advertised device callback and then remove it from the vector when max devices is 0 if the scan response doesn't come within 500ms.

scripter-co commented 7 months ago

For me, I can disable active scanning for now, see how the devices behave and report back. Obviously this doesn't help @nravanelli.

just to follow up, as expected disabling active scanning does indeed behave normally and there's no memory leak.

h2zero commented 7 months ago

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

nravanelli commented 7 months ago

Ill give it a go now, thanks!

nravanelli commented 7 months ago

@h2zero I have been running an ESP32 with your new fixes to active scan for a little over 40 minutes and HEAP use has stayed constant at 229108 with the same example continuous scan with active scanning.

So I'd say this is resolved!

scripter-co commented 7 months ago

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

looks good. question though, if max_results != 0 it seems like this memory leak would still occur?

nravanelli commented 7 months ago

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

looks good. question though, if max_results != 0 it seems like this memory leak would still occur?

Indeed, the error persists when ‘max_results !=0’. In my actual code I iterate through the scan results after the scan times out and I can see the heap leak persists. When used within the example “scan continuous” example it is fixed

h2zero commented 7 months ago

I was making the assumption that if max_results is not 0 that the scan would stop and be restarted at some point, so not continuous. When restarting the scan the results are typically cleared at that time. If the scan is not restarted then yes the memory would still be occupied, in that case a call to clearResults should be done.

scripter-co commented 7 months ago

i'm struggling to find it, could you point out the line where maxResults != 0 would cause the vector to be cleared? or are you saying that the client (consumer of NimBLE) would clear the results themselves?

h2zero commented 7 months ago

Not sure I follow? The process is that if scanning continuously with max results = 0 that after the callback is called the device is deleted from the vector. If max results is not 0 devices will be added to the vector until the scan stops and only deleted when the scan is started again, with the isContinue flag set to false.