project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.34k stars 1.97k forks source link

[1.1] When the subscription slot is full, reboot the device, and then make a new subscription request, and the device will call abort . #27677

Open aatr0xx opened 1 year ago

aatr0xx commented 1 year ago

Reproduction steps

1.Use an Android device to initiate a subscription to the device until the subscription slot of the device is full.(CONFIG_MAX_FABRICS 5) 2.reboot the device. 3.Make a new subscription request.

Bug prevalence

Everytime

GitHub hash of the SDK that was being used

49bad0f

Platform

esp32

Platform Version(s)

No response

Type

Core SDK Crash

Anything else?

There is the crashed task’s callstack.

`panic_abort (details=details@entry=0x3fcbf8ac "abort() was called at PC 0x4202bbe3 on core 0") at /home/aatrox/workspace/yee_esp_matter/esp-idf/components/esp_system/panic.c:408 408 ((volatile int ) 0) = 0; // NOLINT(clang-analyzer-core.NullDereference) should be an invalid operation on targets

0 panic_abort (details=details@entry=0x3fcbf8ac "abort() was called at PC 0x4202bbe3 on core 0") at /home/aatrox/workspace/yee_esp_matter/esp-idf/components/esp_system/panic.c:408

1 0x4038aac6 in esp_system_abort (details=details@entry=0x3fcbf8ac "abort() was called at PC 0x4202bbe3 on core 0") at /home/aatrox/workspace/yee_esp_matter/esp-idf/components/esp_system/esp_system.c:137

2 0x40393506 in abort () at /home/aatrox/workspace/yee_esp_matter/esp-idf/components/newlib/abort.c:46

3 0x4202bbe6 in chipAbort () at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/lib/support/CodeUtils.h:508

4 chip::ConstFabricIterator::operator* (this=) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/credentials/FabricTable.h:277

5 chip::app::InteractionModelEngine::EnsureResourceForSubscription (this=this@entry=0x3fca0578 , aFabricIndex=, aRequestedAttributePathCount=1, aRequestedEventPathCount=0) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/app/InteractionModelEngine.cpp:925

6 0x4202c4f4 in chip::app::InteractionModelEngine::OnReadInitialRequest (this=this@entry=0x3fca0578 , apExchangeContext=apExchangeContext@entry=0x3fc9ca40 <chip::Server::sServer+6944>, aPayloadHeader=..., aPayload=..., aInteractionType=) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/app/InteractionModelEngine.cpp:530

7 0x4202c87c in chip::app::InteractionModelEngine::OnMessageReceived (this=0x3fca0578 , apExchangeContext=0x3fc9ca40 <chip::Server::sServer+6944>, aPayloadHeader=..., aPayload=...) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/app/InteractionModelEngine.cpp:732

8 0x4203b01c in chip::Messaging::ExchangeContext::HandleMessage (this=this@entry=0x3fc9ca40 <chip::Server::sServer+6944>, messageCounter=67188560, payloadHeader=..., msgFlags=..., msgBuf=...) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/messaging/ExchangeContext.cpp:612

9 0x4203b85c in chip::Messaging::ExchangeManager::OnMessageReceived (this=0x3fc9ca10 <chip::Server::sServer+6896>, packetHeader=..., payloadHeader=..., session=..., isDuplicate=, msgBuf=...) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/transport/raw/MessageHeader.h:151

10 0x4203ef36 in chip::SessionManager::SecureUnicastMessageDispatch (this=0x3fc9af88 <chip::Server::sServer+104>, partialPacketHeader=..., peerAddress=..., msg=...) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/transport/SessionManager.cpp:758

11 0x4203f37c in chip::SessionManager::OnMessageReceived (this=0x3fc9af88 <chip::Server::sServer+104>, peerAddress=..., msg=...) at ../../../../connectedhomeip/config/esp32/third_party/connectedhomeip/src/transport/SessionManager.cpp:564

12 0x00000000 in ?? ()`

bzbarsky-apple commented 1 year ago

src/credentials/FabricTable.h:277

On that SHA that's in const FabricInfo & operator*() const on this line:

        VerifyOrDie(!IsAtEnd());

where the caller is src/app/InteractionModelEngine.cpp:925, which is:

        for (const auto & fabric : *mpFabricTable)
bzbarsky-apple commented 1 year ago

I tried reproducing this after forcing CHIP_CONFIG_MAX_FABRICS to make it easier to fill up the fabric table, and I am not hitting this...

Are you commissioning the device into 5 fabrics? Or just one fabric? Is your server compiled with subscription persistence enabled or disabled?

If you can reproduce reliably, can you debug or add logging to see why we are ending up dereferencing an out-of-bounds iterator like that?

bzbarsky-apple commented 1 year ago

@aatr0xx See above.

aatr0xx commented 1 year ago

1.Only one fabric.

  1. Subscription persistence is enabled.
aatr0xx commented 1 year ago

About debugging, how can I help you.

bzbarsky-apple commented 1 year ago

@aatr0xx Well, where is the iterator actually pointing when we fail the assert? And how does it get into that state?

I've been trying to reproduce in the "one fabric, subscription persistence enabled" situation with the server running as a Mac application, and I am not hitting the assert...

zzz2016 commented 1 year ago

Could it be a firmware issue triggered by the subscription method on our app?

Currently, our app uses single-property subscriptions, such as switches, brightness, and other attributes. We subscribe to each property individually. Additionally, we have set subscribeParams.keepPreviousSubscriptions to YES because we found that setting this parameter to NO in previous versions of iOS resulted in slow subscription updates.

bzbarsky-apple commented 1 year ago

I don't know. Here are the exact steps I used to test just now:

  1. Set CHIP_CONFIG_MAX_FABRICS to 5 in examples/chip-tool/include/CHIPProjectAppConfig.h.
  2. Compile all-clusters-app on Mac.
  3. Run the app.
  4. Run chip-tool pairing code .... to commission it as node id 17.
  5. Run chip-tool interactive start.
  6. Run onoff subscribe on-off 100 200 17 1 --keepSubscriptions true 15 times, which I believe fills up the subscription table.
  7. Kill all-clusters-app.
  8. Restart all-clusters-app.
  9. Run onoff subscribe on-off 100 200 17 1 --keepSubscriptions true once, wait for it to time out on the old session.
  10. Run onoff subscribe on-off 100 200 17 1 --keepSubscriptions true a second time.

The server does not crash. So the first question is: when following these steps using command-line chip-tool against your crashing server app, do you get a crash? If not, does some modification of these steps reproduce the crash?

zzz2016 commented 1 year ago

I noticed that in your reproduction steps, there is no mention of power cycling the device.

Our reproduction steps are as follows: (We did not modify CHIP_CONFIG_MAX_FABRICS, so its default value should still be 5)

  1. The app starts and subscribes to 3-4 different attributes: switch, brightness, color, and a custom attribute called effectMode.
  2. Kill the app process.
  3. Repeat steps 1-2 approximately 4-5 times to fill the queue.
  4. Important: Power cycle the device and ensure that the app continues to send subscription requests while running in the foreground (Autoresubscribe = True).
  5. The device will keep rebooting after power cycling.

Today, we will provide a reproduction video for reference.

bzbarsky-apple commented 1 year ago

I don't know that a video would help. I believe you it's happening!

What would help are either:

  1. Logs that show the exact pattern of SubscribeRequest messages, with their payloads, that the server sees, so I can try reproducing that pattern.
  2. You debugging this on the server you have access to that reproduces the problem.
zzz2016 commented 1 year ago

@bzbarsky-apple We can reproduce this issue using Home, but we are unable to reproduce it through the CHIP-Tool.

Specific steps:

  1. Pair the device with HomePod Mini and Home.
  2. Power cycle the HomePod 16 times, with approximately 2 minutes interval each time, ensuring that the HomePod Mini initializes and successfully establishes a subscription link with the device.
  3. Then restart the device. The HomePod must be online while restarting the device.
  4. There is a chance that the device will restart, but it is not guaranteed.

Reproducing the issue through Home has a few differences compared to our app:

Regarding your mention of "Logs that show the exact pattern of SubscribeRequest messages, with their payloads," how should we record this part of the logs?

Here are our codes about subscription:

MTRSubscribeParams *subscribePrarams = [[MTRSubscribeParams alloc] init];
    subscribePrarams.autoResubscribe = @YES;
    subscribePrarams.keepPreviousSubscriptions = @YES;
    [self getConnectedDevice:deviceId.integerValue failed:result success:^(MTRBaseDevice * _Nonnull device) {
        [device subscribeAttributeWithEndpointId:endpointId clusterId:clusterId attributeId:attributeId minInterval:@0 maxInterval:@30 params:subscribePrarams clientQueue:dispatch_get_main_queue() reportHandler:^(NSArray<NSDictionary<NSString *,id> *> * _Nullable values, NSError * _Nullable error) {

    }];

Below is the debug-level log from our app that causes the device to reboot: Failed to get required resources by evicting existing subscriptions. crash_log.log

Below is the log from home that causes the device to reboot You can use keyword 'abort' to locate the issue: p0_log-2023-07-12_10_10_43.log

zzz2016 commented 1 year ago

If you need our app to better reproduce the issue, we can provide you with a customized app to support ESP development boards.

bzbarsky-apple commented 1 year ago

I noticed that in your reproduction steps, there is no mention of power cycling the device.

It was steps 7 and 8 in my reproduction steps, but not that important.

how should we record this part of the logs?

On the server: enable detail logging, then build with enable_im_pretty_print=true. If you're using cmake, not ninja, for esp32 there's a CONFIG_IM_PRETTY_PRINT option. For other cmake platforms I am not sure how best to do this.

Alternately, we could add some new logging code for this, but let's see if we can build with the existing logging first.

We can reproduce this issue using Home, but we are unable to reproduce it through the CHIP-Tool.

That is ... very odd, if they are subscribing to the same sets of things. But I bet they are not.

Below is the debug-level log from our app that causes the device to reboot:

So this log shows a somewhat different stack compared to the one before. This one claims an abort at "src/app/InteractionModelEngine.cpp:953". On the SHA listed at the beginning of this issue, this is:

    VerifyOrDieWithMsg(didEvictHandler, DataManagement, "Failed to get required resources by evicting existing subscriptions.");

which seems much more likely to be what's going on: instead of a mysterious crash in an iterator we now have an assertion failure in some very complicated logic. I would not be too surprised if we hit that.

Whether we do will very much depend on the exact sets of subscriptions that are around and the paths they are for, and possibly the timing... But at least now we are getting somewhere.

we can provide you with a customized app to support ESP development boards

This would be the server app? I suspect that in practice the important part here is the actual pattern of subscriptions involved @zzz2016, so the server app would not be too useful, but I will keep the offer in mind if we are not getting anywhere via other means.

bzbarsky-apple commented 1 year ago

One thing that might also help is if on your server app in InteractionModelEngine::EnsureResourceForSubscription you could log the values these quantities end up with:

    const size_t pathPoolCapacity        = GetPathPoolCapacityForSubscriptions();
    const size_t readHandlerPoolCapacity = GetReadHandlerPoolCapacityForSubscriptions();
    const size_t attributePathCap = allowUnlimited ? SIZE_MAX : pathPoolCapacity;
    const size_t eventPathCap     = allowUnlimited ? SIZE_MAX : pathPoolCapacity;
    const size_t readHandlerCap   = allowUnlimited ? SIZE_MAX : readHandlerPoolCapacity;

that way I can try to configure things locally to behave similarly to that when I am trying to reproduce.

zzz2016 commented 1 year ago

This would be the server app? I suspect that in practice the important part here is the actual pattern of subscriptions involved

@bzbarsky-apple No, it's a client app. Using it, we can promptly replicate this issue by opening and closing the app approximately 10 times. We can provide you with either an iOS or Android app, depending on your preference.

Alternately, we could add some new logging code for this, but let's see if we can build with the existing logging first.

The log we provided is already enable the detail logging: img_v2_bf65e590-7a95-4959-9114-4762f3bfcf1g

bzbarsky-apple commented 1 year ago

No, it's a client app.

OK, but then it needs to run against what server app? Does just running it against all-clusters-app on esp32 reproduce the problem?

The log we provided is already enable the detail logging:

Well, the log is not showing it enabled. @dhrishi @shubhamdp why is that not showing up properly?

Note that you need to enable detail logging in Matter, debug logging on the esp32 side and the IM pretty print, to get it to show up...

zzz2016 commented 1 year ago

@bzbarsky-apple We were unable to reproduce this issue in the official ESP32 lighting demo. However, we were able to consistently reproduce this issue in our own server code, which is based on the lighting demo with slight modifications.

We are currently investigating the differences between them one by one. At this point, my intuition is that it might be caused by the addition of our custom cluster in the server.

We will provide an update after further investigation.

We wanted to share this information with you for reference.

zzz2016 commented 1 year ago

@bzbarsky-apple We successfully reproduced the issue in the code of LightingExample. We discovered that in the commit b661b726c00303a60eb97eab4741e69a140b6e74, a new file was introduced: connectedhomeip/examples/lighting-app/esp32/main/include/CHIPProjectConfig.h In this file, there is a macro definition that forcibly changes the maximum number of subscriptions to 45:

define CHIP_IM_MAX_NUM_SUBSCRIPTIONS 45

After changing this value to 15 and repeating the previously mentioned process, we were able to reproduce the issue.

zzz2016 commented 1 year ago

This is our client app, you can use it to reproduce the issue: https://testflight.apple.com/join/EC8pqXvN

yyccz123 commented 1 year ago

@jtung-apple Please help to check this issue. Currently, it is related to the recovery mechanism and should be a common problem. Thank you.

bzbarsky-apple commented 1 year ago

I will try to reproduce with one of the esp-32 example apps, but again: a log that actually shows payloads would be extremely helpful.

zzz2016 commented 1 year ago

@bzbarsky-apple Our previous inability to print the payload in the logs was due to an issue with the log system. After examining the commit history, we found that this log problem was fixed in June, while our project branch was based on the commit from May. As a result, the payload has not been printed all this time.

Now, we have successfully resolved the issue, and the payload is being correctly printed in the logs. We utilized the provided TestFlight App and Example (with modifications to address the previously mentioned maximum subscription quantity), and then we were able to reproduce the problem. Below is the specific log: p0_log-2023-07-21_11_04_21.log

bzbarsky-apple commented 1 year ago

@zzz2016 I'm sorry for the lag. I have been trying to get esp32 builds working again, but the python bits are just broken and I can't build anything useful so far... @shubhamdp @dhrishi

And I have not been able to reproduce using the "linux" lighting-app on Mac, even after setting CHIP_CONFIG_MAX_FABRICS to 5 in src/lib/core/CHIPConfig.h and setting CHIP_SYSTEM_CONFIG_POOL_USE_HEAP to 0 in src/platform/Darwin/SystemPlatformConfig.h.

That said, one thing really confuses me in that last log that was attached. The part right before the crash is:

[2023-07-21 11:14:38.919] D (9164) chip[DMG]: SubscribeRequestMessage =
[2023-07-21 11:14:38.957] D (9164) chip[DMG]: {
[2023-07-21 11:14:38.957] D (9174) chip[DMG]:   KeepSubscriptions = false, 
...
[2023-07-21 11:14:39.002] D (9244) chip[DMG]: Failed to get required resources by evicting existing subscriptions.
...
[2023-07-21 11:14:39.032] abort() was called at PC 0x420413ad on core 0

But if KeepSubscriptions is false, we should just evict everything immediately and there should be absolutely no issue with the resources.... but I see no logs with "Deleting previous subscription from NodeId" that would correspond to that eviction happening.

@zzz2016 Would you be able to add more logging in InteractionModelEngine::EnsureResourceForSubscription so we can see exactly which codepaths are being taken there and what values of various things are in the logs?

zzz2016 commented 1 year ago

We can add logs on our end, but it's not an efficient debugging method. Since we need to release our product, we temporarily worked around this issue by setting chip_persist_subscriptions to false.

bzbarsky-apple commented 1 year ago

We can add logs on our end, but it's not an efficient debugging method.

Sure. Even better would be just stepping through what's going on in a debugger, but that requires a reliable way to reproduce. I will keep trying to do that too, like you I have time constraints and can't afford in the near future to spend a week or so trying to figure out how to reproduce this at all....

What really concerns me is this bit:

if KeepSubscriptions is false, we should just evict everything immediately and there should be absolutely no issue with the resources.

I don't understand why that would possibly not work. Something is very fishy here.

bzbarsky-apple commented 1 year ago

OK, so I finally got an esp32 build working on SHA 963a859806b8b5c2c030599cf47f2311b7431a51. I:

  1. Built lighting-app, using idf.py build flash monitor in examples/lighting-app/esp32 after modifying examples/lighting-app/esp32/main/include/CHIPProjectConfig.h to set CHIP_IM_MAX_NUM_SUBSCRIPTIONS to 15.
  2. Flashed it to an m5stack.
  3. Commissioned it as node id 17.
  4. Ran chip-tool in interactive mode.
  5. Ran onoff subscribe on-off 0 200 17 1 --keepSubscriptions true 15 times.
  6. Rebooted the m5stack.
  7. Observed it do Resuming 15 subscriptions in 0 seconds per the log.
  8. Ran onoff subscribe on-off 0 200 17 1 --keepSubscriptions false

I see the "Deleting previous subscription" log a bunch of times, then the new subscription is set up properly....

zzz2016 commented 1 year ago

@bzbarsky-apple You can try using our App to attempt to reproduce the issue: https://testflight.apple.com/join/EC8pqXvN We are using the system-integrated Matter SDK.

bzbarsky-apple commented 1 year ago

@zzz2016 I am unfortunately not in a position where I can install that app on any device I have access to in the near future....

zzz2016 commented 1 year ago

This is too troublesome. We are also unable to reproduce it using ChipTool. We suspect it's due to the differences between MobileSDK and ChipTool.

bzbarsky-apple commented 1 year ago

We are also unable to reproduce it using ChipTool.

That is actually really useful information... I'll see what I can do about trying to reproduce this, then, but I am on vacation for ~2 weeks starting Monday, so probably will not get to this until I get back and dig out from whatever emergencies came up. @jtung-apple in the meantime, if you find time for this please feel free to look into it!