project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.43k stars 1.99k forks source link

Strange buffer pointers in LwIPReceiveUDPMessage on ESP32 #19393

Open kpschoedel opened 2 years ago

kpschoedel commented 2 years ago

Problem

Derived from #19289 — added some pbuf logging on master and ran all-clusters-app on M5Stack (patch and sample log attached).

Every pbuf handed to LwIPReceiveUDPMessage() seems to have a bad payload pointer, outside the buffer. The pointers vary from run to run but are somewhat consistent within a run; for the same pbuf, the payload pointer is often (but not always) the same. The payload pointers always seem to be moderately close to the pbuf (within 32K), but seem to never be within the buffer.

Examples extracted from the attached log —

      pbuf pointer   pbuf->payload, offset len         AllocSize
      -------------- --------------------- ----------- -----------
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x0024 asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x0028 asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x0046 asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x0056 asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x005E asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x00D8 asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x017A asz: 0x05D8
UDP   pb: 0x3fff07c0 pay: 0x3fff099c   476 len: 0x017B asz: 0x05D8
UDP   pb: 0x3fff2ec4 pay: 0x3fff348c  1480 len: 0x0028 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x0024 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x0028 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x0046 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x005E asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x00A9 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x00D6 asz: 0x05D8
UDP < pb: 0x3fff23d0 pay: 0x3fff0f8c -5188 len: 0x017A asz: 0x05D8
UDP < pb: 0x3fff2eb8 pay: 0x3fff0f8e -7978 len: 0x001A asz: 0x05D8
UDP > pb: 0x3ffee54c pay: 0x3fff1122 11222 len: 0x00B1 asz: 0x05D8
UDP > pb: 0x3fff0224 pay: 0x3fff3e06 15330 len: 0x0033 asz: 0x05D8
UDP > pb: 0x3fff0238 pay: 0x3fff0f8e  3414 len: 0x02E0 asz: 0x05D8
UDP > pb: 0x3fff0238 pay: 0x3fff5446 21006 len: 0x0033 asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff0f8c  2472 len: 0x004D asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff0f8c  2472 len: 0x005E asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff0f8c  2472 len: 0x017A asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff3762 12670 len: 0x0022 asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff3762 12670 len: 0x002A asz: 0x05D8
UDP > pb: 0x3fff05e4 pay: 0x3fff3d56 14194 len: 0x02E0 asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x0024 asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x00A4 asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x00A9 asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x00D6 asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x017A asz: 0x05D8
UDP > pb: 0x3fff07fc pay: 0x3fff1040  2116 len: 0x017B asz: 0x05D8
UDP > pb: 0x3fff0818 pay: 0x3fff348c 11380 len: 0x0046 asz: 0x05D8
UDP > pb: 0x3fff0818 pay: 0x3fff348c 11380 len: 0x00D6 asz: 0x05D8
UDP > pb: 0x3fff194c pay: 0x3fff3b32  8678 len: 0x0033 asz: 0x05D8
UDP > pb: 0x3fff194c pay: 0x3fff43fa 10926 len: 0x02E0 asz: 0x05D8
UDP > pb: 0x3fff1960 pay: 0x3fff3b30  8656 len: 0x004D asz: 0x05D8
UDP > pb: 0x3fff23d0 pay: 0x3fff348c  4284 len: 0x00D8 asz: 0x05D8
UDP > pb: 0x3fff23d0 pay: 0x3fff348e  4286 len: 0x0022 asz: 0x05D8
UDP > pb: 0x3fff2c0c pay: 0x3fff47ae  7074 len: 0x02E0 asz: 0x05D8
UDP > pb: 0x3fff2cec pay: 0x3fff41d4  5352 len: 0x0090 asz: 0x05D8
UDP > pb: 0x3fff2d00 pay: 0x3fff3b30  3632 len: 0x00D6 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff348c  1480 len: 0x0090 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff348c  1480 len: 0x00A4 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff348c  1480 len: 0x00D6 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff348c  1480 len: 0x017B asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff3b32  3182 len: 0x0033 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff3e06  3906 len: 0x0033 asz: 0x05D8
UDP > pb: 0x3fff2ec4 pay: 0x3fff43fa  5430 len: 0x0033 asz: 0x05D8

0001-LOG.patch.txt log.chip-all-clusters-app.20220608181346.txt

kpschoedel commented 2 years ago

@dhrishi

tcarmelveilleux commented 2 years ago

The payload pointer may be outside the pbuf depending how it's allocated. You'd have to trace the path for incoming packets

kpschoedel commented 2 years ago

The payload pointer may be outside the pbuf depending how it's allocated. You'd have to trace the path for incoming packets

It looks like that is the case; the incoming buffers are PBUF_REF with the payload somewhere else.

The current PacketBuffer code, and the LwIP receive path, fundamentally assume that the pbuf and payload are part of one contiguous block, and use the space in betwwn, so this isn't going to work.

bzbarsky-apple commented 2 years ago

Oh, wow. So the fact that it ever worked at all was basically just luck? We were stomping on parts of the pbuf pool, but they happened to be ok to stomp on, by accident???

kpschoedel commented 2 years ago

I think the reason it ‘works’ is that LwIP reads the whole low-level frame into a buffer, and then consumes up through the IP header, and returns a payload pointer to that point. So when the code reuses that space, by accessing it relative to the payload pointer, it works. #19244 tried accessing it relative to the header, and crashed.

EnsureReservedSize() always claims success, because ReservedSize() casts the pointer difference to unsigned, so everything looks like a large reserve.

I don't think there's any hope of replacing all the ‘reserve’ code and uses for 1.0, so I think the path forward is:

bzbarsky-apple commented 2 years ago

So for incoming pbufs, the only thing we use the reserve stuff for is the packetinfo, right? I guess the problem is how we send through the packetbuffer and the packetinfo to the right Matter thread. Seems like the only options are:

  1. We do what you suggest above, send the packetinfo in the reserved space. This probably requires LwIP to do a copy of the data, right?
  2. We keep doing what we do now, with static asserts that a packetinfo is always smaller than the IP header .... Seems dubious.
  3. We allocate (heap? pool?) a struct that has the packetinfo and the pointer to the pbuf, and then pass that to the Matter thread. That avoids the pbuf copy but needs the extra allocation... which is likely smaller than the space we would need for the pbuf copy. In this setup, PacketBuffer::New can still check PBUF_TYPE_FLAG_STRUCT_DATA_CONTIGUOUS, because I think it should hold there....
kpschoedel commented 2 years ago
  1. We do what you suggest above, send the packetinfo in the reserved space. This probably requires LwIP to do a copy of the data, right?

That may be so, though I don't know LwIP internals. (I speculate that it may be using the separate-payload model because some hardware may make a page-aligned payload advantageous.)

  1. We keep doing what we do now, with static asserts that a packetinfo is always smaller than the IP header .... Seems dubious.
  2. We allocate (heap? pool?) a struct that has the packetinfo and the pointer to the pbuf, and then pass that to the Matter thread.

Yes, @tcarmelveilleux suggested similar in #17213. But I naïvely thought that just fixing the alignment would be enough for now…

If the reserve concept doesn't get ripped out entirely (which would definitely simplify PacketBuffer), then ReservedSize() should also check for the CONTIGUOUS flag and return zero otherwise.

bzbarsky-apple commented 2 years ago

If the reserve concept doesn't get ripped out entirely

I think we use it quite a bit on the save path to pre-reserve space for our headers....

then ReservedSize() should also check for the CONTIGUOUS flag and return zero otherwise.

Yes, that would make a lot of sense.

gjc13 commented 2 years ago

I checked how IPPacketInfo class is used in UDPEndPointImplLwIP.cpp. The packet info is copied in UDPEndPointImplLwIP::HandleDataReceived.

We can modify the signature to void HandleDataReceived(chip::System::PacketBufferHandle && aBuffer, const PacketInfo &aPacketInfo); and pass the packet info explicitly in the lambda.

Code will be like:

  364     IPPacketInfo pktInfo;
  365     pktInfo->SrcAddress  = IPAddress(*addr);
  366     pktInfo->DestAddress = IPAddress(*ip_current_dest_addr());
  367     pktInfo->Interface   = InterfaceId(ip_current_netif());
  368     pktInfo->SrcPort     = port;
  369     pktInfo->DestPort    = pcb->local_port;
  370 
  371     ep->Retain();
  372     CHIP_ERROR err = ep->GetSystemLayer().ScheduleLambda([ep, p = System::LwIPPacketBufferView::UnsafeGetLwIPpbuf(buf), pktInfo] {
  373         ep->HandleDataReceived(System::PacketBufferHandle::Adopt(p), pktInfo);
  374         ep->Release();
  375     }); 

Will this eliminate the need to access the IP and link layer header of the pbuf? Doing such thing is always dangerous.

bzbarsky-apple commented 2 years ago

That's basically sweeping the heap-allocation of the IPPacketInfo under the rug into the lambda allocation, right? I would rather make any sort of sizeable allocations explicit, so if they fail we gracefully return error instead of throwing uncaught exceptions or whatever happens if we fail to allocate the lambda....

gjc13 commented 2 years ago

@kpschoedel In the SVE test the issue showed up again in group message tests.

In SessionManager.cpp we are cloning the packet buffer:

659     while (!decrypted && iter->Next(groupContext))
660     {
661         // Optimization to reduce number of decryption attempts
662         if (groupId != groupContext.group_id)
663         {
664             continue;
665         }
666         msgCopy = msg.CloneData();
667         CryptoContext::NonceStorage nonce;
668         CryptoContext::BuildNonce(nonce, packetHeader.GetSecurityFlags(), packetHeader.GetMessageCounter(),
669                                   packetHeader.GetSourceNodeId().Value());
670         decrypted = (CHIP_NO_ERROR ==
671                      SecureMessageCodec::Decrypt(CryptoContext(groupContext.key), nonce, payloadHeader, packetHeader, msgCopy));
672     }

The msg.CloneData() fails because the pbuf does not fit in the packet buffer's memory model at all. Raised https://github.com/project-chip/connectedhomeip/pull/20923 as a quick fix.