u-blox / ubxlib

Portable C libraries which provide APIs to build applications with u-blox products and services. Delivered as add-on to existing microcontroller and RTOS SDKs.
Apache License 2.0
303 stars 92 forks source link

TLS context leak #184

Closed FloriVeit closed 8 months ago

FloriVeit commented 8 months ago

We are currently in the process of closing sockets that have a TLS context. In our closing function, both uCellSockClose and the cleanup function uSockCleanUp are invoked. However, the TLS context remains in the gpcontextlist. It cannot be freed from the asynchronous callback closedCallback because the socket information has already been cleared by the cleanup. Furthermore, the cleanup does not clear the TLS context. Over time, the list becomes full and no socket with a TLS context can be created. This leads to an ‘out of memory’ error. The question then is whether we should use this cleanup, and if so, when would be the appropriate time to do so?

RobMeades commented 8 months ago

Hi there, thanks for posting and sorry your having problems with this.

Just to clarify the sequence, I guess you're following the kind of pattern used in the example main_tls.c, which is as follows:

[...noting that in this pattern uCellSockClose() will be called by the uSockClose() function (i.e. it wouldn't be called directly)].

Is that a correct understanding of your flow, and in doing this the memory for the TLS socket is not being marked as free?

RobMeades commented 8 months ago

Supplementary question: which cellular module type is this being run with?

FloriVeit commented 8 months ago

Good morning,

Thank you for your quick response. Indeed, you are correct. My apologies for the oversight; we naturally call uSockClose, not uCellSockClose. We are utilizing the SARA R422-M8S modem.

When we encounter issues with the server, the TLS context is not eliminated from the gpContextList. The Socket from the gStaticContainers has already been freed and removed from the list with uSockCleanUp. However, the closedCallback (from u_Sock.c) gets called later and pContainer is then NULL.

I have now attempted without the uSockCleanUp, and everything was cleaned up accordingly.

RobMeades commented 8 months ago

Excellent: just to be clear then, I think the issue was as follows (please correct me if I'm wrong):

If this is correct I can update the documentation of the uSockCleanUp() function to point out the loop-hole. Note that if you happen to be opening no more than U_SOCK_NUM_STATIC_SOCKETS (7) sockets, uSockCleanUp() won't actually do anything anyway, since no malloc() will have been needed.

ghost commented 8 months ago

Hey @RobMeades, just wanted to let you know that Florian (@FloriVeit) is my successor and will probably report more issues in the future 😄

Your description is right, that's the problem we observed.

I have two questions regarding this issue:

  1. Is it right that we don't need to call uSockCleanUp() at all? If I remember correctly, SARA R422 only supports 7 sockets anyway, so I guess we shouldn't even be able to exceed U_SOCK_NUM_STATIC_SOCKETS, right?
  2. I understand that calling uSockCleanUp() before the socket was fully closed on the modem side might be undefined behavior, but I wonder if that behavior of uSockCleanUp() makes sense in this case. Wouldn't it be better to just free the TLS context in uSockCleanUp() to avoid theleak? In its current state it's pretty hard for a complex application to safely use uSockCleanUp(), e.g. when multiple threads on a system use the socket layer.

Thanks for your quick answers as always!

RobMeades commented 8 months ago

Oh, hi @johannesmay, I've got the context now.

I was probably over-thinking when I introduced uSockCleanUp() in the first place: you are correct that, for the cellular case at least, we have U_SOCK_NUM_STATIC_SOCKETS set to the maximum number of sockets that the current set of cellular modules support (to remove the risk of malloc()ed/free()ed memory being referenced); I just wanted to be generic for future cellular modules, or maybe short-range modules (which uses the same code for Wi-Fi-based sockets, and I had heard that there was an extension to BLE in the wings which would do sockets) that might want more simultaneous connections.

But I can describe this in the function header of uSockCleanUp() also, so that people don't have to read the code to understand what's going on.

On the "when to free" topic, I'm basically sh*t scared of asynchronicity holes; bastards to find/debug, so I always err on the side of leaving some RAM hanging, rather than free()ing: the way things are coded [or at least, the intention of the coder :-)] is that any memory that has been allocated and not yet free()ed, when more stuff is required, will be re-used in that new stuff; it doesn't "build up", if you like, it is just an extending pool, so calling uSockCleanUp() would likely be something you do at end of day, when your application is done with comms, that kind of thing. I can explain this in the function header also.

ghost commented 8 months ago

On the "when to free" topic, I'm basically sh*t scared of asynchronicity holes; bastards to find/debug, so I always err on the side of leaving some RAM hanging, rather than free()ing: the way things are coded [or at least, the intention of the coder :-)] is that any memory that has been allocated and not yet free()ed, when more stuff is required, will be re-used in that new stuff; it doesn't "build up", if you like, it is just an extending pool, so calling uSockCleanUp() would likely be something you do at end of day, when your application is done with comms, that kind of thing. I can explain this in the function header also.

Sure, that makes sense. You basically have a kind of vector-linked-list hybrid where nodes in the list may be empty and reused later on - so far so good. But I think what we are seeing is not that - in pNewContext() in u_cell_sec_tls.c you allocate the cellular TLS context. The pointers to all TLS contexts are maintained using the static gpContextList. Our issue is that after causing that leak using uSockCleanUp() the gpContextList ends up with lingering contexts that are not referenced by any sockets anymore. The issue is that once that list is full, no more sockets can enable TLS, because pNewContext() will return NULL and thus pUCellSecSecTlsAdd() returns U_ERROR_COMMON_NO_MEMORY.

That was the root cause that triggered our investigation - after a few reconnections from our side, no connection is possible at all without restarting the device. I would have expected uSockCleanup() to free the TLS context for static sockets it is setting to the CLOSED state here: https://github.com/u-blox/ubxlib/blob/master/common/sock/src/u_sock.c#L1499-L1504

RobMeades commented 8 months ago

Hmm, yes, I think I'm conflating socket free()ing with TLS free()ing. Lemme take a little while to understand my own code...

RobMeades commented 8 months ago

Apologies for the delay: I have made the changes to hopefully free the security context if uSockCleanUp() happens to be called while a secured socket is in the "closing but not yet closed" state. I have pushed a preview branch of the change here:

https://github.com/u-blox/ubxlib/tree/preview_sock_sec_free_rmea

Would you be able to take a look and confirm that the change is as you expect? If you are OK with it then I will push the change to the master branch here and delete the preview branch.

FloriVeit commented 8 months ago

I tested it, and it works flawlessly to fix the problem. Thanks for the fix. Unfortunately, we don’t have C++ support in our project. With the u_geofence_geodesic.cpp, we can’t build our project anymore. I have opened a separate issue for this.

RobMeades commented 8 months ago

Excellent, thanks, once this is approved internally I will get it pushed to master here and delete the preview branch.

Will also address the C++ thingy...

RobMeades commented 8 months ago

The fix is now pushed to master here in commit 611f053baff238dc0c5ae3fa7e9019d949f38590.