Closed · clalancette closed this 2 months ago
I should note that this was on the advice of the Zenoh developers in Discord.
OK, so I've now updated this PR to include all of the fixes necessary to fix our use-after-free bug here. I've also updated the title and the initial description to explain what all of this is doing. And there is another comment in the code explaining why we need all of this. Please take another look and let me know what you think.
Wow, this is quite the tricky problem. Thanks for finding a suitable fix.
Yeah, agreed. This one is tricky. Let's hold off on merging this one for a little bit to give the Zenoh developers a chance to take a look and comment.
> `num_in_flight_` should be decremented in the drop function of the `z_owned_closure_reply_t`.
Thanks. This was a very helpful comment; I didn't realize that `drop` was one of the parameters to the closure creation. That is absolutely the right place to free the memory.

However, looking at this closer leads me to another question. As you can see in this PR, we create a new closure and give ownership of it to Zenoh via `z_get` for every query:

```cpp
z_owned_closure_reply_t zn_closure_reply =
  z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
z_get(
  z_loan(context_impl->session),
  z_loan(client_data->keyexpr), "",
  z_move(zn_closure_reply),
  &opts);
```
(this is more-or-less required by the `z_get` API). But the documentation in `zenoh-c` about closures says this:
```
* It is guaranteed that:
*
* - `call` will never be called once `drop` has started.
* - `drop` will only be called **once**, and **after every** `call` has ended.
* - The two previous guarantees imply that `call` and `drop` are never called concurrently.
```
That suggests to me that it is possible to use the same closure in multiple `z_get` calls. Is that correct? Would it be better for us to create the closure when we initialize the client (`rmw_create_client`), and then reuse it for every `z_get` with this client (`rmw_send_request`)? If so, what would that look like, since the API seems to require us to `z_move` the closure?
I believe that what creates the confusion here is that in Zenoh a single `z_get` can lead to 0, 1, or multiple replies. This results in 0, 1, or multiple calls to `call`, followed by exactly one call to `drop` to indicate to the user that there are no more replies for this single `z_get`. With that in mind, the documentation does not necessarily suggest that the same closure can be used in multiple `z_get` calls.
So I believe that you indeed should not reuse the same closure for multiple `z_get` calls. I'll double-check that and come back to you if it turns out otherwise, but for now you can go with this assumption.
> So I believe that you indeed should not reuse the same closure for multiple `z_get`. I'll double check that and come back to you if otherwise but for now you can go with this assumption.
Fantastic, thank you for that explanation. I'll continue with that assumption, then.
All right, I believe I responded to all comments now. Please take another look when you get a chance.
It turns out that in Zenoh, once a query is in flight (via the `z_get` zenoh-c API), there is no way to cancel it. With that in mind, consider a sequence of events:

1. `rmw_send_request` is called, which calls `z_get()` and hands `client_data` to the Zenoh closure.
2. Before any reply arrives, `rmw_destroy_client` is called on that client and frees its resources.

If all of that happens before the query callback is called, we'll have freed the structure that we passed to the Zenoh closure (`client_data`), and thus the callback will access already-freed memory.
The fix for this is relatively complicated. First of all, any time there is a call to `rmw_send_request` (which calls `z_get()`), it increments a per-client counter. `rmw_destroy_client()` checks to see if that counter is 0. If it is, we know both that there are no queries in flight and that there will be no new ones (because we are cleaning the client up), so the structure is freed directly. If the counter is greater than 0, then there is at least one query in flight and it is not safe to free the structure. However, the `client_data` structure is marked as "shutdown" at this point. There will not be any new requests for this client, but the in-flight ones still need to be dealt with.
For the in-flight ones, every time `client_data_handler()` (the `rmw_zenoh_cpp` callback for queries) is called, it first decrements the number of in-flight queries. If the client is shut down and the number of in-flight queries is 0, then it is safe to free the `client_data` structure. If the client is shut down but there are other queries in flight, no actual work is done except for the decrement.
There is one case which is not handled here at all, and that has to do with timeouts. Currently the `rmw_zenoh_cpp` client query timeout is set to essentially infinite. Thus, if a query is in flight but never returns, the memory corresponding to that client will be leaked. However, this is already an existing problem; this patch changes it from undefined behavior (a use-after-free) to a memory leak.
In my testing, this fixes the problems pointed out in #186