Closed · clalancette closed this 2 months ago
I should note that this was on the advice of the Zenoh developers in Discord.
OK, so I've now updated this PR to include all of the fixes necessary to fix our use-after-free bug here. I've also updated the title and the initial description to explain what all of this is doing. And there is another comment in the code explaining why we need all of this. Please take another look and let me know what you think.
Wow, this is quite the tricky problem. Thanks for finding a suitable fix.
Yeah, agreed. This one is tricky. Let's hold off on merging this one for a little bit to give the Zenoh developers a chance to take a look and comment.
> `num_in_flight_` should be decremented in the drop function of the `z_owned_closure_reply_t`.
Thanks. This was a very helpful comment; I didn't realize that `drop` was one of the parameters to the closure creation. That is absolutely the right place to free the memory.

However, looking at this closer leads me to another question. As you can see in this PR, we create a new closure and give ownership of it to Zenoh via `z_get` for every query:

```cpp
z_owned_closure_reply_t zn_closure_reply =
  z_closure(rmw_zenoh_cpp::client_data_handler, nullptr, client_data);
z_get(
  z_loan(context_impl->session),
  z_loan(client_data->keyexpr), "",
  z_move(zn_closure_reply),
  &opts);
```
(this is more-or-less required by the `z_get` API). But the documentation in `zenoh-c` about closures says this:
```
* It is guaranteed that:
*
* - `call` will never be called once `drop` has started.
* - `drop` will only be called **once**, and **after every** `call` has ended.
* - The two previous guarantees imply that `call` and `drop` are never called concurrently.
```
That suggests to me that it is possible to use the same closure in multiple `z_get` calls. Is that correct? Would it be better for us to create the closure when we initialize the client (`rmw_create_client`), and then reuse it for every `z_get` with this client (`rmw_send_request`)? If so, what would that look like, since the API seems to require us to `z_move` the closure?
I believe that what creates the confusion here is that in Zenoh a single `z_get` can lead to 0, 1, or multiple replies. This results in 0, 1, or multiple calls to `call`, followed by exactly one call to `drop` to indicate to the user that there are no more replies for this single `z_get`. With that in mind, the documentation does not necessarily suggest that the same closure can be used in multiple `z_get` calls.
So I believe that you indeed should not reuse the same closure for multiple `z_get` calls. I'll double-check that and come back to you if it turns out otherwise, but for now you can go with this assumption.
> So I believe that you indeed should not reuse the same closure for multiple `z_get`. I'll double check that and come back to you if otherwise but for now you can go with this assumption.
Fantastic, thank you for that explanation. I'll continue with that assumption, then.
All right, I believe I responded to all comments now. Please take another look when you get a chance.
It turns out that in Zenoh, once a query is in flight (via the `z_get` zenoh-c API), there is no way to cancel it. With that in mind, consider a sequence of events:

1. `rmw_send_request` is called, which calls `z_get()` and hands `client_data` to the Zenoh closure.
2. Before any reply arrives, `rmw_destroy_client` is called on that client and frees its resources.

If all of that happens before the query callback is called, we'll have freed the structure that we passed to the Zenoh closure (`client_data`), and thus the callback will access already-freed memory.
The fix for this is relatively complicated. First of all, any time there is a call to `rmw_send_request` (which calls `z_get()`), it increments a per-client counter. `rmw_destroy_client()` checks to see if that counter is 0. If it is, we know both that there are no queries in flight and that there will be no new ones (because we are cleaning the client up), so the structure is freed directly. If the counter is greater than 0, then there is at least one query in flight and it is not safe to free the structure. However, the `client_data` structure is marked as "shutdown" at this point. There will not be any new requests for this client, but the in-flight ones still need to be dealt with.
For the in-flight ones, every time `client_data_handler()` (the `rmw_zenoh_cpp` callback for queries) is called, it first decrements the number of in-flight queries. If the client is shut down and the number of in-flight queries is 0, then it is safe to free the `client_data` structure. If the client is shut down but there are other queries in flight, no actual work is done except for the decrement.
There is one case which is not handled here at all, and that has to do with timeouts. Currently the `rmw_zenoh_cpp` client query timeout is set to essentially infinite. Thus, if a query is in flight but never returns, the memory corresponding to that client will be leaked. However, this is already an existing problem; this patch changes it from undefined behavior (a use-after-free) to a memory leak.
In my testing, this fixes the problems pointed out in #186