Improve robustness of polygon fetch by static polygon filter

erwinbonsmatopic commented 2 years ago

During integration tests using the new static polygon filter we faced problems with time outs when listening for the required transform.

This PR fixes that by, firstly, not putting restrictions on the time of the requested transform (it is static anyway), and secondly, increasing the hard-coded timeout limit to sixty seconds. This should suffice. In our testing it required at most ten seconds.

YannickdeHoop commented 2 years ago

ping @jonbinney

jonbinney commented 2 years ago

Sorry it took me a while to get to this! I'd like to understand a bit more about why these changes are needed on your system. A couple of thoughts:

if the transform is truly static (published on tf_static) then the timestamp for the transform lookup shouldn't matter.... right? I don't understand why changing the timestamp to Time(0) would help anything.
10 seconds is a reallly really long time. Is this happening only at startup? On most systems, if a static transform isn't ready after a few seconds, it should indicate that something is wrong. Are you using the standard robot_state_publisher? How loaded is your CPU and memory?

erwinbonsmatopic commented 2 years ago

10 seconds is a reallly really long time. Is this happening only at startup? On most systems, if a static transform isn't ready after a few seconds, it should indicate that something is wrong. Are you using the standard robot_state_publisher? How loaded is your CPU and memory?

The large time-out is indeed a work-around for a problem elsewhere. We have a experimental ROS configuration to explore performance. The performance varies widely on different systems. Normally the transform is obtained with 100ms but one system requires a time-out of fifteen seconds. If the timeout is less, e.g. five seconds, the transform publisher gets into a state where it consumes 100% CPU and becomes very sluggish (it can still accept new subscribers, but takes long to respond). See the CPU usage plot below. Note, the same set-up runs fine on a system that has the same processor architecture (ARM) and less computational power. It appears there is a bug in the ros_comm pub-sub infrastructure, but that we just started to analyze. Hopefully we can identify the root cause (and provide a fix).

TransformPublishFailure

Meanwhile, if there is a hardcoded time-out, it should be large for this code to be (more) robust to failures elsewhere. Alternatively, we can make it configurable (which is actually what we are using in our experiment set-up)

if the transform is truly static (published on tf_static) then the timestamp for the transform lookup shouldn't matter.... right? I don't understand why changing the timestamp to Time(0) would help anything.

The Time(0) ensures that whenever a message is received, it is never rejected. The problem with the previous code was that it can reject messages. This should not happen in happy scenarios, but can happen in situations where the subscription takes longer than expected. In this case, we have seen it receive a transform but reject it. As the transform is static, there is no need for this. So changing time to Time(0) makes this code more robust.

jonbinney commented 2 years ago

Thanks for the detailed explanation. A couple thoughts:

I think I understand why the Time(0) helps you now. Unfortunately there are robots with laser scanners on actuated joints (for example the upper hokoyu on the PR2). Using the correct timestamp ensures that those robots use the correct transform when filtering.
Making the timeout extremely long makes the system more robust in some cases like yours, but also makes errors take longer to spot for many other robots. Consider the cases where a robot normally starts publishing transforms within a few seconds (most robots i think). If something is really wrong, like a joint not providing updated positions because of a broken encoder, it will now take 60 seconds for the laser filter to give up. If the system has a problem, I'd like it to notice as quickly as possible.

How about a special case for startup? For the first N seconds, if the filter fails to find a transform for a laser scan, it doesn't publish a filtered scan, and it outputs a warning to ros logging. After N seconds, the filter logs an error and exits. I'd still like N to be 20 seconds or less.

erwinbonsmatopic commented 2 years ago

Thanks for your feedback, Jon.

An important thing to note though, this code is for the static polygon filter, where the transform does not change and is obtained only once. Your example with actuated joints therefore does not apply. For this, the other polygon filter should be used (defined in the same file), which continuously receives polygon messages.

So the static filter only has the start-up use-case. Once it has obtained the transform it unsubscribes (to avoid the overhead of these incoming transform messages, which was noticeable).

Your point is valid that the currently hard-coded time-out of sixty seconds is too long is valid and unnecessarily delays feedback in case of errors. I therefore propose to make the time-out configurable and reduce the default to five seconds (this will in exceptional cases not suffice, but can then be increased).

erwinbonsmatopic commented 2 years ago

I have extended the PR to reduce the default time-out and make it configurable. Also, a minor change in the header to fix a build warning (a regression introduced in 35a748eb).

jonbinney commented 2 years ago

Good point about this being the static version :-) Just a few logging changes from the code review, then ready to merge.

erwinbonsmatopic commented 2 years ago

I also made some minor formatting changes so that formatting is consistent.

This caused a minor conflict with noetic-devel branch, which I resolved by merging noetic-devel into this branch. Let me know if you would instead prefer me to rebase this branch

jonbinney commented 2 years ago

Did you mean to commit the dockerfile?

erwinbonsmatopic commented 2 years ago

The Dockerfile should not be added. That was a mistake. I will remove it. How do you prefer I do this? Via a separate commit, or by changing the commit that introduced it?

jonbinney commented 2 years ago

Go ahead and modify the commit - thanks for asking!

jonbinney commented 2 years ago

Looks good! @erwinbonsmatopic is this ready to merge?

erwinbonsmatopic commented 2 years ago

@jonbinney Yes, this is ready to merge. I performed a few integration tests today and that also went fine. Thanks

ros-perception / laser_filters

Improve robustness of polygon fetch by static polygon filter #155