ClearEntireCostmap returns before all costmap layer include valid data

schupf2 commented 2 weeks ago

Bug report

Required Info:

Operating System:
- Ubuntu 22.04.5 LTS
ROS2 Version:
- Humble binaries
Version or commit hash:
- 11.1.13-0jammy
DDS implementation:
- Cyclonedds

Steps to reproduce issue

When working with a costmap with more then one different layers (in our case, we use ["static_layer", "obstacle_layer", "inflation_layer"]), we face a problem after clearing the costmap: The the nav2_msgs::srv::ClearEntireCostmap service returns after the costmap is cleared. However, at this point no scanner measurements are considered in the obstacle layer. Thus, the space which is not considered in the static layer is marked as free for a short time until the next scanner measurement is included in the obstacle layer.

Expected behavior

There should be an option to either enforce waiting on information in all layers being present before returning from the ClearEntireCostmap service call, or some option to verify the presence of information on all layers.

Actual behavior

After the ClearEntireCostmap service call the costmap only shows values form that static layer and no scanner measurements are yet considered in the obstacle layer. Therefore, obstacles in front of the robot are ignored for a brief time.

SteveMacenski commented 2 weeks ago

That is why there is a ‘isCurrent’ API that is checked before planning or control is allowed to make sure there is new data before the cost map is allowed to be used. Why the cost map might be in a state without valid data after a clear, it is not used until the data is again valid for all layers. Does that make sense?

schupf2 commented 2 weeks ago

Thank you for the quick answer - that definitely makes sense :) . However, if one uses the the collision checker directly (and not the whole controller) - the check is not performed and one needs to get a hold of the LayeredCostmap or the Costmap2DROS in order to be able to perform the check. So maybe a service (similar to e.g., ClearEntireCostmap) that waits for isCurrent to evaluate to true, might be a nice feature.

SteveMacenski commented 2 weeks ago

A collision checker though would be used inside of a controller/planner plugin, so if the controller / planner is not being utilized until data is valid yet, then there is no issue :-) There shouldn't be any calls to costmap data structures by the task server's algorithms until isCurrent is true

SteveMacenski commented 1 week ago

Are you saying to use the costmap topic collision detector object? That is a good point - I think though we could adjust how the costmap publishing is done so that we don't publish a new costmap until its current again. I don't think blocking the clear costmap service is a good idea since the costmap update rate is user-defined so it could be very slow. We could not update the published costmap topics as well as blocking the algorithm's use of the costmap until isCurrent() == true.

Is that what you mean / work for you?

schupf2 commented 1 week ago

Thank your for your quick replies.

Yes exactly - I was talking about the CostmapTopicCollisionChecker.

Yes that would be amazing and definitely solve our problem. I'm not sure if it could cause any problems for other existing users though. Maybe we could also introduce a parameter to determine if the use of the costmap should be blocked until isCurrent = true. However, in the long run I think the behavior you suggested would definitely be desirable.

SteveMacenski commented 1 week ago

@schupf2 I looked over the details and it should be as easy as adding isCurrent() to the condition here: https://github.com/ros-navigation/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L542-L544

As long as we keep the updateBounds calls in place even when not current, then once it becomes current then we should update the lot of it.

Would you be open to testing that out to make sure that resolves your concern and then submitting the PR? :-)

schupf2 commented 6 days ago

Amazing thank you. Definitely, I'll do that on Monday :)

schupf2 commented 3 days ago

I implemented the discussed change today and tested it. Unfortunately, it still doesn't solve the problem.

I discovered that the isCurrent() of the obstacleLayer briefly returns true after resetting the costmap. This happens due to the following two parts in the code:

In the ObstacleLayer::resetBuffersLastUpdated(), the last_updated_ of the individual observation buffers is reset to the current time. Thus, right after resetting the individual buffers will return true on isCurrent(), even if no data has been received.
Furthermore, in the ObstacleLayer::updateCosts (precisely here: https://github.com/ros-navigation/navigation2/blob/fbd1d3e7964dc220c0861d5e8bdcfcc7b9ec2812/nav2_costmap_2d/plugins/obstacle_layer.cpp#L542-544) current_ is also set to true in the first update after the reset. However, this does not at all verify that data is present / that the obstacle buffers are current.

SteveMacenski commented 3 days ago

Ah - I didn't catch that in my initial analysis.

I believe we set that to true after clearing since at that point, we've called updateMap for the updateBounds and updateCosts to have been triggered. Whether the buffers contain new data or not, from the main costmap's perspective, the layers have performed a complete update (just without promise that new sensor data has come in in the meantime). When using the sensor timeout parameter on the observation source, if an update hasn't come in in a sufficiently long time based on the sensor's update frequency, then we set current = false since there's something wrong with a sensor. Thus, for just 1 update without new sensor data, that is seemingly a valid and complete update to consider current, in general.

So from that perspective, this feels correct - we did successfully update the costmap layer, it just didn't contain a promise that data was processed yet. Though, if using a static layer, we know a map and inflation are now present on a single update. However the nature of "current" seems insufficient for the case where we (1) want to know if we performed a valid update and (2) the costmap layer actually had data that iteration as well. If you had a 5 hz publishing sensor and a 10 hz updating costmap, not each costmap update would have a new piece of sensor data and that is an expected behavior. Or a situation with multiple observations in a layer and only some subset of them are updated in the first iteration.

That doesn't address our problem here, though. I'd be open to suggestions. A couple of initial thoughts:

We have more than just isCurrent to know if all layers were actually updated with new data -- though a little problematic to decipher the outputs since some layers like the StaticLayer technically have updates, but are not frequently updated that you wouldn't want to block for -- but ObstacleLayer may be. Adding an isUpdated() API and how we process the output of that seems tricky to get right if its blocking planning/controlling.
Have some delay when a clearing costmap operation happens before the costmap collision checker object uses the costmap -- though that seems very fragile / hacky and would result in less responsive behaviors that use the costmap collision checker object. Or perhaps have an isUpdated API that blocks publication of a new costmap until all layers have a complete update. This would work, however it would just block any behavior relying on the costmap collision checker object for some time after a clearing. But, I suppose that is kind of the point, isn't it?
Clearing costmap service block until data comes in, but I think that mimics the same issues in my first thought since its hard to tell which layers "should" have updates & blocking the service call could create other data congestion issues

It seems like we'd need to make some promise about the fact that each observation source in each data processing costmap layer has not only been called to update, but that update contained new current data. And, do so in a general purpose way that accounts for the fact that some data sources come in faster / slower, regular or irregular. And also if that's really necessary in the way that we're thinking about it right now.

Any thoughts?

schupf2 commented 2 days ago

Thanks for your detailed answer :).

Yes, I agree that it does make sense for a costmap to be valid in general after it has undergone a complete update procedure. However, I think this makes the isCurrent somewhat inconsistent. After clearing the costmap it basically tells us whether the costmap has been fully updated (not taking into account if sensor data has been received as expected), while during all other times isCurrent evaluates to false when sensor data is not received as expected (even if the map has been updated correctly).

So from my perspective, also taking into account the presence of sensor data (also after a reset), seems like a valid solution. If I'm not missing something here, it might also be less likely to introduce further problems then the points you mentioned (especially since you already mentioned the problems ^^).

The changes I would suggest:

1) I don't see the necessity to reset the last_updated_ in the observation buffer after a reset, since the buffer itself is not reset (as far as I can see). So basically, when we reset the map the buffer itself is not affected. Therefore, I would suggest to NOT call ObservationBuffer::resetLastUpdated() if the map is cleared. That way, the isCurrent() of the observation buffer would always correctly take into consideration if sensor data has been received within the expected_update_rate. If the observation_keep_time > expected_update_rate it would then be safe to assume that there is sensor data in the buffer if isCurrent() evaluates to true.

2) You said that "I believe we set that to true after clearing since at that point, we've called updateMap for the updateBounds and updateCosts to have been triggered". However, the current_ is already set in updateBounds anyway - if there is no irregularity with the sensor, it will also be set to true right after the costmap is cleared. Thus, we would NOT need to set current_ to true after a reset in updateCosts, since it will be set in updateBounds anyway.

However, I noticed another thing here: If the map was cleared, the obstacle layer reports being current already after the call to updateBounds (assuming sensor data is present) and before the call to updateCosts - i.e. before the map is actually updated. Maybe we should take this into consideration as well and check for was_reset_ in the isCurrent() of the ObstacleLayer.

So to sum it up. I think if we remove line 760 and 544 in the ObstacleLayer, we would obtain consistent behavior no matter if the costmap was reset. For the normal case (when sensor data is received as expected), after a reset the costmap is current again as soon as the map was updated. While if we didn't receive sensor data in the expected update rate, the costmap will not be current regardless of whether it has been cleared or not.

Combined with the change of only publishing the costmap if it's current, this would also solve my problem (at least from what I can see in a preliminary analysis) :).

SteveMacenski commented 2 days ago

while during all other times isCurrent evaluates to false when sensor data is not received as expected (even if the map has been updated correctly).

I think it is consistent: it takes time before the sensor timeout can trigger. If you have a 10 hz updating sensor and you set your timeout to 0.2 seconds to give it some leeway, then when we reset() or clear, it could be up to 0.2 seconds before we should evaluate to false. The same would happen during normal non-reset times, its just that when we reset, the timeout itself is reset at that very moment, rather than being on a rolling basis. I think these are exactly the same.

also taking into account the presence of sensor data (also after a reset), seems like a valid solution.

I don't entirely disagree with you, though I'm cautious because I know costmaps can be implemented with inputs that are either not regularly updating sensors (i.e. not a regular frequency) or based on slow updating inputs (1hz AI output or even on event basis). If we do this, then we would (1) have to wait until all costmap layers and their constituent observation sources have some new data and (2) block publication and algorithms until such a time -- which could be arguably good. We could make a carve out for special cases like the Static Layer that are not updated regularly, but that now means its another API that has to be implemented and managed. Doable, but starts to feel like patches on top of patches and is further complicated if the inputs are not simple lasers, depth cameras, etc that update regularly and frequently.

I don't see the necessity to reset the lastupdated in the observation buffer after a reset, since the buffer itself is not reset (as far as I can see).

I tried to find the origins of this, but its been done since ROS 1 Navigation, so its hard for me to pinpoint. Its at least 11 years old, but it seems to me to have been in the stack since the very first commit on costmap 2d, or so close to it that even GitHub itself doesn't have the stored Blame information for me to backtrack further.

I believe this could be due to a reset operation taking arbitrarily long, or reset is meant to reset all state and that's an important element of the layer's state. It used to unsubscribe to the topic, reset the map, then resubscribe to the topic, and finally reset the last updated time (which makes sense if a subscription is new). There was an activate and deactivate function that in reset are called relatively quickly, but could in concept be called independently with an arbitrary amount of time between them. We don't do the same activate/deactivate for the reset anymore.

So, I'm a bit cautious about a change to that.

However, the current_ is already set in updateBounds anyway

Update bounds is called in the updateMap loop in costmap 2D, its not called as part of the clearing or reset operation. So, if we updateMap and we have valid (although potentially empty) observation buffers, then it would always be current_ = true;. Perhaps this can be removed. I believe the thinking is that if we reset the costmap, we don't want to block the next planning/control iteration that'll be immediately triggered & the #1999 that this PR was in response to was w.r.t. the Map not being present after a reset, so my choices on the data processing layers may be flawed.

I can really see an argument for removing that current_ & was_reset_ logic so that current_ is not forced true, allowing it to be false if data doesn't come in within the buffer timeout without an immediate override. I'd love a 3rd sanity check on that, but I'm on board. I think that also addresses your next text on "However, I noticed another thing here:"?

check for wasreset in the isCurrent() of the ObstacleLayer.

current_ = false; is set in the reset along with was_reset_ = true;. So, isCurrent() is already going to return false until updateMap runs to trigger updateBounds / updateCosts. I think that's already baked in at the moment.

So to sum it up.

I think if we remove line 760

@mikeferguson can you provide some historical context and/or let me know your thoughts?

and 544 in the ObstacleLayer,

Tentatively agree. @mikeferguson do you concur?

schupf2 commented 16 hours ago

Doable, but starts to feel like patches on top of patches ...

You're definitely right there. I also think we should keep it as simple as is and only consider the expected_update_rate instead of the actual presence of data.

I believe this could be due to a reset operation taking arbitrarily long, or reset is meant to reset all state and that's an important element of the layer's state ...

Yes, if we used to unsubscribe and resubscribe to the sensor topic the resetBuffersLastUpdated definitely seems to have been necessary, but as this is not done anymore I think the resetBuffersLastUpdated should not be needed.

In terms of "a reset operation taking arbitrarily long" - this should not matter right? For the ObservationBuffer we store new point clouds in the buffer when received. So if we don't unsubscribe while clearing the map, we will also obtain updates during that time - so the actual update rate should not be affected.

So, I'm a bit cautious about a change to that.

Definitely understand that though :-).

So, if we updateMap and we have valid (although potentially empty) observation buffers, then it would always be current_ = true

As far as I can see it, we should not obtain a observation buffer that is empty and valid IF we set the observation_persistence and expected_update_rate accordingly (since the observation buffer is not cleared during a reset). That is totally fine IMO, since it's basically up to the user if they are willing to accept potentially empty observation buffers or not.

I'd love a 3rd sanity check on that, but I'm on board.

:D

current_ = false; is set in the reset along with wasreset = true;. So, isCurrent() is already going to return false until updateMap runs to trigger updateBounds / updateCosts.

Yes exactly, but in the LayeredCostmap::updateMap the updateBounds is of course always called before the updateCosts. Thus, (in theory) after a reset isCurrent could be set to true before the the map is actually updated in updateCosts (at least for the obstacle layer), leading to a costmap with isCurrent being true before the obstacle layer is included in the map.

However, I didn't notice any problems with that so far, so I guess this should not cause problems since the costmap will be updated right after.

Looking forward to @mikeferguson thoughts :).

ros-navigation / navigation2