Why is the /diagnostics_toplevel_state ERROR when one of the diagnostics is STALE

ros / diagnostics

Packages related to gathering, viewing, and analyzing diagnostics data from robots.

https://index.ros.org/p/diagnostics/


Why is the /diagnostics_toplevel_state ERROR when one of the diagnostics is STALE #297

Open Rayman opened 1 year ago

Rayman commented 1 year ago

In the current code, the toplevel state is STALE only when ALL the diagnostics are STALE. Example:

[Image: group_analyzer]

I think it would be more logical for the state to be STALE when at least one of the diagnostics is STALE and none of them is ERROR:

[Image: group_analyzer2]

What do you think about this logic? I've implemented this in my fork, but it is a breaking change.

g-gemignani commented 1 year ago

Hi @Rayman and thank you for your comment. I agree with you that this makes more sense but, as you stated, this is a breaking change, so I do not believe it can be merged as is for Noetic. I am not sure how the ROS 2 maintainers would view this change for the ROS 2 version...

ct2034 commented 1 year ago

Hi @Rayman. Thanks for your suggestion. Just to be clear: the stale state is set by the generic analyzer if no message was received within a given timeout: https://github.com/ros/diagnostics/blob/a80bd1c33786e5e5642f91b2b6016048f32fbf0e/diagnostic_aggregator/include/diagnostic_aggregator/generic_analyzer_base.hpp#L196 This can be useful information about how up to date a state is.

If I understand correctly, your suggestion is to treat STALE like the other levels during aggregation and aggregate it within the group. In that case, I would honestly have a hard time ranking its severity relative to the other levels. Currently, it is level 3, which reads as the highest priority. This means you would only see STALE at the top level, even if another item in that group is in the ERROR state. That cannot be the intended behavior. Changing these levels would be a SERIOUS breaking change.

What is your take on that?
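
To make the concern above concrete, here is a minimal sketch of what a plain maximum over the levels would do. The numeric values match diagnostic_msgs/DiagnosticStatus (OK=0, WARN=1, ERROR=2, STALE=3); the `Level` enum and the `naive_max` helper are hypothetical names used only for illustration and are not taken from the aggregator's code.

```python
from enum import IntEnum


class Level(IntEnum):
    # Numeric values mirror diagnostic_msgs/DiagnosticStatus.
    OK = 0
    WARN = 1
    ERROR = 2
    STALE = 3


def naive_max(levels):
    """Aggregate by plain maximum (hypothetical helper for illustration)."""
    return max(levels)


group = [Level.OK, Level.ERROR, Level.STALE]
# STALE has the highest numeric value, so it hides the ERROR in the same group.
print(naive_max(group).name)  # STALE
```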

Rayman commented 1 year ago

The toplevel state is not just the maximum of all the levels. It's calculated with the following algorithm:

if maximum_level > ERROR and minimum_level <= ERROR:
    # one or more STALE, but not all of them
    level = ERROR
else:
    level = maximum_level
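
As a runnable sketch of the rule just described (not the aggregator's actual source), assuming the levels are the integers from diagnostic_msgs/DiagnosticStatus and that at least one diagnostic is present. The function name `current_toplevel` is made up for this example.

```python
# Level values as defined in diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


def current_toplevel(levels):
    """Toplevel state under the current rule (hypothetical helper name)."""
    maximum_level = max(levels)
    minimum_level = min(levels)
    if maximum_level > ERROR and minimum_level <= ERROR:
        # One or more STALE, but not all of them.
        return ERROR
    return maximum_level


print(current_toplevel([STALE, OK]))     # 2 (ERROR)
print(current_toplevel([STALE, STALE]))  # 3 (STALE)
```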

I would propose to change this to the following, because I think it's more logical.

if maximum_level == STALE and maximum_level_without_stale < ERROR:
    # one or more STALE, but no errors
    level = STALE
else:
    level = maximum_level_without_stale
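
A minimal runnable sketch of this proposed rule, again assuming the integer levels from diagnostic_msgs/DiagnosticStatus. The pseudocode does not say what `maximum_level_without_stale` should be when every diagnostic is STALE; this sketch assumes it falls back to STALE in that case. The function name `proposed_toplevel` is hypothetical.

```python
# Level values as defined in diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


def proposed_toplevel(levels):
    """Toplevel state under the proposed rule (hypothetical helper name)."""
    maximum_level = max(levels)
    non_stale = [level for level in levels if level != STALE]
    # Assumption: if everything is STALE, there is no non-stale maximum,
    # so fall back to STALE.
    maximum_level_without_stale = max(non_stale, default=STALE)
    if maximum_level == STALE and maximum_level_without_stale < ERROR:
        # One or more STALE, but no errors.
        return STALE
    return maximum_level_without_stale


print(proposed_toplevel([STALE, OK]))     # 3 (STALE)
print(proposed_toplevel([STALE, ERROR]))  # 2 (ERROR)
```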

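This will be the difference between the two algorithms:

| diagnostic1 | diagnostic2 | current | proposed |
|-------------|-------------|---------|----------|
| stale       | ok          | error   | stale    |
| stale       | warn        | error   | stale    |
| stale       | error       | error   | error    |
| stale       | stale       | stale   | stale    |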

asymingt commented 1 year ago

What I find counter-intuitive about the current behavior is that if you have three leaf diagnostics rolled up into a group, the discard_stale setting doesn't seem to have an impact on the parent status. For example, if bar and baz in the example below go stale, but foo is OK, I would intuitively think that the part group should also be OK. However, what I'm seeing is that foo currently gets marked as ERROR.

diagnostics_aggregator:
  ros__parameters:
    pub_rate: 1.0
    path: 'robot'
    analyzers:
      part:
        type: 'diagnostic_aggregator/AnalyzerGroup'
        path: 'part'
        foo:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'foo'
          find_and_remove_prefix: ['/foo:']
          num_items: 1
        bar:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'bar'
          find_and_remove_prefix: ['/bar:']
          discard_stale: true
        baz:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'baz'
          find_and_remove_prefix: ['/baz:']
          discard_stale: true
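
The roll-up I would intuitively expect can be sketched roughly as follows. This only illustrates the expectation described above; it is not how the aggregator currently behaves. The function name `expected_group_level` is hypothetical, the levels are the integers from diagnostic_msgs/DiagnosticStatus, and foo is treated as having discard_stale set to False because the config does not set it.

```python
# Level values as defined in diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


def expected_group_level(children):
    """Roll up (level, discard_stale) pairs, skipping stale children
    that have discard_stale enabled (hypothetical helper)."""
    kept = [level for level, discard_stale in children
            if not (level == STALE and discard_stale)]
    # Assumption: a group with nothing left to report counts as OK.
    return max(kept, default=OK)


# foo is OK; bar and baz are stale with discard_stale: true.
part = [(OK, False), (STALE, True), (STALE, True)]
print(expected_group_level(part))  # 0 (OK)
```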

Rayman commented 1 year ago

I did not want to propose merging a breaking change into Noetic, so I've implemented my proposed change in our fork: https://github.com/nobleo/diagnostics. Feel free to use it.

I've implemented the change for the toplevel diagnostics and for the AnalyzerGroup.

asymingt commented 1 year ago

I also added my implementation here: https://github.com/ros/diagnostics/pull/315

Timple commented 8 months ago

Since this was a breaking change for Noetic, it probably also is for Humble and Iron?

Does it make sense to put some effort into this before ROS Jazzy?