ros-perception / vision_msgs

Algorithm-agnostic computer vision message types for ROS.
Apache License 2.0
157 stars 75 forks source link

Semantic Segmentation messages #71

Closed pepisg closed 1 year ago

pepisg commented 2 years ago

In this PR I aim to add a semantic segmentation message. I'm aware of the discussion going on on #63, however I think using images as masks may leave behind important metadata like the class_map (the name of the class corresponding to each ID). Also, using several channels in the image for sending the mask and the confidence of each pixel could lead to varying implementations that may not be compatibles with standard functionalities relying on semantic segmentation like the in progress semantic segmentation costmap plugin PR in nav2.

mintar commented 2 years ago

I agree that having additional metadata would be useful, but is that worth the disadvantages I've outlined in https://github.com/ros-perception/vision_msgs/issues/63#issuecomment-1032765195?

SteveMacenski commented 2 years ago

@mintar totally agree. Looking at the proposal, I don't know how to massage all that content into a sensor_msgs/Image. Do you have a suggestion for that? For instance the confidence of the predictions and the threshold the algorithm is told to use about whether or not something is sufficiently confident for use. class_map also seems important to know what means what -- though I think that could be a potentially massive thing that might not be ideal to embed into the messages (maybe a camera_info -> semantic_info eq. is needed in addition the Image to contain that info)?

I suppose the confidence_threshold could be a ROS parameter for the end user without too much of an issue.

Between parameters and a semantic_info topic with the class maps, I think that handles quite a bit of the issue. The only thing that I'm not sure how to model with Image is the confidences.

Maybe we use a 2 channel image format, 1 for the class and a second for the confidence? Does the CV tools work for 2 channel images or would we need to add in a 3rd for visualization purposes. I suppose the 3rd channel could be threshold in case there are different thresholds for different classes / sections of the images, but I'm not sure if that's a 'normal' thing to do or I'm just coming up with ideas to try to fill space.

mintar commented 2 years ago

@SteveMacenski wrote:

@mintar totally agree. Looking at the proposal, I don't know how to massage all that content into a sensor_msgs/Image. Do you have a suggestion for that? For instance the confidence of the predictions and the threshold the algorithm is told to use about whether or not something is sufficiently confident for use. class_map also seems important to know what means what -- though I think that could be a potentially massive thing that might not be ideal to embed into the messages (maybe a camera_info -> semantic_info eq. is needed in addition the Image to contain that info)?

I don't think the class_map would be that massive. Let's be generous and assume 20 characters per class name, and let's assume we have 255 classes. Then the whole array would be 255 * (20 + 1) = 5355 bytes. A 640x480 uint8 image has 307200 bytes, so the class_map would be around 1.7% of that. On the other hand, 5kb overhead is not nothing, so I'd like to hear better ideas.

BTW, how to handle the mapping from int to class string is also an open issue with the existing messages (Detection3DArray etc.). I currently handle that by having a yaml file with the mapping that I load into the namespace of every single node that requires it. I would prefer it though if it was available as a message - that way if you record a rosbag, it would reflect the mapping at the time the rosbag was recorded.

I like the idea of publishing all the metadata on a "semantic_info" topic, similar to the camera_info topic. If we're really worried about the overhead of the class_map, perhaps we don't publish the semantic_info in sync with each frame, but instead only whenever it changes on a latched topic?

Maybe we use a 2 channel image format, 1 for the class and a second for the confidence? Does the CV tools work for 2 channel images or would we need to add in a 3rd for visualization purposes. I suppose the 3rd channel could be threshold in case there are different thresholds for different classes / sections of the images, but I'm not sure if that's a 'normal' thing to do or I'm just coming up with ideas to try to fill space.

Here's a list of all OpenCV image encodings: sensor_msgs/image_encodings.h

You can have 2-channel images, like 8UC2 (8 bit, unsigned, 2 channels) or 16UC2, and OpenCV handles them just fine. I'm not sure how well image_transport (for on-the-fly PNG or JPG compression) or visualization tools like RViz handle these more exotic formats. Or all the other stuff like image_proc, image_geometry and so on.

Alternatively, the 2 images (semantic segmentation and confidence) could be published on 2 different topics. That would have the following advantages:

SteveMacenski commented 2 years ago

5kb overhead is not nothing, so I'd like to hear better ideas.

Have a semantic_info topic / message like we have a camera_info topic / message for metadata regarding the camera. Now its metadata regarding the AI system. That way we can only grab this once (or pair up with the sensor_msgs/Image) and let us use the Image message without class definitions in it. That's not uncommon in my experience with camera_info to just grab 1 message and then stop subscribing to it.

would prefer it though if it was available as a message

Maybe then this is more generic than just a semantic_info topic but a vision_info topic? Latched seems fine to me!

the 2 images (semantic segmentation and confidence) could be published on 2 different topics.

That is interesting. I wonder about the run-time impacts of that since anything about 1-2MB in size has a hard time publishing without zero-copy or composition. I'm not sure if its better to have one 2MB message or two 1MB messages going into the same node. I suspect 1 message is better from some internal Samsung projects I'm helping with, but I don't have the data specifically to support that at the moment.

pepisg commented 2 years ago

I agree with the idea of using sensor_msgs/msg/Image instead. From what you have been discussing, this is an sketch of how a semantic_info message would look like

vision_msgs/LabelMap[] label_map # uint16 - string key value pairs
float32 confidence_threshold

This message seems to also work for object detection models using the other messages on vision_msgs. int64 is used for ids there, but can there really be more than 65536 classes? If using a single multichannel image for packing the mask and the confidence, I think it may be worth adding a field to somehow tell how the channels in the image are organized, this way it would not be necessary to define a fixed convention to use segmentation data with other ROS packages like nav2. This however would break the idea of having a message to fit AI applications

With that in mind, when things are settled I can reformat this PR to include these messags

SteveMacenski commented 2 years ago

VisionInfo maybe a good use for the message name if we plan to use this for detections and segmentations?

I think uint16 is a good bet for the moment and the current state of AI. If in another decade our magic AIs can know millions of things, we can always increase it.

pepisg commented 2 years ago

Should we settle for two distinct images for the mask and confidence? Or should we use a single dual-channel image and define a convention to specify which channel should contain what? This could be overcame by adding a metadata field to the VisionInfo message to make explicit the channel order, or applications relying on semantic segmentation data (like the segmentation plugin for nav2) could take in some parameter in their configuration to let them know where to look.

The only drawback of using two images seems to be the potential impact of publishing 2 data intensive messages instead of one. That would also force the message consumer to synchronize them (which can be done with a message filter, but drifts from ros standard implementation). This approach seems to be more flexible and does not rely on assumed conventions nor requires extra configuration parameters on the consumers.

Let me know what you think to fix the PR and adjust the work on the segmentation plugin

mintar commented 2 years ago

I'd prefer two separate image topics instead of a two-channel image.

SteveMacenski commented 2 years ago

I prefer a single 2-channel image over 2 separate images :laughing:

Maybe we need to ask @tfoote what his thoughts are. He usually brings up very good points when I propose interface definitions.

mintar commented 2 years ago

One thing that speaks for two separate images is that you can have separate data types. You wrote yourself that it's possible to have more than 255 classes. In that case, the segmentation image would need to be a uint16. If you have a single 2 channel image, that would mean the confidence image must also be uint16. With separate images, you could always use uint8 for the confidence image and uint8 or uint16 for the segmentation image, depending on the number of classes that you have.

SteveMacenski commented 2 years ago

The option behind door number three is that we actually don't include confidence information at all, and make as part of the Interface API that any pixel below a certain threshold should be set to -1, meaning invalid, as part of populating the image. Therefore, when any consumer of this work is parsing the image, any value of -1 means it was too low of confidence for use, provided by the publisher node. Just push that logic down.

mintar commented 2 years ago

We can't do -1 with an unsigned int, but in semantic segmentation images "0" usually means "unknown class". The confidence image is still nice extra info to have.

mintar commented 2 years ago

Another advantage of two separate topics: Not all subscribers may be interested in the confidence values, so they don't even need to subscribe to that topic.

SteveMacenski commented 2 years ago

We can't do -1 with an unsigned int, but in semantic segmentation images "0" usually means "unknown class". The confidence image is still nice extra info to have.

Well -1 is wrapped around in unsigned ints, so it would be max. I wouldn't want to override 0 since that could be a meaningful value for a detection/segmentation algorithm and it would be annoying if we had to have folks increment all their outputs by 1 to meet the standard the interface defines. Its far less disruptive / likely that a collision would occur at 65,536

I suppose we could do both; publish the confidence image and also set the precedence that 65,536 is a reserved value for not confident enough, if the publisher can know that value. That would get both worlds, but also require algorithms implementing this to act about both, if available.

tfoote commented 2 years ago

I prefer a single 2-channel image over 2 separate images laughing

Maybe we need to ask @tfoote what his thoughts are. He usually brings up very good points when I propose interface definitions.

The primary reason for choosing to send one multi channel image versus 2 separate images would be if they generally would be used usefully separately. The simplest examples are stereo images. If you want just a monocular camera stream, you don't want to incur the overhead of subscribing to the full stereo image just to ignore the second half of the stream. On the flip side there is a small amount of overhead related to synchronizing incoming images, and they have to have enough embedded content to be useful on their own (aka they both have full headers). We have tools to make it easier, but the extra overhead is why we don't send everything decoupled.

The only drawback of using two images seems to be the potential impact of publishing 2 data intensive messages instead of one. That would also force the message consumer to synchronize them (which can be done with a message filter, but drifts from ros standard implementation). This approach seems to be more flexible and does not rely on assumed conventions nor requires extra configuration parameters on the consumers.

I would disagree and say that publishing the synchronized image messages is the ROS standard. We do it for stereo images, compressed images, and camera_info.

And if you're publishing two images instead of one they should be half the size each as they will have half the data each (or some ratio depending on the datatypes)

The other major concern in this sort of discussion is encapsulation. You want to make sure that messages generally can stand on their own without knowing the context that they were broadcast in. Such that if you get a rosbag of some random data it will have everything that you need to process it later.

mintar commented 2 years ago

@SteveMacenski wrote:

Well -1 is wrapped around in unsigned ints, so it would be max. I wouldn't want to override 0 since that could be a meaningful value for a detection/segmentation algorithm and it would be annoying if we had to have folks increment all their outputs by 1 to meet the standard the interface defines. Its far less disruptive / likely that a collision would occur at 65,536

I've checked what other semantic segmentation / panoptic segmentation datasets are using, and most of them seem to use 255 for the "UNLABELED" class (bdd100k semantic, COCO-Stuff), although some use 0 (bdd100k panoptic, CityScapes). So I agree we should probably use 255 here. The only trouble is that "-1" means 255 if the semantic segmentation image encoding is mono8, and "-1" means 65,536 if the encoding is mono16. I would like to allow both encodings for the segmentation image.

@tfoote wrote:

The primary reason for choosing to send one multi channel image versus 2 separate images would be if they generally would be used usefully separately. [...] And if you're publishing two images instead of one they should be half the size each as they will have half the data each (or some ratio depending on the datatypes).

If we include the UNLABELED class (which we absolutely should), then the segmentation image is very useful on its own. I think this probably covers at least 80% of use cases; most clients probably won't care about the confidence image. So that would speak for having two separate image topics.

SteveMacenski commented 2 years ago

I'd like to bring up though that we're now talking about having to synchronize 3 topics potentially (vision_info, segmentation_mask, segmentation_mask_confidences) for a "typical and full" implementation. That seems to be approaching excessiveness, even though I can live with the individual technical points that leads to that outcome.

I'm not sure I agree that they would both be useful separately. The segmentation mask without confidences could be interesting, but the confidences without the segmentation mask is not - unlike the stereo example where either image feed could be individually interesting without the other.

mintar commented 2 years ago

If we go with the latched topic for the vision messages, we only need to synchronize 2 topics, and only those clients that are interested in the confidences. You're right that the confidence image is not interesting on its own, but Tully's argument still holds: you save 50% data in a lot of use cases by having 2 topics.

SteveMacenski commented 2 years ago

There would need to be a level of documentation required for that workflow -- e.g. it would not be obvious to anyone just looking at the vision_info message / topic that it would be a transient local QoS published 1 time at initialization and cannot be synchronized with the mask/detections. While camera_info you can do a same thing with, that's being constantly published with the images too. So this would be a slightly different design pattern.

I'd think to be analog to existing similar things, we'd want to publish with the masks/detections as well, regardless of QoS. Those that want to grab it 1 time to use forever can do so, but then others that want it synchronized for whatever reason can, just like camera_info. I think we have to assume that some people will want to do that, which would involve 3 synchronized topics.

mintar commented 2 years ago

Publishing vision_info in sync is fine with me!

pepisg commented 2 years ago

To sum up, the consensus seems to be using 2 sensor_msgs/msg/Image along with a new vision_info message with the following structure:

vision_msgs/LabelMap[] label_map # uint16 - string key value pairs
float32 confidence_threshold

I will refactor this PR to only include the new vision_info and label_map messages.

Additionally according to @SteveMacenski suggestion some documentation should be added describing how:

  1. The -1 value in the segmentation mask, which would end up being the greatest unsigned value of the chosen encoding, means the pixel belongs to the UNLABELED class
  2. The workflow for getting vision_info messages when published using a transient local qos
  3. The workflow for synchronizing three topics containing a class mask, a confidence mask and the vision_info

Does that sound good @SteveMacenski @mintar ?

Also, please let me know what would be the best place to add the documentation

mintar commented 2 years ago

Sounds good! Comments on how to interpret the data fields (for example, the UNLABELED class) should go into the message comments of that field. More extensive documentation about the workflows should go into the README, IMO.

SteveMacenski commented 2 years ago

Agree, or maybe add in a vision_msgs_example package that has some trivial example node subscribing & synchronizing the 3 just so there's some boilerplate folks can work with (or just a snippet in the readme is fine)

pepisg commented 2 years ago

Hi @mintar, I just realized there is already a VisionInfo message defined here. Should I update it with what was discussed here or should I create another one called say InferenceInfo? I will do the latter since I imagine changing an interface can break people's code easily, but I will change back in case you prefer to do so

mintar commented 2 years ago

Huh, I forgot about the existing VisionInfo. I agree that we can't change it (adding fields is usually ok, but not completely remodeling the message), so we should have a new message. I'm not fond of the name InferenceInfo, but I can't think of a better name. @SteveMacenski , @Kukanani: Any ideas?

Kukanani commented 2 years ago

I like the interface that we've settled on here. In terms of naming, if InferenceInfo is specifically for use with semantic segmentation, should we call it something like SegmentationInfo?

Either way, I'm going to suggest some documentation improvements for clarity.

pepisg commented 2 years ago

I like the interface that we've settled on here. In terms of naming, if InferenceInfo is specifically for use with semantic segmentation, should we call it something like SegmentationInfo?

This message is not only meant for segmentation but for object detection as well, or in general to any vision pipeline that returns id based classifications that you later may want to convert back to human readable string class names.

SteveMacenski commented 2 years ago

LabelInfo? ClassInfo? Then the individual instances would be Label or Class, plus or minus a Vision.

I agree, I'm not a fan of InferenceInfo

jmm-slamcore commented 1 year ago

@pepisg is this still relevant? Think it would be great to get this merged to bring more obj. det./semantics capabilities to ROS2/Nav2

SteveMacenski commented 1 year ago

agreed!

Kukanani commented 1 year ago

List of outstanding items I see before this can merge:

Anything else?

pepisg commented 1 year ago

Alternate name for InferenceInfo. I propose SegmentationInfo but am also happy with ClassInfo or LabelInfo or perhaps some other proposal. Author's choice :)

Done. left it LabelInfo following @SteveMacenski 's suggestion and my earlier thought about this messages enabling users to convert id encoded data from vision pipelines to human readable class names. which may apply for object detection as well

Clarify all comments and docs to either be latched/transient local Qos (and no mention of "being like CameraInfo" or time synchronizers), or continually published. I don't have a preference, but we should be consistent in our intended usage, or else it will cause confusion

We seemed to agree on being two possible setups: a) latched topic; b) node subscribes to topic, gets one message and then unsubscribes. There are several comments on the PR which I tried to address on the last commit regarding this. Please let me know if you have any further comments on this section of the documentation.

Other misc comment typos/clarifications as proposed in PR comments

I guess I integrated all your change requests, let me know if there is anything else I should change / include

SteveMacenski commented 1 year ago

@Kukanani are you happy with these?