ros-perception / vision_msgs

Algorithm-agnostic computer vision message types for ROS.
Apache License 2.0

Keypoints and polygons instead of bounding boxes #28

Closed mistermult closed 3 years ago

mistermult commented 5 years ago

Hi,

currently I'm integrating a face detector. It does not emit bounding boxes but keypoints, e.g. one point at the center of the left eye, one at the center of the right eye, and one at the center of the mouth (simplified). There are two possible ways to model this in vision_msgs:

  1. Use one detection for each keypoint with one hypothesis. Make a bounding box of size 0x0 at the keypoint and set hypothesis.id = "left_eye"/"right_eye".
     ++ No need to extend vision_msgs.
     -- How to group multiple faces in one image? Maybe by hypothesis.id="left_eye_1", but this would generate ids that are not in the database. Maybe by tracking_id?
     -- If faces are somehow grouped by some id, clients have to iterate through all detections and group them to find all keypoints for a given face.
     -- Does not describe the domain. Clients need more implicit knowledge.
     -- Cannot describe polygons.

  2. In the annotation tool CVAT, which is connected to OpenCV, you can use either bounding boxes or an array of points. The latter is used to annotate keypoints or arbitrary polygon shapes. If the points are meant to be keypoints, their order implicitly defines what each point describes, e.g. the first point is always the left eye. This would yield the following extended message:

    
    #Detection2D.msg
    Header header

    # Class probabilities
    ObjectHypothesisWithPose[] results

    # 2D bounding box surrounding the object.
    BoundingBox2D bbox

    # Keypoints in the image or arbitrary polygon.
    geometry_msg/Point2D[] points

    # The 2D data that generated these results (i.e. region proposal cropped out of
    # the image). Not required for all use cases, so it may be empty.
    sensor_msgs/Image source_img

    # If true, this message contains object tracking information.
    bool is_tracking

    # ID used for consistency across multiple detection messages. This value will
    # likely differ from the id field set in each individual ObjectHypothesis.
    # If you set this field, be sure to also set is_tracking to True.
    string tracking_id


     -- Need to extend vision_msgs.
     ++ Can also describe keypoints and polygons.
     ++ Describes the domain with keypoints.
     ++ One array of points can model all the shapes used by other annotation tools.
     -- More complexity, especially if clients want to support polygons. Keypoints should be OK; clients might just ignore them.
     -- Clients might get a message with an empty/default bounding box if only keypoints are set. Maybe add a field like shape_type (0 = bounding box, 1 = keypoints, 2 = polygon) or has_points.
     ** There are other ways to model it, e.g. with a completely new message KeypointDetectionXD.
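To make option 2 concrete, here is a minimal sketch of how a client might consume the proposed `points` array. It uses plain Python dataclasses as simplified stand-ins for the generated ROS message classes (e.g. the box center is flattened to `center_x`/`center_y`). The `shape_type` constants and the `FACE_KEYPOINT_NAMES` order are hypothetical, taken from the suggestion above, and are not part of vision_msgs. The sketch shows that the point order carries the meaning, and that a client could fill an empty/default `bbox` from the points.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical shape_type values, as suggested above (not part of vision_msgs).
SHAPE_BOUNDING_BOX = 0
SHAPE_KEYPOINTS = 1
SHAPE_POLYGON = 2

# Hypothetical fixed keypoint order: index 0 is always the left eye, etc.
FACE_KEYPOINT_NAMES = ["left_eye", "right_eye", "mouth"]


@dataclass
class Point2D:
    x: float
    y: float


@dataclass
class BoundingBox2D:
    # Simplified: the real message uses a Pose2D center.
    center_x: float = 0.0
    center_y: float = 0.0
    size_x: float = 0.0
    size_y: float = 0.0


@dataclass
class Detection2D:
    """Simplified stand-in for the proposed Detection2D with a points array."""
    shape_type: int = SHAPE_BOUNDING_BOX
    bbox: BoundingBox2D = field(default_factory=BoundingBox2D)
    points: List[Point2D] = field(default_factory=list)


def named_keypoints(det: Detection2D, names: List[str]) -> Dict[str, Tuple[float, float]]:
    """Map each point to its name using the implicit order convention."""
    if det.shape_type != SHAPE_KEYPOINTS:
        raise ValueError("detection does not carry keypoints")
    if len(det.points) != len(names):
        raise ValueError("expected %d keypoints, got %d" % (len(names), len(det.points)))
    return {name: (p.x, p.y) for name, p in zip(names, det.points)}


def bbox_from_points(points: List[Point2D]) -> BoundingBox2D:
    """Axis-aligned box enclosing all points."""
    if not points:
        raise ValueError("no points")
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    return BoundingBox2D(
        center_x=(min_x + max_x) / 2.0,
        center_y=(min_y + max_y) / 2.0,
        size_x=max_x - min_x,
        size_y=max_y - min_y,
    )


# Usage: a face detection that only sets keypoints, leaving bbox at its default.
face = Detection2D(
    shape_type=SHAPE_KEYPOINTS,
    points=[Point2D(100.0, 80.0), Point2D(140.0, 80.0), Point2D(120.0, 130.0)],
)
keypoints = named_keypoints(face, FACE_KEYPOINT_NAMES)
if face.bbox.size_x == 0.0 and face.bbox.size_y == 0.0:
    face.bbox = bbox_from_points(face.points)  # fill the empty/default box
print(keypoints["mouth"], face.bbox)
```

Note that every client must agree on `FACE_KEYPOINT_NAMES` out of band; nothing in the message itself says which index is the mouth.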

Let me know what you think about support for key points/polygons and the above extensions of Detection2D.
Kukanani commented 5 years ago

I would recommend creating new messages that are specific to faces, given the constraints you outlined. We can talk about adding these to the package as well. Especially if CVAT's keypoints have a defined order for face detection, as you say, it would make sense to encode this explicitly in a message type. So you could have a CVATFace message type, or just a generic FaceKeypoints message, that has a face_id field or something similar. Then, if you'd like, you could match the face_id to the tracking_id or class_id (depending on how your detector works) of an existing message type.
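A rough sketch of that idea, under the assumption that the face_id is matched against Detection2D's tracking_id. `FaceKeypoints`, its fields and `attach_keypoints` are hypothetical names for illustration, and `Detection` is a minimal stand-in keeping only `tracking_id`. The keypoint semantics live in named fields rather than an implicit order, and the link to the detection is a plain id lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class FaceKeypoints:
    """Hypothetical face-specific message: named fields instead of an ordered array."""
    face_id: str
    left_eye: Point
    right_eye: Point
    mouth: Point


@dataclass
class Detection:
    """Minimal stand-in for Detection2D, keeping only the field used for linking."""
    tracking_id: str


def attach_keypoints(
    detections: List[Detection], faces: List[FaceKeypoints]
) -> List[Tuple[Detection, Optional[FaceKeypoints]]]:
    """Pair each detection with the face whose face_id equals its tracking_id."""
    by_id: Dict[str, FaceKeypoints] = {f.face_id: f for f in faces}
    return [(d, by_id.get(d.tracking_id)) for d in detections]


detections = [Detection("face_1"), Detection("face_2")]
faces = [FaceKeypoints("face_1", (100.0, 80.0), (140.0, 80.0), (120.0, 130.0))]
pairs = attach_keypoints(detections, faces)
for det, face in pairs:
    print(det.tracking_id, face.mouth if face is not None else "no keypoints")
```

Detections without matching keypoints simply get `None`, so subscribers that only care about boxes can ignore the extra topic entirely.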

I think that linking side information to the Detection/Classification messages in this way is much more flexible than trying to support every use case. Other pipelines (semantic segmentation or Detectron, for example) output a pixel mask or a heat map, and it doesn't make sense for the Detection message to encode all possible representations of the output.

If you'd like, you could still use Detection2D for the individual features of the eye, although I think this would be overkill.
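For comparison, a sketch of that per-feature approach (option 1 in the original post) shows the grouping work it pushes onto every client. It assumes, as one of the possibilities raised in the post, that keypoints of the same face share a `tracking_id` and that the hypothesis id holds the feature name. `KeypointDetection` is a simplified hypothetical stand-in, not the vision_msgs type.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class KeypointDetection:
    """Simplified stand-in for a Detection2D with a 0x0 box at the keypoint."""
    tracking_id: str      # assumed to identify the face this keypoint belongs to
    hypothesis_id: str    # feature name, e.g. "left_eye"
    center_x: float
    center_y: float
    size_x: float = 0.0   # 0x0 box: only the center is meaningful
    size_y: float = 0.0


def group_by_face(dets: List[KeypointDetection]) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Rebuild per-face keypoint sets from a flat list of detections."""
    faces: Dict[str, Dict[str, Tuple[float, float]]] = defaultdict(dict)
    for d in dets:
        faces[d.tracking_id][d.hypothesis_id] = (d.center_x, d.center_y)
    return dict(faces)


flat = [
    KeypointDetection("face_1", "left_eye", 100.0, 80.0),
    KeypointDetection("face_2", "left_eye", 300.0, 90.0),
    KeypointDetection("face_1", "mouth", 120.0, 130.0),
]
grouped = group_by_face(flat)
print(grouped)
```

Every subscriber has to repeat this loop and know the naming convention, which is the "clients need more implicit knowledge" drawback listed in the original post.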

(Reply sent by email to mistermult's post of Fri, Aug 9, 2019 at 11:04 AM.)

Kukanani commented 3 years ago

Closing due to inactivity.