v4r-tuwien / grasping_pipeline


Make state machine definition more flexible #11

Closed jibweb closed 6 months ago

jibweb commented 2 years ago

smach enables the definition of sequences of states that behave as states themselves. We have the definition of most states and their combination for one specific instance here, but it would be very nice to define them more flexibly. Ideally, we should be able to define a state machine that does grasping and handover, and another one that does grasping and placement, with minimal code duplication by relying on these smach properties.

This is also a good opportunity to define a clearer, documented set of states. What I mean is that ideally there should not be a VeRefine state and a PyraPose state, but rather a PoseEstimation state with a clearly defined interface that can be implemented either by the VeRefine code or the PyraPose code, etc. This makes the grasping pipeline code much more adaptable and makes it much easier to test new methods: one clear interface needs to be implemented, and that's it.
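
Roughly what I have in mind, as a sketch only (class names, outcome labels and userdata keys below are placeholders, not the actual pipeline code):

# Sketch: one generic PoseEstimation state with a fixed interface; the concrete
# method (VeRefine, PyraPose, ...) only changes what is injected, not the machine.
import smach


class PoseEstimation(smach.State):
    def __init__(self, estimate_fn):
        # estimate_fn: any callable returning a list of object poses (or None),
        # e.g. a thin wrapper around a VeRefine or PyraPose action client
        smach.State.__init__(self, outcomes=['succeeded', 'failed'],
                             output_keys=['object_poses'])
        self.estimate_fn = estimate_fn

    def execute(self, userdata):
        poses = self.estimate_fn()
        if not poses:
            return 'failed'
        userdata.object_poses = poses
        return 'succeeded'


class ExecuteGrasp(smach.State):
    # stand-in for the existing grasp execution state
    def __init__(self):
        smach.State.__init__(self, outcomes=['succeeded', 'failed'],
                             input_keys=['object_poses'])

    def execute(self, userdata):
        return 'succeeded'


def make_grasp_sm(estimate_fn):
    # the grasping sub-machine behaves as a state itself, so "grasp + handover"
    # and "grasp + placement" machines can both reuse it without duplication
    sm = smach.StateMachine(outcomes=['grasped', 'aborted'])
    with sm:
        smach.StateMachine.add('POSE_ESTIMATION', PoseEstimation(estimate_fn),
                               transitions={'succeeded': 'EXECUTE_GRASP',
                                            'failed': 'aborted'})
        smach.StateMachine.add('EXECUTE_GRASP', ExecuteGrasp(),
                               transitions={'succeeded': 'grasped',
                                            'failed': 'aborted'})
    return sm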

jibweb commented 2 years ago

I would also simplify the user input aspect as part of this effort. If the different paths (PyraPose/VeRefine) are handled by different action implementations in the background, that's also easier. And different state machines can simply be different launch files, depending on the needs?

Right now, the user input sort of defeats the purpose of a state machine, as most states can be executed at any time instead of in the right order. The checks before action execution can be replaced by a single user input in the ExecuteGraspAction.
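
Something like this is what I mean by a single confirmation point (sketch only, names are placeholders):

# Sketch: one user confirmation right before grasp execution, instead of
# user input gating every state of the machine.
import smach


class ConfirmAndExecuteGrasp(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['executed', 'aborted'],
                             input_keys=['grasp_pose'])

    def execute(self, userdata):
        answer = input('Execute grasp at %s? [y/N] ' % str(userdata.grasp_pose))
        if answer.strip().lower() != 'y':
            return 'aborted'
        # ... trigger the actual grasp execution action here ...
        return 'executed'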

Let me know if I am missing something important here.

NB: There is a smach viewer to help understand what is happening, and as part of Tracebot, I have been working on a "tracer" that logs the input and output of every action and sequence executed into a JSON file. It should be easy to adapt to smach, if you think that could be useful.

jibweb commented 1 year ago

As part of this effort, we need to come up with a common message for the different methods. Current proposal for the known object pipeline (the step going from object pose to grasp pose will still be defined in the grasping_pipeline):

#GOAL
sensor_msgs/CameraInfo cam_info
sensor_msgs/Image color_image 
sensor_msgs/Image depth_image  # should be registered to the color image
string object_to_locate  # optional; if unspecified, returns everything found
---
#RESULT
geometry_msgs/PoseStamped[] poses
float32[6] box_extents  # Extent along +x, -x, +y, -y, +z, -z as the box might not be centered on the pose origin
float32[] confidences
string[] categories  # category name or instance name
---
# FEEDBACK
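
For illustration, the caller side could then look roughly like this (the package and action names grasping_pipeline_msgs/EstimateObjectPoses are placeholders, since the message is not defined yet; only the actionlib calls are real):

# Sketch of a caller for the proposed action; the message package/type names
# are placeholders.
import actionlib
from grasping_pipeline_msgs.msg import EstimateObjectPosesAction, EstimateObjectPosesGoal


def locate_objects(cam_info, color_image, depth_image, object_name=''):
    client = actionlib.SimpleActionClient('estimate_object_poses',
                                          EstimateObjectPosesAction)
    client.wait_for_server()
    goal = EstimateObjectPosesGoal(cam_info=cam_info,
                                   color_image=color_image,
                                   depth_image=depth_image,
                                   object_to_locate=object_name)
    client.send_goal_and_wait(goal)
    # result.poses, result.box_extents, result.confidences, result.categories
    return client.get_result()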

For methods performing pose estimation separately from detection, a reference implementation of Mask R-CNN trained on the relevant objects (YCB instances + classes relevant for RoboCup@Home) and a plane pop-out object detection will be provided, with the following action definition to call from the object pose estimation wrapper:

#GOAL
sensor_msgs/CameraInfo cam_info
sensor_msgs/Image color_image 
sensor_msgs/Image depth_image  # should be registered to the color image
string object_to_locate  # optional; if unspecified, returns everything found
---
#RESULT
float32[] confidences
string[] categories # category name or instance name
sensor_msgs/RegionOfInterest[] bboxes # A list of bounding boxes for all detected objects
sensor_msgs/Image image  # An image that can be used to send segmentation masks (values in the image correspond to the bbox index + 1; 0 corresponds to the absence of objects)
# optionally an oriented bbox? I feel like the segmentation mask and the depth image are enough to recover what might be needed?
---
# FEEDBACK
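
To make the label image encoding concrete, recovering the per-object masks would look something like this (assuming the sensor_msgs/Image has already been converted to a numpy array, e.g. with cv_bridge):

# Sketch: decode the proposed label image, where pixel value i + 1 marks pixels
# belonging to bboxes[i] and 0 marks the absence of objects.
def masks_from_label_image(label_img, num_detections):
    # label_img: HxW integer array; returns one boolean mask per detection
    return [label_img == (i + 1) for i in range(num_detections)]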

Another message suitable for the unknown object pipeline (it should be suitable for direct grasp estimation, e.g. plane pop-out + HAF, GPD, or DexNet):

#GOAL
sensor_msgs/CameraInfo cam_info
sensor_msgs/Image color_image 
sensor_msgs/Image depth_image # should be registered to the color image
---
#RESULT
geometry_msgs/PoseStamped[] grasp_poses
float32[] confidences
---
# FEEDBACK
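
On the consumer side, this result then mostly reduces to picking a grasp, e.g. the highest-confidence one (sketch; assumes grasp_poses and confidences have the same length):

# Sketch: pick the most confident grasp from the unknown-object result.
def best_grasp(result):
    if not result.grasp_poses:
        return None
    idx = max(range(len(result.confidences)), key=lambda i: result.confidences[i])
    return result.grasp_poses[idx]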

For references:

@lexihaberl @ChristianXEder @sThalham: I would love your opinion on this!

jibweb commented 1 year ago

I would love some feedback, but I think it would make sense to directly follow the University of Bremen's expected messages. They are defined at https://gitlab.informatik.uni-bremen.de/robokudo/robokudo_msgs/-/blob/main/action/GenericImgProcAnnotator.action and (for multiple images) https://gitlab.informatik.uni-bremen.de/robokudo/robokudo_msgs/-/blob/main/action/GenericImgListProcAnnotator.action, or just below:

#goal
sensor_msgs/Image rgb
sensor_msgs/Image depth
string description

---
#result
bool success
string result_feedback

# A list of bounding boxes for all detected objects
sensor_msgs/RegionOfInterest[] bounding_boxes

# Class IDs for each entry in bounding_boxes
int32[] class_ids

# Class confidence for each entry in bounding_boxes
float32[] class_confidences

# An image that can be used to send preprocessed intermediate results,
# inferred segmentation masks or maybe even a result image, depending on the use case
sensor_msgs/Image image

# The best pose for each entry in bounding_boxes
geometry_msgs/Pose[] pose_results

# Array-based string feedback when generating text for all detected objects etc.
string[] descriptions

---
#feedback
string feedback
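
For reference, consuming that result inside the grasping pipeline could be a thin adapter along these lines (sketch; the id-to-name lookup and the assumption that poses are expressed in the camera frame would need to be confirmed):

# Sketch: adapt a GenericImgProcAnnotator result to stamped poses plus category
# names. id_to_name is assumed to come from the detector's configuration.
from geometry_msgs.msg import PoseStamped


def adapt_result(result, camera_frame, stamp, id_to_name):
    detections = []
    for bbox, class_id, conf, pose in zip(result.bounding_boxes,
                                          result.class_ids,
                                          result.class_confidences,
                                          result.pose_results):
        stamped = PoseStamped()
        stamped.header.frame_id = camera_frame  # assuming camera-frame poses
        stamped.header.stamp = stamp
        stamped.pose = pose
        detections.append((id_to_name.get(class_id, str(class_id)), conf, bbox, stamped))
    return detections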

sThalham commented 1 year ago

At some point it might be limiting that only one pose and one segmentation mask are returned per bounding box. I am thinking of part-based estimates, but those are actually likely not to be inferred at the same time, so there are probably no troubles there.

Generally, it appears quite versatile and well thought through. I also like the use of standard message descriptions.

jibweb commented 1 year ago

I think, if need be, we could also work around that limitation by having multiple copies of the same bounding box and different poses at the corresponding positions.
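
In code, that workaround would look something like this (sketch):

# Sketch of the workaround: repeat the same bounding box once per candidate pose,
# so that pose_results[i] still lines up with bounding_boxes[i].
def append_multi_pose_detection(result, bbox, class_id, poses, confidences):
    for pose, conf in zip(poses, confidences):
        result.bounding_boxes.append(bbox)
        result.class_ids.append(class_id)
        result.class_confidences.append(conf)
        result.pose_results.append(pose)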

lexihaberl commented 6 months ago

I think that issue is done for now.