lexihaberl opened 1 month ago
What is the current strategy?
Sry, was interrupted by an unforeseen, sudden lunch break >.<
Scene change detection is definitely full of pitfalls: if you even slightly touch the object (which is the most likely failure mode in my opinion; we don't usually miss by 30 cm), it can fall apart. So I would lean toward re-detection.
Alternatively, the in-hand camera could provide extra information, or the robot could look at its gripper instead of the grasping area. In both cases, slightly moving the gripper should produce a predictable scene change (around the gripper and nowhere else when using the head camera, or everywhere except near the gripper when using the in-hand camera that moves with it). That could be a nice approach with minimal assumptions.
I've been testing the performance of scene change detection, but I've run into a few hiccups along the way. The issue arises because the HSR checks whether the grasp was successful before returning to the "table" waypoint, which means the point clouds from before and after the grasp are captured from different viewpoints and have to be aligned before they can be compared. Unfortunately, ICP wasn't able to reliably align these clouds. Instead, I tried using the known transformation from the camera frame to the map frame for both the before and after clouds. This approach resulted in the correct orientation but an incorrect translation:
Has anyone else faced this issue? I suspect the problem might be related to the order of translation and rotation differing between ROS tf (as mentioned here) and how Open3D handles transformations (detailed here). If anyone has encountered this problem, I’d appreciate your insights. If I can’t resolve it, I’ll consider implementing JB's idea for re-detection. (I was also thinking of using the table plane along with the marker as a reference point to transform the point cloud in this manner.)
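For reference, a minimal sketch of how that map-frame comparison could be structured, assuming a tf2 buffer that is already being filled; the camera frame name and the distance threshold are illustrative placeholders, not the pipeline's actual values:

```python
import numpy as np
import open3d as o3d
import tf.transformations as tft  # ROS 1 helper, quaternions as (x, y, z, w)

def to_map_frame(cloud, tf_buffer, stamp, camera_frame="head_rgbd_sensor_rgb_frame"):
    """Transform an Open3D point cloud from the camera frame into the map frame."""
    t = tf_buffer.lookup_transform("map", camera_frame, stamp)
    q, p = t.transform.rotation, t.transform.translation
    T = tft.quaternion_matrix([q.x, q.y, q.z, q.w])  # 4x4 matrix, translation still zero
    T[:3, 3] = [p.x, p.y, p.z]
    return cloud.transform(T)

def changed_fraction(before_map, after_map, threshold=0.02):
    """Fraction of pre-grasp points that have no nearby point after the grasp."""
    dists = np.asarray(before_map.compute_point_cloud_distance(after_map))
    return float(np.mean(dists > threshold))
```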
I am confused how the robot translation could be so wrong; it's almost like it doesn't update its position at all during the grasping motion :-/ It might be easier to only manipulate one of the two kinds of transformation. Open3D has a function to convert a quaternion to a rotation matrix, which can be combined with the translation into a 4x4 pose matrix, the input to Open3D's transform() function.
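One gotcha worth checking when going that route: ROS messages store quaternions as (x, y, z, w), while Open3D's get_rotation_matrix_from_quaternion expects (w, x, y, z). A minimal sketch of building the 4x4 matrix that transform() takes from a geometry_msgs Transform (the helper name is just illustrative):

```python
import numpy as np
import open3d as o3d

def pose_matrix_from_ros_transform(transform):
    """Build the 4x4 homogeneous matrix expected by Open3D's transform()
    from a geometry_msgs/Transform. Note the quaternion reordering."""
    q, p = transform.rotation, transform.translation
    T = np.eye(4)
    T[:3, :3] = o3d.geometry.get_rotation_matrix_from_quaternion([q.w, q.x, q.y, q.z])
    T[:3, 3] = [p.x, p.y, p.z]
    return T

# usage: cloud.transform(pose_matrix_from_ros_transform(t.transform))
```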
I updated the grasping pipeline to compare the objects on the table before and after each 'unsuccessful' grasp (gripper fully closed) by checking object names (Link). This works with known or unknown object detectors, but since we aim to use Grounded SAM2, whose object names are inconsistent, we'll pause this work and wait for the Robot Vision student to address it. If a solution is needed sooner, I can resume development. My next idea would be to compare the bounding box of the grasped object before and after the grasp. This has its own problems, such as when an object ends up back in the same position between grasps, so we may need to combine it with a category-level object name comparison for better accuracy.
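A rough sketch of what the bounding-box comparison could look like; the detection fields (box, label) and the IoU threshold are assumptions about the detector's output, not existing pipeline code:

```python
def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def object_still_on_table(grasped_box, grasped_category, detections_after, iou_threshold=0.5):
    """Grasp likely failed if a detection of the same (coarse) category still
    overlaps the pre-grasp bounding box."""
    return any(iou_2d(grasped_box, d.box) >= iou_threshold and d.label == grasped_category
               for d in detections_after)
```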
Problem: Currently, we determine the success of a grasp based on whether the gripper is fully closed. This works well for most objects, since the gripper doesn't fully close due to their thickness. However, this approach leads to false negatives with thin objects (e.g., sheets of paper), as the gripper may fully close even when the object has been successfully grasped.
We need an additional check to verify if the object has been removed after the initial gripper-closure check 'failed'.
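For context, the current heuristic boils down to something like the sketch below; the joint name and the closed-position threshold are assumptions and may differ on the actual HSR setup:

```python
def gripper_holds_something(joint_states, closed_threshold=-0.7):
    """Current check: if the gripper did not close completely, something is
    probably between the fingers. Fails for very thin objects such as paper."""
    idx = joint_states.name.index("hand_motor_joint")  # assumed HSR gripper joint
    return joint_states.position[idx] > closed_threshold
```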
Potential solutions discussed in the meeting:
Object re-detection: Re-run the object detection process after grasping to check if the object is still visible. This approach may not work well with methods like Grounded SAM due to the instability of their detections.
Simple scene change detection (using point clouds): Detect changes in the scene to determine if the object has been grasped. However, this method may struggle to differentiate between an object being knocked over or being successfully grasped, as both cause significant scene changes. Additionally, this method might not be the best solution for our use case, since we only need this approach for flat objects, which don't have that many depth pixels in the first place.
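If the point-cloud route is tried anyway, one way to keep it simple would be to restrict the comparison to the object's pre-grasp region, e.g. by cropping both map-frame clouds to the object's old bounding box and counting how many points disappeared. This is only a sketch with an illustrative threshold, and it inherits the flat-object limitation mentioned above:

```python
def object_region_cleared(before_map, after_map, obj_bbox_map, removed_fraction=0.8):
    """Crop both map-frame clouds to the object's pre-grasp bounding box
    (an Open3D AxisAlignedBoundingBox) and check how many points vanished."""
    n_before = len(before_map.crop(obj_bbox_map).points)
    n_after = len(after_map.crop(obj_bbox_map).points)
    if n_before == 0:  # flat objects may have almost no depth pixels to begin with
        return False   # cannot tell; treat as "not cleared" to stay conservative
    return (n_before - n_after) / n_before >= removed_fraction
```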