moveit / moveit_ros

THIS REPO HAS MOVED TO https://github.com/ros-planning/moveit
69 stars 118 forks source link

pointcloud_octomap_updater Segfault with PointCloud2Iterator #505

Closed davetcoleman closed 9 years ago

davetcoleman commented 9 years ago

@vrabaud

Filtering point clouds with Primesense / Asus Xtion / MoveIt! worked fine for me until I believe https://github.com/ros-planning/moveit_ros/commit/db046243ac0628ad28f2b71bde9af76f25cf54ce. Now it segfaults.

I'm listening to topic

/camera/depth_registered/points

Here is an example rosbag recording

It's header is:

header: 
  seq: 195611
  stamp: 
    secs: 1413582743
    nsecs: 277334231
  frame_id: /camera_rgb_optical_frame
height: 480
width: 640
fields: 
  - 
    name: x
    offset: 0
    datatype: 7
    count: 1
  - 
    name: y
    offset: 4
    datatype: 7
    count: 1
  - 
    name: z
    offset: 8
    datatype: 7
    count: 1
  - 
    name: rgb
    offset: 16
    datatype: 7
    count: 1
is_bigendian: False
point_step: 32
row_step: 20480

When I run move_group with point cloud filtering enabled, it crashes - backtrace - with:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe1222700 (LWP 18779)]
0x00007fffd1abde86 in occupancy_map_monitor::PointCloudOctomapUpdater::cloudMsgCallback (this=0xb92080, cloud_msg=...) at /home/dave/ros/ws_moveit/src/moveit_ros/perception/pointcloud_octomap_updater/src/pointcloud_octomap_updater.cpp:244
244         if (!isnan(*iter_x) && !isnan(*iter_y) && !isnan(*iter_z))

I dug pretty deep into the segfault - iter_x is moving past its max size. On row 25 out of 480 it already suprases its bound. I suspect it is because the row_step is 20480. iter_x is set to index 512000 but the cloud_msg height*width = 307200.

I tried this very related PR https://github.com/ros-planning/moveit_ros/pull/504, and I believe it is part of the solution, but I still get the segfault.

jasedit commented 9 years ago

It looks like PointCloud2Iterators iterate point-at-a-time, based on reading the relevant header. That would suggest that the iterators should not be incremented by row_step.

jasedit commented 9 years ago

I have a potential fix for this, but I cannot reproduce the issue with the rosbag provided. I can't publish tf's in the same time frame as the log, since it doesn't have any /clock topic. If there's a way to publish a tf with a specific timestamp, then I could validate my fix, but I have yet to figure out an easy way to do this.

davetcoleman commented 9 years ago

I can provide a better rosbag on Monday, but I wouldn't mind seeing your potential fix now.

jasedit commented 9 years ago

I've pushed the fix to my fork, which has included it in Pull Request #504.

vrabaud commented 9 years ago

I believe the fix is more:

iter_x += (row*cloud_msg->row_step) / cloud_msg->point_step;
davetcoleman commented 9 years ago

It doesn't crash anymore! But it is publishing an empty pointcloud, by that I mean all 0's:

header: 
  seq: 258
  stamp: 
    secs: 1413840761
    nsecs: 207251985
  frame_id: /camera_rgb_optical_frame
height: 1
width: 307200
fields: 
  - 
    name: x
    offset: 0
    datatype: 7
    count: 1
  - 
    name: y
    offset: 4
    datatype: 7
    count: 1
  - 
    name: z
    offset: 8
    datatype: 7
    count: 1
is_bigendian: False
point_step: 16
row_step: 4915200
data: [0, 0, 0, 0, 0, 0, 0, 0... 
jasedit commented 9 years ago

Does it publish all 0's using the code from #504 as well? The iteration in db04624 iterates over the filtered_cloud only - it doesn't actually process the incoming point cloud. I would expect all 0's from that code if you had a filtered_cloud_topic set, and a crash if you didn't.

davetcoleman commented 9 years ago

Brilliant! that worked, thanks so much @jasedit @vrabaud ~

jasedit commented 9 years ago

What changes should I make to #504 for this? Right now it has an attempted fix for this which is similar to the fix @vrabaud mentions above, but is slightly wrong if row_step/point_step != width. I can adjust my version to match, or eliminate the attempted fix altogether - just let me know.

vrabaud commented 9 years ago

just made https://github.com/ros-planning/moveit_ros/pull/506

davetcoleman commented 9 years ago

actually, it appears to be passing through all the points and not doing any filtering of my robot.

My sensors_rgbd.yaml file:

sensors:
  - sensor_plugin: occupancy_map_monitor/PointCloudOctomapUpdater
    point_cloud_topic: /camera/depth_registered/points
    max_range: 3.0 # The max range (in m) of the sensor
    padding_offset: 0.08 #0.15  #The padding offset to be considered for robot links and attached objects when self-filtering
    padding_scale: 1 #The padding scale to be considered for robot links and attached objects when self-filtering
    point_subsample: 1 # If the update process is too slow, points could be subsampled. Values above 1 make the updated skip points instead of processing them.
    filtered_cloud_topic: filtered_cloud # If this parameter is specified, the filtered cloud (without robot parts) is also republished. This makes things a little less efficient but can be useful for debugging.
davetcoleman commented 9 years ago

Side note: I still don't understand the difference between padding_scale and padding_offset

vrabaud commented 9 years ago

aaaah, too much overlap. @jasedit your fix also contains the CMake one so we might as well take it.

jasedit commented 9 years ago

I thought padding_offset was adding some additional padding, while padding_scale was scaling the size of links (e.g. multiplicative.) I haven't dug through all the code to verify that's the case though.

davetcoleman commented 9 years ago

I don't understand how those are different - they both equally make the collision area larger beyond the robot's actual size, right?

davetcoleman commented 9 years ago

Details on filtering broken: I added a counter inside the check for NaN:

if (!isnan(*iter_x) && !isnan(*iter_y) && !isnan(*iter_z))                                              
{                                                                                                                    
    unfiltered_size ++; 

Then subtracted to get the total points filtered:

 std::cout << "Diff in size: " << unfiltered_size - filtered_cloud_size << std::endl; 

And it is always zero.

jasedit commented 9 years ago

If a link was a 5m cube, a padding_offset of 0.01 would make it a 5.02m cube box (adds .01 on each side of the box). A padding_scale of of 0.01 would make it 5.05m on a side (adds 5% to each dimension of the box.) Again, I could be wrong - that's just how I interpreted it.

davetcoleman commented 9 years ago

I don't understand why we would need both an additive padding, and a multiplicative padding, but I suppose there is a use case.

I'm still not sure why filtering isn't working, even with this PR...

jasedit commented 9 years ago

For the filtering, is the sensor in the same frame as the body? I don't see any place where the incoming message is transformed into a frame relative to the shapes being filtered against - if the frames are different and place correctly, you would see rather weird results depending on relative positioning.

davetcoleman commented 9 years ago

Yes, I have the camera calibrated to the body using TF.

I tried making the padding huge (99) which should filter everything even if it was a reference frame issue, but it still did not work.

jasedit commented 9 years ago

I can confirm that scaling and padding work as I described. You can trace the values down into geometric_shapes, where they are used in the fashion I described (e.g. line 147 of bodies.cpp)

I would assume scaling acts as a coarse grain padding dial, whereas padding is a fine grained one. But it appears to provide both multiplicative and additive padding factors.

davetcoleman commented 9 years ago

It seems my shape mask is not transformed to the correct place. I take the merged bounding spheres' radius and center point (from shape_mask.cpp) and publish it over Baxter and I get:

sphere

I haven't figure out where the transform is wrong... TF is setup correctly and everything I am attempting with Baxter I had working last Spring. I do not believe the change is wrong on my end - here is the TF:

sphere2

You can see I have a well calibrated Asus Xtion pointed at Baxter

davetcoleman commented 9 years ago

There seems to be an issue with

const tf::Vector3 &sensor_origin_tf = map_H_sensor.getOrigin();

and related code... the sensor origin does not seem to be ever used anymore, since this commit: https://github.com/ros-planning/moveit_ros/commit/7690c05ed722ce2fb201f3aa0f95344007dab6ce Any advice is appreciated!

jasedit commented 9 years ago

The sensor data is processed with whatever coordinates come in - it seems like there needs to be a transformation of the data into a common frame, whether before it arrives here, or inside the processing loop.

jasedit commented 9 years ago

The commit you reference seems strange, in that it implies that previously it was assumed the sensor data and body frame were axis aligned, but not the same coordinate frame. Is that commit when filtering stopped working?

davetcoleman commented 9 years ago

Ignore my last two posts, I was looking at the data wrong. My transforms are all working correctly, the problem is just in how it copies the original point cloud to the filtered one. Somehow its getting the wrong indexes or column/row ordering. I believe this because I inverted the filter such that it is only publishing the points that would have been removed (the ones in collision with the robot) and I still get a fairly evenly scattered (but randomly filtered) point cloud as if I was only down sampling the whole cloud (but I have downsampling disabled)

jasedit commented 9 years ago

So if you remove the filter inversion, you get points that are not on the robot, but they appear down sampled?

davetcoleman commented 9 years ago

Yes, that too. Its just randomly removing points.

davetcoleman commented 9 years ago

I removed this commit https://github.com/ros-planning/moveit_ros/commit/db046243ac0628ad28f2b71bde9af76f25cf54ce and everything works fine again. I've spent 3 days on this bug so am a little worn out, but perhaps @vrabaud can look at it again.

mikeferguson commented 9 years ago

@davetcoleman -- do you have an example moveit config you can point me at? I haven't used the updater in a while, but I can update maxwell and try to see if I can track anything down.

davetcoleman commented 9 years ago

Of course - this is my sensors_rgbd.yaml and below that file you can find everything else. Let me know what else you need, thanks!

mikeferguson commented 9 years ago

Ok, the issue is definitely in ShapeMask::maskContainment:

I'm not sure how we want to fix this for real, we could take the fix I have, it at least works, but I'm not sure the performance hit. I'm going to continue looking into this a bit more tonight, but not sure I'll have a full solution.

mikeferguson commented 9 years ago

Ok, found the bug in the point cloud iterator -- no changes are actually needed to ShapeMask, but https://github.com/ros/common_msgs/pull/47 needs to be merged and released.

mikeferguson commented 9 years ago

Also #504 should be merged, I put a few comments there (mainly do we want to squash that into a single commit to keep the history clean).

@davetcoleman could you check out that branch of sensor_msgs and see if it works for you (my only other concern is the frame issues, not sure if your TF tree is vastly different from mine because of the kinect location).

vrabaud commented 9 years ago

@davetcoleman , I hope this iterator fix works for you. I'll release it today. BIG THX @mikeferguson

davetcoleman commented 9 years ago

will test in the AM

davetcoleman commented 9 years ago

504 works, awesome! Thanks for your help

mikeferguson commented 9 years ago

Sweet!