Segfault in gazebo when robot moves out of start box

osrf-migration commented 7 years ago

Original report (archived issue) by Vinayak Jagtap (Bitbucket: vinayak_jagtap).

The original report had attachments: crash.btall.txt

We are seeing intermittent segfaults in gazebo at the point when the robot moves out of the start box. The simulation hangs and does not respond for about 30 seconds after which the process dies. Following message is seen in dmesg. Did anyone else see this behavior?

[  396.949329] gzserver[7810]: segfault at 10 ip 00007f274029fb7e sp 00007f26daffc700 error 4 in libgazebo_transport.so.7.6.0[7f274025a000+69000]

osrf-migration commented 7 years ago

Original comment by Vinayak Jagtap (Bitbucket: vinayak_jagtap).

Just after posting this issue I noticed that version of libgazebo was 7.6 on that computer. Hopefully, it is fixed in 7.7.

osrf-migration commented 7 years ago

Original comment by sumanth nirmal (Bitbucket: Sumanth-Nirmal).

It fails with gazebo 7.7 also, with the same message gzserver[23163]: segfault at 10 ip 00007f7181ee2b7e sp 00007f7136ffc700 error 4 in libgazebo_transport.so.7.7.0[7f7181e9d000+69000]

osrf-migration commented 7 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).

set component to "world"

osrf-migration commented 7 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).

Is this happening for all tasks? Which world are you using? What messages do you see on the terminal before the crash? For example:

[Msg] Task [1] - Started: time will start counting as you leave the box.
[Msg] Started box contains plugin [task1/start]
[Msg] Task [1] - Checkpoint [1] - Started (65 21000000)
[Msg] Started box contains plugin [task1/checkpoint1]

osrf-migration commented 7 years ago

Original comment by Vinayak Jagtap (Bitbucket: vinayak_jagtap).

We were using unique_task1.world, I'll add more details when I see that error again.

osrf-migration commented 7 years ago

Original comment by Rud Merriam (Bitbucket: rmerriam).

Don't have details to post but I've seen it, also, a couple times.

osrf-migration commented 7 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).

Thanks. If you're building from source, one thing that might help when upgrading Gazebo versions is to recompile your whole stack (delete the build folder).

osrf-migration commented 7 years ago

Original comment by Rud Merriam (Bitbucket: rmerriam).

I have seen this happen right after the Task 1 Checkpoint 1 message is posted on the console. I'm not building from source, using only the packaged competition software.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).

If you are running inside docker, you can pass --ulimit core=1000000000:1000000000 to the docker run command so that it will generate corefiles. I think they will be created in ~/.ros/. That could be useful for indicating what caused the seg-fault.

osrf-migration commented 7 years ago

Original comment by sumanth nirmal (Bitbucket: Sumanth-Nirmal).

I am not running inside docker, this is the message I see on terminal after segfault:

[Msg] Task [1] - Started: time will start counting as you leave the box.

[Msg] Started box contains plugin [task1/start]
[Msg] Task [1] - Checkpoint [1] - Started (82 972000000)
Segmentation fault
[gazebo-6] process has died [pid 23258, exit code 139, cmd /opt/ros/indigo/lib/gazebo_ros/gzserver --verbose -e ode /opt/ros/indigo/share/srcsim/worlds/unique_task1.world

osrf-migration commented 7 years ago

Original comment by Vinayak Jagtap (Bitbucket: vinayak_jagtap).

Seeing the same error in task 2 when checkpoint 1 is completed. Are there any specific logs that we should share from local computer not running docker that would help in troubleshooting this issue?

srcsim is installed using apt-get.

[Msg] Model [solar_panel] started touching [valkyrie] at 482 897000000 seconds
[Msg] Model [solar_panel] touched [valkyrie] exclusively for 0 500000000 seconds
[Msg] Stopped touch plugin [task2/checkpoint1]
[Msg] Task [2] - Checkpoint [1] - Completed (483 399000000)
[Msg] Task [2] - Checkpoint [2] - Started (483 399000000)
Segmentation fault
[gazebo-6] process has died [pid 27042, exit code 139, cmd /opt/ros/indigo/lib/gazebo_ros/gzserver --verbose -e ode /opt/ros/indigo/share/srcsim/worlds/unique_task2.world __name:=gazebo __log:=/home/whrl/.ros/log/685d873a-3a64-11e7-b645-902b34d61a86/gazebo-6.log].
log file: /home/whrl/.ros/log/685d873a-3a64-11e7-b645-902b34d61a86/gazebo-6*.log

osrf-migration commented 7 years ago

Original comment by Vinayak Jagtap (Bitbucket: vinayak_jagtap).

Saw this error twice today after picking up the solar panel in task2. I'm using higher_array branch.

[Msg] Stopped touch plugin [task2/checkpoint1]
[Msg] Task [2] - Checkpoint [1] - Completed (113 869000000)
[Msg] Task [2] - Checkpoint [2] - Started (113 869000000)
gzserver: /usr/include/boost/smart_ptr/shared_ptr.hpp:653: typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with T = gazebo::transport::Publisher; typename boost::detail::sp_member_access<T>::type = gazebo::transport::Publisher*]: Assertion `px != 0' failed.
Aborted (core dumped)
[gazebo-6] process has died [pid 2619, exit code 134, cmd /opt/ros/indigo/lib/gazebo_ros/gzserver -r --verbose -e ode /home/ninja/indigo_ws/src/space_robotics_challenge/external/srcsim/worlds/unique_task2.world __name:=gazebo __log:=/home/ninja/.ros/log/c3d2c4d0-400f-11e7-933f-94de806528fb/gazebo-6.log].
log file: /home/ninja/.ros/log/c3d2c4d0-400f-11e7-933f-94de806528fb/gazebo-6*.log

osrf-migration commented 7 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).

I suspect these Gazebo transport errors are due to race conditions, which is really tough to reproduce and debug. If anyone is running the competition through gdb, it would be great to get a backtrace.

osrf-migration commented 7 years ago

Original comment by Shlok Agarwal (Bitbucket: shlokagarwal).

Saw this error while starting task2.

(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1  0x00007fff42392694 in gazebo::Task::Update(gazebo::common::Time const&) ()
   from /opt/ros/indigo/lib/libFinalsPlugin.so
#2  0x00007fff42384999 in gazebo::FinalsPlugin::OnUpdate(gazebo::common::UpdateInfo const&) ()
   from /opt/ros/indigo/lib/libFinalsPlugin.so
#3  0x00007ffff62e9219 in ?? () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.7
#4  0x00007ffff62d5dc1 in gazebo::physics::World::Update() ()
   from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.7
#5  0x00007ffff62e400b in gazebo::physics::World::Step() () from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.7
#6  0x00007ffff62e447d in gazebo::physics::World::RunLoop() ()
   from /usr/lib/x86_64-linux-gnu/libgazebo_physics.so.7
#7  0x00007ffff42c3a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#8  0x00007ffff55a5184 in start_thread (arg=0x7fff437fe700) at pthread_create.c:312
#9  0x00007ffff6973bed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).

Thanks for the backtrace. This is pointing towards something in the Task::Update function.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).

Can you provide console output and a multi-thread backtrace next time?

(gdb) thread apply all bt

The line you've pointed at just looks like it's waiting for the mutex to be freed. Does the simulation crash or is it just hanging at this point?

osrf-migration commented 7 years ago

Original comment by Shlok Agarwal (Bitbucket: shlokagarwal).

@stevenpeter91 . I came across the same error again (while starting task2 ) and this is the console output after (gdb) thread apply all bt

Msg] [Harness] Detached

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff437fe700 (LWP 30386)]
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
66  ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) thread apply all bt

Thread 63 (Thread 0x7ffef97fa700 (LWP 31071)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fffd4912d75 in bool boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000000l> >(boost::unique_lock<boost::mutex>&, boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000000l> const&) () from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007fffd4910d8d in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007fff085412d2 in gazebo::SRCSimRosHarness::QueueThread() ()
   from /opt/ros/indigo/lib/libsrcsim_ros_harness.so
#4  0x00007ffff42c3a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#5  0x00007ffff55a5184 in start_thread (arg=0x7ffef97fa700) at pthread_create.c:312
#6  0x00007ffff6973bed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 62 (Thread 0x7ffef9ffb700 (LWP 31060)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fffd4912d75 in bool boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000000l> >(boost::unique_lock<boost::mutex>&, boost::date_time::subsecond_duration<boost::posix_time::time_duration, 1000000l> const&) () from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007fffd4910d8d in ros::CallbackQueue::callAvailable(ros::WallDuration) ()
   from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007fff0af77d2a in gazebo::GazeboRosIMU::IMUQueueThread() () from /opt/ros/indigo/lib/libgazebo_ros_imu.so
#4  0x00007ffff42c3a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#5  0x00007ffff55a5184 in start_thread (arg=0x7ffef9ffb700) at pthread_create.c:312
#6  0x00007ffff6973bed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 61 (Thread 0x7ffefa7fc700 (LWP 31049)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238

The controller crashed after this.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).

We are seeing this occur with more frequency; I'm running on r744, and 20% of our recent 'batch' runs exhibited this. (I'm trying to modify the batch to auto generate a thread apply bt all, but it's tricky). My team mates report similar increases of occurrence of this. It could be something in the patches from 0.6 to 0.7 increased the likely hood of this. Could just be superstition. (Timing bugs are so much fun!)

osrf-migration commented 7 years ago

Original comment by Rud Merriam (Bitbucket: rmerriam).

In contrast I'm seeing fewer problems. Have you updated Gazebo? A new package dropped last night...I think it was last night. Was up 24 hours and just go up from 4 hour hap.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).

ah, interesting. Did not know about 7.8.1.1; was on 7.8.0.1. Will try that and see. Thanks for the tip!

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).

7.8.1.1 did seem to improve stability. But I did get a run of gazebo crashes. This was on a random world, using r744 with r749 (harness) patched in. The crash came when commanding the sim to start T3CP6. I got a few in a row; attaching one of the traces in the off chance it helps. I'll shift to a non random world and see if I can reproduce.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).

set attachment to "crash.btall.txt"

osrf / srcsim

Segfault in gazebo when robot moves out of start box #162