rosflight / rosflight_ros_pkgs

ROS packages for the ROSflight autopilot
http://rosflight.org/
BSD 3-Clause "New" or "Revised" License
86 stars 56 forks source link

"boost thread: trying joining itself: Resource deadlock avoided" when running rosflight_io #74

Closed skohlbr closed 9 months ago

skohlbr commented 6 years ago
  1. Ubuntu 16.04/kinetic/rosflight compiled from source
  2. branch/fork: https://github.com/tu-darmstadt-ros-pkg/rosflight/tree/enu_output (This is equivalent to master, but with some minor edits to output IMU data in ENU format).
  3. Machine: NUC with Core i5-7260U

When trying to connect to the flight controller running

rosrun rosflight rosflight_io

I get the following:

[ INFO] [1523636817.582107242]: Connecting to serial port "/dev/naze_fc", at 921600 baud
terminate called after throwing an instance of 'boost::exception_detail::clone_impl >'
  what():  boost thread: trying joining itself: Resource deadlock avoided

I tried different USB ports and things started working as expected after plugging the cable into one of them. Afterwards I could also use other USB ports and everything worked as expected. Then I rebooted the system and I couldn't get it to work again, no matter how much USB port switching I'd do. Getting the exception seems to be the default now and I have no idea how to fix it currently. I noticed the same issue is described in https://github.com/byu-magicc/fcu_io/issues/42. I tried removing the drivers with rmmod, but this did not appear to help. I also peppered the startup code of the rosflight_io node with sleep calls (as the expection looks like some race conditions/threading issue), but this does not help either.

Here's a gdb backtrace (rosflight compiled in Release mode though, can try Debug later):

(gdb) bt
#0  0x00007ffff634d428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff634f02a in __GI_abort () at abort.c:89
#2  0x00007ffff6c9084d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff6c8e6b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff6c8e701 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff6c8e919 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bae5fc in void boost::throw_exception(boost::thread_resource_error const&) () from /home/jasmine/argo/devel/.private/rosflight/lib/libmavrosflight.so
#7  0x00007ffff7ba93f7 in mavrosflight::MavlinkComm::close() () from /home/jasmine/argo/devel/.private/rosflight/lib/libmavrosflight.so
#8  0x00007ffff7bb4157 in boost::asio::detail::descriptor_read_op >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) () from /home/jasmine/argo/devel/.private/rosflight/lib/libmavrosflight.so
#9  0x00007ffff7badc18 in boost::asio::io_service::run() () from /home/jasmine/argo/devel/.private/rosflight/lib/libmavrosflight.so
#10 0x00007ffff61035d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#11 0x00007ffff5edc6ba in start_thread (arg=0x7fffeaffd700) at pthread_create.c:333
#12 0x00007ffff641f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Any ideas/hints on how to fix things are appreciated.

dpkoch commented 6 years ago

Have you verified that you can connect to /dev/naze_fc with other programs, like screen or miniterm, when you're getting this message? That's the error message that also gets thrown if the USB is unplugged while rosflight_io is running, so it'd be good to verify that this problem is specific to rosflight_io and isn't a system level thing.

skohlbr commented 6 years ago

Ok so after your comment my first suspected culprit was the USB cable. Tried multiple ones, same issues.

Then I tried connecting the FC to another computer running the same S/W setup - worked as intended every single time.

Running screen

screen /dev/ttyUSB0 921600

gives me this kind of stuff running across the screen:

�������!�������������!������������� ������������ �����������L ������������L���������������,!��~��������!�������������!�������������!�����������$!������������`!��������<���A!������������` ������������ ��������� ����������� �����������, ��~���������m ������������!��<���������i!������������!������������ !��}����������!������������E!������������!!�����
      ��                  ����I ������������i!������������(!�����������e!�������������!�����                                                               �����D ������������! ������������� ���������!�� ��}���������������                                                                                                                                                        �����-!�����������I!�������������!������������!�������a �����������I ������������ ������������� ��������� ������������!��>��������H!�����������h!���������� !��-�    �}���O���L ������������H!�����
                                   �Y?���!�����������Ip��������\���9����������!�����������a!������y��}��`!�����������!�]���������e�������g���� ����������������%!������������!��x�������� ���������`

This (from manual inspection) looks the same on the machine where things are working and the one where I experience the described issue.

Going out on a limb here, but maybe there is some race conditions in mavrosflight that gets exposed on this relatively fast single-core performance (7th generation i5) CPU (?)

skohlbr commented 6 years ago

Ok, so I compiled in 'Debug' mode to get a better backtrace and now the problem is gone. Given 'Debug' runs slower, this is another reason to suspect that we're seeing a timing/race condition issue.

skohlbr commented 6 years ago

So using some console output in 'Release' I could confirm that the expection happens in this line: https://github.com/rosflight/rosflight/blob/master/rosflight/src/mavrosflight/mavlink_comm.cpp#L73 (not very surprising I guess :) ).

superjax commented 6 years ago

That is super interesting. Thanks for looking into this! I'm running 6th gen i7's on several computers and have never seen this, except in the condition @dpkoch described. How is your computer so fast!?! Haha!

I've seen a similar problem, but with different conditions, and solved it by resetting the port through linux with

stty -F /dev/ttyUSB0 sane

I wonder if that might help here too, though from what you've said it sounds like something different.

dpkoch commented 6 years ago

Interesting. So the error message is coming from inside the close() function, which is gets called when a serial error is encountered elsewhere. This is what we see when the USB is unplugged, but something else must be causing an error for you.

So it's likely that there is a race condition that would be nice to fix, but for now it looks like we've only seen it after another error has already killed the serial communication anyway. So the more critical fix for you is most likely going to be figuring out why an error is occurring in the first place