Open patmarion opened 8 years ago
Thanks for pointing this out. Which version of LCM did you use at JSC? We haven't seen this bug last week on Valkyrie, but we also didn't run it for more than ~20 minutes at a time. We've used LCM 1.3.0 on Valkyrie and 1.1.0 on the operator workstation.
(Wolfgang) We are now seeing this issue as well every now and then. Will investigate. Do you happen to remember whether you started the controller manager through the user interface or through a terminal?
Started through a terminal. But this issue is not specific to the controller manager. In fact, it is easy to reproduce with a simple main.cpp program that links against LCM and ROS, such that both libpcre and libPocoFoundation are listed when running ldd on the binary.
@psiorx I think I saw this issue in a program you are working on, too. I think I noticed that one of your terminals printed the "LCM self test failed!!" warning.
Thanks - checked and that's the same thing we are seeing.
Did you check whether there was a difference depending on how you compiled, e.g. whether the output type was release or debug?
Did you see the same issue on a different computer? Or was it specific to link?
No, I don't think it's specific to the link computer, I've seen this issue on other computers. The same issue can also occur when linking Matlab mex libraries (in 2014b and newer), so the issue is not even specific to LCM/ROS. This is a bug that can be reproduced with a simple program on any Ubuntu 14 computer (and possibly other OS versions, haven't tested). The comment I linked to in my original post explains the issue with a lot of detail, but I'm not sure how to resolve it.
Pat, do you have any leads/ideas that I could follow in order to work towards resolving this issue? Would you recommend trying either John's suggestion or LD_PRELOAD?
Empirically, without LD_PRELOAD one out of five launches fails (10 trials). Using LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libpcre.so" roslaunch valkyrie_translator NeckControl.launch
20 launches went through without crashing. I think that could be an acceptable workaround.
Do you still see it print the message: "LCM self test failed!!" ?
I think that the right thing to do is to write a minimal C++ program that demonstrates the issue. When it runs under valgrind you will see the corruption. Then check whether using LD_PRELOAD makes the valgrind message disappear. You still might want to find a better fix, though.
The minimal program would call lcm subscribe and also call at least one function from libPocoFoundation, and be linked to both lcm and libPocoFoundation.so. It should even be possible to write a minimal program that only uses g_regex_new (called by lcm subscribe) and remove lcm from the minimal program completely. A possible fix could be to extend LCM with a build option that lets it use libpcre directly for regex instead of going through GLib. The problem appears to be related to mixing GLib's GRegex with the regex API from libpcre, which is what libPocoFoundation uses.
Here are some relevant links I found:
https://bugs.launchpad.net/ubuntu/+source/pcre3/+bug/1361610 (see comment 4)
https://github.com/pocoproject/poco/blob/develop/Foundation/src/RegularExpression.cpp
https://developer.gnome.org/glib/stable/glib-building.html (search for pcre on this page)
The "LCM self test failed!!" is gone
Sounds promising, @wxmerkt , thanks for pushing!
This bug occurred when we were testing at JSC. I'm curious whether you have seen it. There is a memory bug that causes LCM to incorrectly parse channel names, which can result in LCM failing to subscribe to a channel depending on the channel name. You might also see this output when LCM initializes:
lcm_subscribe: Error while compiling regular expression ^LCM_SELF_TEST$ at char 0: unknown option bit(s) set
LCM self test failed!!
The bug is also visible if you run the program with valgrind (which can be difficult to do with a ROS plugin). The problem is that the wrong version of a regular expression function is called. If you run ldd /path/to/plugin.so and you see both of these libraries, libpcre and libPocoFoundation, then you might have the bug. This bug also occurred when we used LCM and Matlab libraries together. See more info at this comment:
https://github.com/openhumanoids/oh-distro-private/issues/644#issuecomment-97905467
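The ldd check described above can also be done from inside a running process, which may help when the plugin is loaded by ROS and running ldd on every dependency is awkward. The following is only a sketch, assuming Linux with glibc (it relies on dl_iterate_phdr from <link.h>). The helper names loaded_libraries, contains_library, may_have_pcre_conflict and report_pcre_conflict are hypothetical and not part of LCM, ROS or Poco. Seeing both libraries loaded only means the process might be affected; it does not prove that the wrong regex function is being called.

```cpp
// Sketch: list the shared objects loaded into the current process and flag
// the libpcre + libPocoFoundation combination discussed in this thread.
// Assumes Linux with glibc; all helper names are hypothetical.
#include <link.h>  // dl_iterate_phdr, struct dl_phdr_info (glibc)

#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

// Callback for dl_iterate_phdr: record the path of each loaded object.
// The main executable is reported with an empty name and is skipped.
static int collect_name(struct dl_phdr_info* info, std::size_t, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0') {
        names->push_back(info->dlpi_name);
    }
    return 0;  // returning zero tells dl_iterate_phdr to keep iterating
}

// Paths of all shared objects currently loaded into this process.
std::vector<std::string> loaded_libraries() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect_name, &names);
    return names;
}

// True if any path in libs contains the substring needle.
bool contains_library(const std::vector<std::string>& libs,
                      const std::string& needle) {
    for (const auto& lib : libs) {
        if (lib.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// True when both libraries named in this thread appear in the list.
bool may_have_pcre_conflict(const std::vector<std::string>& libs) {
    return contains_library(libs, "libpcre") &&
           contains_library(libs, "libPocoFoundation");
}

// Print every loaded library, then a warning if both are present.
void report_pcre_conflict(std::ostream& out) {
    const std::vector<std::string> libs = loaded_libraries();
    for (const auto& lib : libs) {
        out << lib << "\n";
    }
    if (may_have_pcre_conflict(libs)) {
        out << "warning: libpcre and libPocoFoundation are both loaded\n";
    }
}
```

Calling report_pcre_conflict(std::cerr) early in the plugin or in a test main.cpp prints the same kind of information as ldd, but for the libraries that are actually loaded at that point.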
And this was the valgrind output when running a simple test main.cpp program that creates an instance of the plugin:
==11504== Invalid read of size 1
==11504==    at 0x7C81AB3: ??? (in /lib/x86_64-linux-gnu/libpcre.so.3.13.1)
==11504==    by 0x7C82706: ??? (in /lib/x86_64-linux-gnu/libpcre.so.3.13.1)
==11504==    by 0x7C8BA23: pcre_compile2 (in /lib/x86_64-linux-gnu/libpcre.so.3.13.1)
==11504==    by 0x6958A7B: g_regex_new (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0)
==11504==    by 0x5B6FF40: lcm_subscribe (lcm.c:321)
==11504==    by 0x5B72A4D: udpm_self_test (lcm_udpm.c:822)
==11504==    by 0x5B734E3: _setup_recv_parts (lcm_udpm.c:1057)
==11504==    by 0x5B72104: lcm_udpm_subscribe (lcm_udpm.c:604)
==11504==    by 0x5B6FE72: lcm_subscribe (lcm.c:303)
==11504==    by 0x4E669DD: lcm::Subscription* lcm::LCM::subscribe<drc::robot_command_t, valkyrie_translator::LCM2ROSControl_LCMHandler>(