cdondrup opened this issue 11 years ago
I occasionally see this problem too, but not often. On bob the serial ports are:
chris@bob:~$ dmesg | grep tty
[ 0.000000] console [tty0] enabled
[ 0.459564] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[ 0.479957] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[ 0.500347] serial8250: ttyS2 at I/O 0x3e8 (irq = 4) is a 16550A
[ 0.520723] serial8250: ttyS3 at I/O 0x2e8 (irq = 3) is a 16550A
[ 0.541753] 00:0a: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[ 0.562156] 00:0b: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
[ 0.582556] 00:0c: ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
[ 0.602952] 00:0d: ttyS3 at I/O 0x2e8 (irq = 11) is a 16550A
[ 11.673454] usb 2-1.8: FTDI USB Serial Device converter now attached to ttyUSB0
[ 11.673902] usb 2-1.8: FTDI USB Serial Device converter now attached to ttyUSB1
[ 11.674271] usb 2-1.8: FTDI USB Serial Device converter now attached to ttyUSB2
[ 11.674633] usb 2-1.8: FTDI USB Serial Device converter now attached to ttyUSB3
As far as I understand it: we have four serial ports built into the motherboard. These are ttyS[0,1,2,3]. The pan-tilt unit is using ttyS0, another one is free on the motherboard back plate, and there must be two more on internal headers, probably unused. Then there are four more serial ports, ttyUSB[0,1,2,3], provided by a USB-serial adapter, which ML use to communicate with the robot base and the laser. The base is controlled over the CAN bus, for which ML have built a CAN-to-serial adapter; this goes through ttyUSB2. The SICK S300 uses serial through ttyUSB0, which is connected via the "special" laser port on the back of the robot, below the motherboard.
Whenever we get the CAN bus problem, we also get the laser checksum problem. This is probably because both are connected to the same USB-serial adapter. ML don't use the same laser driver as us, so maybe they have set some different parameters to make the ML CAN adapter and the laser play together on the same USB-serial adapter?
So we could ask ML if they have any special restrictions on the use of the laser to avoid interference with the CAN, and we could try moving the laser off the same bus and putting it into the spare port on the motherboard.
Agreed. Please ask them.
here's the USB setup on Linda:
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 002 Device 003: ID 04e7:0020 Elo TouchSystems Touchscreen Interface (2700)
Bus 002 Device 004: ID 08bb:29b2 Texas Instruments Japan PCM2902 Audio CODEC
Bus 002 Device 005: ID 1d27:0600
Bus 002 Device 006: ID 05e3:0608 Genesys Logic, Inc. USB-2.0 4-Port HUB
Bus 002 Device 007: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
Bus 002 Device 008: ID 046d:c21f Logitech, Inc. F710 Wireless Gamepad [XInput Mode]
Bus 002 Device 009: ID 046d:c52b Logitech, Inc. Unifying Receiver
I sent an email to the MetraLabs people today to invite them to this GitHub repository, so they can work on this with us.
I have now invited @chmartin21 (Christian from MetraLabs) to our project so he can comment on these issues. @cdondrup, @gestom, @cburbridge, please keep collecting backtraces that originate from MIRA bugs here.
I tried to reproduce this issue, but when I disable USB3 support in the BIOS (as suggested here: https://github.com/strands-project/scitos_robot/wiki/Setup), devices connected to one of the blue ports are not recognized anymore.
At UoL we have gone back to the original wiring after we swapped the Asus that MetraLabs provided (USB id 0601) for one that worked better (id 0600), but that didn't change anything w.r.t. the CAN bus issue. That one only occurs rarely: we had days without a problem and then two crashes in a day. We actually leave the robot running for quite a long time. I suspected at some point that the NTP updates might be to blame. The computer gets its time updated via the Internet, and the internal clock doesn't seem to be very precise, so it can happen that the time is actually set to something that is a few ms in the past?! Not sure though, but some error messages hinted in this direction...
This happens if you run the scitos_bringup when the head motors are not powered:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<mira::XRPC>'
what(): The property 'Head.EyeLidLeft' could not be found in '/robot/Robot' (/home/chmartin/release/MIRA/framework/src/fw/Authority.C:740)
[ERROR] [1376645255.026592116]: (MIRA) Aborted!
[ERROR ] 2013-Aug-16 10:27:34.381417 Aborted!
[CALLSTACK]:
#0 0x7fb989136425 in gsignal from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x7fb989139b8b in abort from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/stdlib/abort.c:91
#2 0x7fb989a8969d in __gnu_cxx::__verbose_terminate_handler() from libstdc++.so.6
#3 0x7fb989a87846 in from libstdc++.so.6
#4 0x7fb989a87873 in from libstdc++.so.6
#5 0x7fb989a8796e in from libstdc++.so.6
#6 0x642a52 in boost::exception_detail::clone_impl<mira::XRPC>::rethrow() const from scitos_node
#7 0x62301e in boost::rethrow_exception(boost::shared_ptr<boost::exception_detail::clone_base const> const&) from scitos_node
#8 0x62371e in boost::detail::future_object_base::wait(bool) from scitos_node
#9 0x630d17 in boost::detail::future_object<std::string>::get() from scitos_node
#10 0x62d86f in boost::unique_future<std::string>::get() from scitos_node
#11 0x628122 in mira::RPCFuture<std::string>::get() from scitos_node
#12 0x6209ef in ScitosModule::get_mira_param_(std::string) from scitos_node
#13 0x718fc2 in ScitosHead::publish_joint_state_actual() from scitos_node
#14 0x71fc0c in boost::_mfi::mf0<void, ScitosHead>::operator()(ScitosHead*) const from scitos_node
#15 0x71eeb0 in void boost::_bi::list1<boost::_bi::value<ScitosHead*> >::operator()<boost::_mfi::mf0<void, ScitosHead>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, ScitosHead>&, boost::_bi::list0&, int) from scitos_node
[scitos_node-1] process has died [pid 3083, exit code -6, cmd /opt/strands/strands_catkin_ws/devel/lib/scitos_mira/scitos_node __name:=scitos_node __log:=/localhome/strands/.ros/log/dae168ea-0655-11e3-89c2-843a4b7ee1a8/scitos_node-1.log].
log file: /localhome/strands/.ros/log/dae168ea-0655-11e3-89c2-843a4b7ee1a8/scitos_node-1*.log
I experienced the time update crash (which Marc mentioned) again:
[ WARN] [1376935684.422910385]: Saw a negative time change of -14.970286 seconds, clearing the tf buffer.
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<mira::XRPC>'
what(): Failed to read SDO node(0x08) index(0x2006) subindex (0x01) (/home/chmartin/release/MIRA-commercial/toolboxes/CAN/include/can/CANOpenSDOClient.h:134)
[ERROR] [1376935687.912609793]: (MIRA) Aborted!
[ERROR ] 2013-Aug-19 19:08:06.334509 Aborted!
[CALLSTACK]:
#0 0x7f5de45d2425 in gsignal from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x7f5de45d5b8b in abort from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/stdlib/abort.c:91
#2 0x7f5de4f2569d in __gnu_cxx::__verbose_terminate_handler() from libstdc++.so.6
#3 0x7f5de4f23846 in from libstdc++.so.6
#4 0x7f5de4f23873 in from libstdc++.so.6
#5 0x7f5de4f2396e in from libstdc++.so.6
#6 0x642a52 in boost::exception_detail::clone_impl<mira::XRPC>::rethrow() const from scitos_node
#7 0x62301e in boost::rethrow_exception(boost::shared_ptr<boost::exception_detail::clone_base const> const&) from scitos_node
#8 0x62371e in boost::detail::future_object_base::wait(bool) from scitos_node
#9 0x630d17 in boost::detail::future_object<std::string>::get() from scitos_node
#10 0x62d86f in boost::unique_future<std::string>::get() from scitos_node
#11 0x628122 in mira::RPCFuture<std::string>::get() from scitos_node
#12 0x6209ef in ScitosModule::get_mira_param_(std::string) from scitos_node
#13 0x718fc2 in ScitosHead::publish_joint_state_actual() from scitos_node
#14 0x71fc0c in boost::_mfi::mf0<void, ScitosHead>::operator()(ScitosHead*) const from scitos_node
#15 0x71eeb0 in void boost::_bi::list1<boost::_bi::value<ScitosHead*> >::operator()<boost::_mfi::mf0<void, ScitosHead>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, ScitosHead>&, boost::_bi::list0&, int) from scitos_node
[scitos_node-1] process has died [pid 7046, exit code -6, cmd /opt/strands/strands_catkin_ws/devel/lib/scitos_mira/scitos_node __name:=scitos_node __log:=/localhome/strands/.ros/log/dae168ea-0655-11e3-89c2-843a4b7ee1a8/scitos_node-1.log].
log file: /localhome/strands/.ros/log/dae168ea-0655-11e3-89c2-843a4b7ee1a8/scitos_node-1*.log
The 15-second difference accumulated over just a few (up to 5) minutes.
This happened a few minutes after the first crash:
[ WARN] [1376936526.488113216]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x5426 should 0x30ce
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<mira::XRPC>'
what(): Failed to read SDO node(0x08) index(0x2006) subindex (0x02) (/home/chmartin/release/MIRA-commercial/toolboxes/CAN/include/can/CANOpenSDOClient.h:134)
[WARNING ] 2013-Aug-19 19:22:05.687177 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x5426 should 0x30ce
[ WARN] [1376936527.057936482]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0xca00 should 0xca2a
[WARNING ] 2013-Aug-19 19:22:07.057901 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0xca00 should 0xca2a
[ WARN] [1376936527.058846249]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x3868 should 0x908b
[WARNING ] 2013-Aug-19 19:22:07.058825 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x3868 should 0x908b
[ERROR] [1376936529.927020905]: (MIRA) Aborted!
[ERROR ] 2013-Aug-19 19:22:07.425553 Aborted!
[CALLSTACK]:
#0 0x7f60a4451425 in gsignal from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x7f60a4454b8b in abort from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/stdlib/abort.c:91
#2 0x7f60a4da469d in __gnu_cxx::__verbose_terminate_handler() from libstdc++.so.6
#3 0x7f60a4da2846 in from libstdc++.so.6
#4 0x7f60a4da2873 in from libstdc++.so.6
#5 0x7f60a4da296e in from libstdc++.so.6
#6 0x4978f2 in boost::exception_detail::clone_impl<mira::XRPC>::rethrow() const from scitos_node
#7 0x491fc4 in from scitos_node
at ScitosModule.cpp:0
#8 0x493398 in ScitosModule::get_mira_param_(std::string) from scitos_node
#9 0x538525 in ScitosHead::publish_joint_state_actual() from scitos_node
#10 0x4a8e08 in ScitosG5::spin() from scitos_node
#11 0x487df8 in main from scitos_node
#12 0x7f60a443c76d in __libc_start_main from /lib/x86_64-linux-gnu/libc.so.6
at /build/buildd/eglibc-2.15/csu/libc-start.c:226
#13 0x48961d in from scitos_node
Afterwards everything was fine again.
@chmartin21 Have you any idea about the CAN bus issues we are reporting here?
@chmartin21 have you any ideas about this bug? We are aiming to have 5 days of uninterrupted autonomy, and currently this is one of the show stoppers. As said above, we deem the problem to be related to timing, as it often happens when the robot gets an NTP update after reconnecting to the network.
Maybe we could try to deal with this error by allowing the NTP update only while the robot is on the charging station and the CAN traffic is low (nearly zero)? I guess we could even suspend the roscore during this time to make sure no CAN messages are exchanged.
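A minimal sketch of such a gate, assuming a hypothetical /charging topic of type std_msgs/Bool (the real charger-state topic on the SCITOS may differ) and ntpd controlled via the service command:

// ntp_gate.cpp - sketch: only let ntpd run while the robot is docked, so
// clock steps cannot coincide with heavy CAN traffic. The /charging topic
// (std_msgs/Bool) is a placeholder for the real charger-state message.
#include <cstdlib>
#include <ros/ros.h>
#include <std_msgs/Bool.h>

void chargingCb(const std_msgs::Bool::ConstPtr& msg)
{
  // Start NTP while on the charging station, stop it otherwise.
  std::system(msg->data ? "sudo service ntp start" : "sudo service ntp stop");
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "ntp_gate");
  ros::NodeHandle nh;
  ros::Subscriber sub = nh.subscribe("/charging", 1, chargingCb);
  ros::spin();
  return 0;
}

Suspending the roscore would still need separate handling; this only keeps clock steps away from periods of heavy CAN traffic.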
What about setting up a local ntp server on the robot?
At the moment, we don't have a real idea what might be the reason for this behavior. Have you tried working without NTP? Maybe the big negative time changes really are a problem.
Furthermore, I noticed that all backtraces here are coming from ScitosModule::get_mira_param_ when calling get() on the RPCFuture. Is there any exception handling implemented in this method? Is the source code available here? Where can I find it?
No exception handling. The code lives in the scitos_drivers repository: https://github.com/strands-project/scitos_drivers/blob/master/scitos_mira/src/ScitosModule.cpp. I use a mira::RPCFuture::timedWait - could this be confused by the NTP update?
If I add exception handling, I suppose the driver won't die when time updates happen. But what about the CRC warning on the CAN bus? Is it related?
The mira::RPCFuture uses a boost::unique_future, so I can't say whether an NTP update can confuse the timedWait.
I strongly recommend implementing exception handling (a sketch of what such a guard could look like follows below). Each RPC might fail for some reason, and checking for errors is always a good idea... ;-) For more details, please refer to the documentation: https://www.mira-project.org/MIRA-doc/ServiceRPCPage.html#ServicesException
About the CRC warning: How often does this warning occur?
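A minimal sketch of such a guard around the future that get_mira_param_ already creates; the header paths, the Duration argument to timedWait and the helper name are assumptions here, not the actual MIRA/driver API:

// Sketch: guard a MIRA RPC so a failed call is logged instead of
// terminating scitos_node with an unhandled mira::XRPC.
#include <string>
#include <ros/console.h>
#include <rpc/RPCFuture.h>   // MIRA header path assumed

inline std::string getMiraStringOrEmpty(mira::RPCFuture<std::string>& future,
                                        const std::string& param_name)
{
  try {
    future.timedWait(mira::Duration::seconds(1));  // bounded wait, as the driver already does
    return future.get();                           // throws (e.g. mira::XRPC) if the RPC failed
  } catch (const std::exception& e) {
    ROS_ERROR_STREAM("MIRA RPC for '" << param_name << "' failed: " << e.what());
    return std::string();                          // caller must tolerate an empty value
  }
}

get_mira_param_() could then return the empty string (or a sensible default) and let the caller decide how to degrade, instead of the whole node aborting.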
We turned off the NTP daemon on Linda two weeks ago and have not experienced any CAN crashes since then.
Also, @cburbridge, have you wrapped all the respective MIRA calls in try-catch statements? Disabling NTP is a bad idea in the long run once we run the system in a distributed fashion; the system should be stable even if it receives time updates.
@chmartin21 even with the error catching, there are still "Invalid CRC on MetraLabs CAN-to-Serial Adapter" errors. These occur when the time changes on the robot, and can easily be reproduced by changing it manually: sudo date -s "Mon Oct 28 14:31:30 GMT 2013".
Sometimes the errors don't stop; other times there are 2 or 3 invalid CRCs and then it is OK. Any idea how to solve this?
Hmm, the time changes (via NTP or manually) seem to affect the serial data transfer. In this case, the corrupted packets will be detected by means of the CRC. At the moment, I can't see a reason why the errors should not stop. We will try to reproduce and debug this issue.
I think this is not really an issue anymore. Does anyone still have devastating problems with CAN messages? A few messages here and there were still experienced on bob at G4S, but on the whole it was OK. @BFALacerda @cdondrup @RaresAmbrus @nilsbore @AlexanderHermans ?
It's OK on Rosie.
I think we still got that issue. Maybe @cdondrup and @Jailander can comment on this?
Yes, we still have that problem once in a while; I will post it next time I see it.
We have had this problem again and again on Werner@AAF these days. It's currently the major source of failures at AAF.
If the driver node dies, would respawn=true work around the issue? Otherwise we could adapt the MIRA bridge so that if no odometry is published for longer than a threshold, it exits and respawns (a sketch of such a watchdog is below). All other components would need to be resilient to a break in /odom etc.
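A rough sketch of that watchdog as a separate node (the /odom topic, the /scitos_node name and the 5-second threshold are assumptions); it kills the driver when odometry goes stale and relies on respawn="true" to bring it back:

// odom_watchdog.cpp - sketch: if no /odom arrives for stale_timeout seconds,
// kill the (assumed) /scitos_node so roslaunch respawn="true" restarts it.
#include <cstdlib>
#include <ros/ros.h>
#include <nav_msgs/Odometry.h>

ros::Time g_last_odom;

void odomCb(const nav_msgs::Odometry::ConstPtr&)
{
  g_last_odom = ros::Time::now();
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "odom_watchdog");
  ros::NodeHandle nh, pnh("~");
  double stale_timeout;
  pnh.param("stale_timeout", stale_timeout, 5.0);  // seconds, assumed default

  ros::Subscriber sub = nh.subscribe("/odom", 10, odomCb);
  g_last_odom = ros::Time::now();

  ros::Rate rate(1.0);
  while (ros::ok()) {
    ros::spinOnce();
    if ((ros::Time::now() - g_last_odom).toSec() > stale_timeout) {
      ROS_ERROR("No /odom for %.1f s, restarting the scitos driver", stale_timeout);
      std::system("rosnode kill /scitos_node");    // driver respawns via roslaunch
      g_last_odom = ros::Time::now();              // give the driver time to come back
    }
    rate.sleep();
  }
  return 0;
}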
We already put respawn=true, so it does respawn (but not always, as it does not always die when this happens). But after a respawn, AMCL is completely off and needs to be re-initialised, so it's not a solution. Any idea why this could be the case? We don't know why it then gets lost, to be honest.
So, they are not resilient...
When the scitos node stops, the laser scans still come but the odometry no longer arrives, so you should get "Couldn't determine robot's pose associated with laser scan" error messages until the scitos node comes back. If you don't get this error, then either the odometry TF is still being published by the scitos node, or there is a bug somewhere.
I assume you do get that error (although I did not test, just reading the AMCL code...). So if the pose is messed up before the scitos node returns after dying, then possibly the odometry TF is reported incorrectly as the node dies. Alternatively, if it is screwed up when the scitos node comes back, then there may be a few bad initial odometry messages coming from MIRA as the base gets started. I'll try and look at this on Betty tomorrow.
The problem is that the scitos node resets the odometry whenever it starts, which screws up AMCL. The fix is easy: https://github.com/strands-project/scitos_drivers/compare/indigo-devel...cburbridge:patch-1 will stop the respawning from screwing things up.
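The gist of such a fix, sketched with placeholder names (the real change is in the linked branch; the parameter and the helper below are hypothetical): make the odometry reset at start-up optional instead of unconditional.

// Sketch with placeholder names: only clear the base odometry when
// explicitly asked to, so a respawned driver does not yank the odom
// frame out from under AMCL.
#include <ros/ros.h>

void resetMiraOdometry();  // hypothetical helper wrapping the MIRA odometry-reset RPC

void initialiseOdometry(ros::NodeHandle& pnh)
{
  bool reset_on_start;
  pnh.param("reset_odometry_on_start", reset_on_start, false);  // hypothetical parameter, default: keep odometry

  if (reset_on_start)
    resetMiraOdometry();
  // otherwise keep whatever odometry the base already reports,
  // so the odom->base transform stays continuous across a respawn
}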
I just experienced the scitos node crash now as well and had the same problem with AMCL.
see fix above.
Partially fixed via https://github.com/strands-project/scitos_drivers/pull/111 (if the node dies and respawns, it shouldn't screw up the odometry anymore). I'll leave this open, as the problem itself still persists.
Here's another instance of this crash:
[INFO] [WallTime: 1432815459.190434] going to (-3.14159265359, 0.174532925199)
SerialCommS300: Checksum's dont match, thats bad (data packet size 1104)
SerialCommS300: Checksum's dont match, thats bad (data packet size 1104)
[ WARN] [1432816237.333813256]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x1af should 0x9155
[WARNING ] 2015-May-28 14:30:37.314299 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x1af should 0x9155
[ WARN] [1432816238.772981854]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x7cc should 0x243b
[WARNING ] 2015-May-28 14:30:38.772946 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x7cc should 0x243b
[ WARN] [1432816238.775885659]: (MIRA) Invalid message length on MetraLabs CAN-to-Serial Adapter
[WARNING ] 2015-May-28 14:30:38.775860 Invalid message length on MetraLabs CAN-to-Serial Adapter
[ WARN] [1432816238.778063397]: (MIRA) Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x3925 should 0xbe4a
[WARNING ] 2015-May-28 14:30:38.778029 Invalid CRC on MetraLabs CAN-to-Serial Adapter. Is 0x3925 should 0xbe4a
SerialCommS300: Checksum's dont match, thats bad (data packet size 1104)
SerialCommS300: Checksum's dont match, thats bad (data packet size 1104)
[ERROR] [1432816258.545004546]: (MIRA) /robot/can/CANDriver Recovering: : Incoming message watchdog triggered. No incoming messages.
[ERROR ] 2015-May-28 14:30:58.319955 /robot/can/CANDriver Recovering: : Incoming message watchdog triggered. No incoming messages.
[ WARN] [1432816260.789890110]: Missing head angle publication as MIRA parameter error.
terminate called after throwing an instance of 'mira::XRPC'
what(): The requested service '/robot/Robot' is not known. (/opt/mira-scitos/base/src/rpc/RPCManager.C:459)
[ERROR] [1432816264.260933385]: (MIRA) Aborted!
[ERROR ] 2015-May-28 14:31:03.417341 Aborted!
[CALLSTACK]:
#0 0x7faf3a973cc9 in gsignal from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7faf3a9770d8 in abort from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7faf3b27e6b5 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv from libstdc++.so.6
#3 0x7faf3b27c836 in from libstdc++.so.6
#4 0x7faf3b27c863 in from libstdc++.so.6
#5 0x7faf3b27caf6 in __cxa_rethrow from libstdc++.so.6
#6 0x7faf3d0d191b in _ZN3ros13CallbackQueue9callOneCBEPNS0_3TLSE from /opt/ros/indigo/lib/libroscpp.so
#7 0x7faf3d0d2583 in _ZN3ros13CallbackQueue13callAvailableENS_12WallDurationE from /opt/ros/indigo/lib/libroscpp.so
#8 0x7faf3d116354 in _ZN3ros16AsyncSpinnerImpl10threadFuncEv from /opt/ros/indigo/lib/libroscpp.so
#9 0x7faf3c3a3a4a in from libboost_thread.so.1.54.0
#10 0x7faf3c182182 in from libpthread.so.0
#11 0x7faf3aa3747d in clone from /lib/x86_64-linux-gnu/libc.so.6
[scitos_node-1] process has died [pid 2156, exit code -6, cmd /home/strands/aaf_ws/devel/lib/scitos_mira/scitos_node __name:=scitos_node __log:=/home/strands/.ros/log/dac95480-0095-11e5-bb61-00032d225881/scitos_node-1.log].
log file: /home/strands/.ros/log/dac95480-0095-11e5-bb61-00032d225881/scitos_node-1*.log
CAN bus tends to crash: