personalrobotics / owd

OpenWAM ROS driver for controlling the Barrett WAM and BarrettHand.

pcan write buffer full crash #2

Open jeking04 opened 9 years ago

jeking04 commented 9 years ago

We have seen this twice today.

Here is the output of cat /proc/pcan:

[HYDRO] prdemo@herb0:~/ros-hydro/src/table_clearing/src/table_clearing_demo$ cat /proc/pcan
*------------- PEAK-System CAN interfaces (www.peak-system.com) -------------
*------------- Release_20141219_n (7.14.0) Jan  9 2015 17:08:50 --------------
*---------------- [mod] [isa] [pci] [dng] [par] [usb] [pcc] -----------------
*--------------------- 4 interfaces @ major 249 found -----------------------
*n -type- -ndev- --base-- irq --btr- --read-- --write- --irqs-- -errors- status
 0    pci   -NA- f0600000 016 0x001c 00000000 00000000 00000000 00000000 0x0000
 1    pci   -NA- f0600400 016 0x0014 0034c499 001615f1 004ad957 00000002 0x0012
 2    pci   -NA- f0600800 016 0x001c 00000000 00000000 00000000 00000000 0x0000
 3    pci   -NA- f0600c00 016 0x001c 00000000 00000000 00000000 00000000 0x0000

mkoval commented 9 years ago

According to /usr/include/pcan.h, this error code is an overrun in receive buffer. This is a bit surprising because OWD prints that the write buffer is full:

#define CAN_ERR_OVERRUN        0x0002  // overrun in receive buffer

vandeweg commented 9 years ago

The “errors” column in /proc/pcan is a count of errors, not the error flag.

Mike
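
To read the counters in /proc/pcan as numbers rather than as flags, the rows can be parsed by column. The following is a minimal sketch, not part of OWD or the PEAK driver: the column names are taken from the header line in the output above, and reading the read, write, irqs, and errors columns as hexadecimal counters is an assumption based on the values shown there. The function names are hypothetical.

```python
# Sketch: parse the per-interface rows of /proc/pcan.
# Column names follow the header line of the output quoted in this thread.
# Assumption: the counter columns are hexadecimal, as the values suggest.
COLUMNS = ["n", "type", "ndev", "base", "irq", "btr",
           "read", "write", "irqs", "errors", "status"]
COUNTERS = ("read", "write", "irqs", "errors")


def parse_pcan_row(line):
    """Split one interface row into a dict, converting counters to int."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("unexpected column count: %r" % line)
    row = dict(zip(COLUMNS, fields))
    for name in COUNTERS:
        row[name] = int(row[name], 16)
    return row


def parse_pcan(text):
    """Parse every interface row; header lines start with '*'."""
    return [parse_pcan_row(line) for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("*")]


# Row for interface 1, copied from the output above.
sample = (" 1    pci   -NA- f0600400 016 0x0014 "
          "0034c499 001615f1 004ad957 00000002 0x0012")
row = parse_pcan_row(sample)
print("interface %s: %d errors over %d reads and %d writes"
      % (row["n"], row["errors"], row["read"], row["write"]))
```

On the robot, the same functions could be fed open("/proc/pcan").read() before and after a fault to see which counters changed.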

mkoval commented 9 years ago

I should have known that. I always mistakenly think it's an error code because it's printed in hex.

In any case, due to the bug I just fixed in 4a799a37d5c8b4668c7dbbecc7d783e409a7fd38, we don't really know if the arm is faulting due to a "write buffer full" error or not. The driver actually prints "write buffer full," with the corresponding error code, when any write error occurs.

jeking04 commented 9 years ago

We failed to capture the output from OWD yesterday. Here it is from a fault today:

[ INFO] [1422465567.586704246]: Position released by SetStiffness command
[ WARN] [1422465575.609377420]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422465575.609418891]: control_loop: request_positions failed
[ERROR] [1422465575.609443519]: Control loop finished
[FATAL] [1422465575.609469378]: Destroying class CANbus
[FATAL] [1422465575.609516053]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 3989) exited]

I will pull in the patch and post an update next time we see the failure.

mkoval commented 9 years ago

Error code 0x80 is CAN_ERR_QXMTFULL:

#define CAN_ERR_QXMTFULL       0x0080  // transmit queue full

This is indeed the error that triggers the bug fixed in 4a799a3. It will be good to see what the actual error code is, once we apply the patch.
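
As an illustration of reporting the actual code instead of a fixed "write buffer full" message, here is a small sketch that turns a CAN_Write return value into names. It is not the code from the commit mentioned above. Only the two flag values quoted from pcan.h in this thread are included, and treating the code as a bitmask is an assumption suggested by the power-of-two values; describe_can_error is a hypothetical helper name.

```python
# Only these two values are confirmed by the pcan.h excerpts quoted in this
# thread; the header defines further CAN_ERR_* flags that are not listed here.
CAN_ERR_FLAGS = {
    0x0002: "CAN_ERR_OVERRUN (overrun in receive buffer)",
    0x0080: "CAN_ERR_QXMTFULL (transmit queue full)",
}


def describe_can_error(code):
    """Return one description per bit set in code.

    Assumption: the error code is a bitmask, so several flags may be set.
    Bits missing from CAN_ERR_FLAGS are reported as unknown, not guessed.
    """
    if code == 0:
        return ["no error"]
    names = []
    bit = 1
    while bit <= code:
        if code & bit:
            names.append(CAN_ERR_FLAGS.get(bit, "unknown flag 0x%04x" % bit))
        bit <<= 1
    return names


print(describe_can_error(0x80))
```

Reporting unknown bits explicitly keeps an unexpected code visible in the log instead of hiding it behind a generic message.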

jeking04 commented 9 years ago

Here is an update:

[ INFO] [1422467849.426043494]: Position released by SetStiffness command
[ WARN] [1422467850.240028258]: CANbus::request_property: send failed: CAN_Write failed (0x80): write buffer full?
[ WARN] [1422467850.240106553]: Failed to request THERM from hand pucks: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422467850.240139752]: control_loop: request_hand_state_rt failed
[ERROR] [1422467850.240191985]: Control loop finished
[FATAL] [1422467850.240226960]: Destroying class CANbus
[FATAL] [1422467850.240276748]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5958) exited]

jeking04 commented 9 years ago

Capturing more to keep a record as we debug:

[ INFO] [1422480267.232514099]: Added trajectory 75c6c33a
[ WARN] [1422480268.581974133]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422480268.582012020]: control_loop: request_positions failed
[ERROR] [1422480268.582041021]: Control loop finished
[FATAL] [1422480268.582077779]: Destroying class CANbus
[FATAL] [1422480268.582118652]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 9090) exited]

vandeweg commented 9 years ago

I'm pretty certain these errors are tied to hardware failures on the Peak card. Did Mike1 ever replace the Peak card? I think he had been waiting for the WAMs to get back from Barrett. I've always been very suspicious of the first four-port card he was using when he did the big Herb upgrade.

Mike

mkoval commented 9 years ago

I switched to the new CAN card a few weeks ago, after we got the right arm back from Barrett. I was careful to avoid ESD damage while installing it. I also covered the exposed contacts on the top of the card with Kapton tape (like Mike1 did with the previous card) since there wasn't much clearance with the top of the case.

This card has only been used with the repaired arm and the head. There is no way that it could have been damaged by the electrical problems with HERB's left arm.

mkoval commented 9 years ago

To be clear, this is exactly the same model number as the old card. It's just a drop-in replacement.

jeking04 commented 9 years ago

[ INFO] [1422491370.019571923]: Trajectory 3352255a has finished; new reference position is [ 3.6800 -1.9000 0.0000 2.2000 0.0000 0.0000 0.0000 ]
[ WARN] [1422491401.437639422]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.437778078]: control_loop: request_positions failed
[ERROR] [1422491401.437857530]: Control loop finished
[FATAL] [1422491401.437927432]: Destroying class CANbus
[FATAL] [1422491401.438042734]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5649) exited]

jeking04 commented 9 years ago

Here is the log from the head, which faulted shortly after:

[ INFO] [1422490313.057404579]: Trajectory 19495cff has finished; new reference position is [ 0.0000 -0.3000 ]
[ WARN] [1422491401.499991089]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.500046098]: control_loop: request_positions failed
[ WARN] [1422491401.500203945]: CANbus::set_property: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.500234465]: Destroying class CANbus
[FATAL] [1422491401.500283184]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5175) exited]

mkoval commented 9 years ago

We just saw the arm die, then had the head die about 60 ms later (per the log timestamps above) with the same error. This would be consistent with it being an issue with the CAN card or the PEAK kernel driver.

@vandeweg Do you have any other ideas/suggestions that we can try?
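
The gap between the two faults can be measured directly from the log timestamps. The sketch below assumes the console format seen in the logs above ([LEVEL] [seconds.nanoseconds]: message) and that both OWD processes stamp their output from the same clock; first_write_failure is a hypothetical helper, not part of OWD.

```python
import re
from decimal import Decimal

# Matches the console format seen in the logs in this thread:
# [LEVEL] [seconds.nanoseconds]: message
LOG_LINE = re.compile(r"^\[\s*(\w+)\]\s+\[(\d+\.\d+)\]:\s+(.*)$")


def first_write_failure(log_text):
    """Return the timestamp of the first 'CAN_Write failed' line, or None."""
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and "CAN_Write failed" in match.group(3):
            # Decimal keeps the nanosecond digits that a float would round.
            return Decimal(match.group(2))
    return None


# First failure lines copied from the arm and head logs above.
arm_log = ("[ WARN] [1422491401.437639422]: CANbus::request_positions: "
           "send failed: CAN_Write failed (0x80): write buffer full?")
head_log = ("[ WARN] [1422491401.499991089]: CANbus::request_positions: "
            "send failed: CAN_Write failed (0x80): write buffer full?")

gap = first_write_failure(head_log) - first_write_failure(arm_log)
print("head faulted %.1f ms after the arm" % (float(gap) * 1000))
```

Passing complete log files instead of single lines works the same way, since only the first matching line in each is used.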

vandeweg commented 9 years ago

Darn, I was really hoping it was that particular card, but I guess not. We didn’t see these kinds of failures when we were using the Peak PCMCIA card on the Dell laptops, so I always suspected the new card. But the reality is that a lot of things have changed: platform (including BIOS), OS, and device driver version. I don’t really know where you should start. When Mike1 ran the arms for a while with a 2-channel Peak card on a desktop machine he didn’t have any errors, so I guess you could start with the 2-channel card in the onboard computer and see if it makes a difference.

One thing about Peak is they make heavy use of the CPU interrupts to get data out of their cards. That seemed like it was a problem on Chimp, where we eventually abandoned Peak altogether. One thing to consider would be biting the bullet and switching to a CANbus interface from a different manufacturer. We’ve been happy with the IXXAT cards on Chimp, and even though the API is quite different I could probably get permission to share the code.

Mike
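
One way to check how much interrupt traffic the card generates is to sample /proc/interrupts for the IRQ that /proc/pcan reports (016 in the output earlier in this thread). This is a rough sketch, assuming the usual Linux layout of a CPU header row followed by "IRQ: count count ... description" rows. The sample text below is made up for illustration, including the counts and the "pcan" device label; irq_counts is a hypothetical helper.

```python
def irq_counts(interrupts_text, irq):
    """Return the per-CPU interrupt counts for one IRQ number, or None."""
    lines = interrupts_text.splitlines()
    if not lines:
        return None
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == str(irq):
            return [int(value) for value in rest.split()[:n_cpus]]
    return None


# Hypothetical sample; on a real machine read open("/proc/interrupts").read().
sample_interrupts = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:       1200         34   IO-APIC-fasteoi   pcan
NMI:          0          0   Non-maskable interrupts
"""

print(irq_counts(sample_interrupts, 16))
```

Calling it twice a known interval apart and subtracting the totals gives an interrupts-per-second estimate that could be compared between the 4-port and 2-port cards.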

jeking04 commented 9 years ago

Since I have been the bearer of bad news for the better part of the day, I thought I would end with some encouragement. We have run the demo many, many times and have seen only one fault in the last five hours.