Open jeking04 opened 9 years ago
According to /usr/include/pcan.h, this error code is an overrun in receive buffer. This is a bit surprising because OWD prints that the write buffer is full:
#define CAN_ERR_OVERRUN 0x0002 // overrun in receive buffer
The “errors” column in /proc/pcan is a count of errors, not the error flag.
Mike
I should have known that. I always mistakenly think it's an error code because it's printed in hex.
In any case, due to the bug I just fixed in 4a799a37d5c8b4668c7dbbecc7d783e409a7fd38, we don't really know if the arm is faulting due to a "write buffer full" error or not. The driver actually prints "write buffer full," with the corresponding error code, when any write error occurs.
We failed to capture the output from OWD yesterday. Here it is from a fault today:
[ INFO] [1422465567.586704246]: Position released by SetStiffness command
[ WARN] [1422465575.609377420]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422465575.609418891]: control_loop: request_positions failed
[ERROR] [1422465575.609443519]: Control loop finished
[FATAL] [1422465575.609469378]: Destroying class CANbus
[FATAL] [1422465575.609516053]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 3989) exited]
I will pull in the patch and post an update next time we see the failure.
Error code 0x80 is CAN_ERR_QXMTFULL:
#define CAN_ERR_QXMTFULL 0x0080 // transmit queue full
This is indeed the error that triggers the bug fixed in 4a799a3. It will be good to see what the actual error code is, once we apply the patch.
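For reference, here is a minimal sketch of how a write failure could be reported so that the message names whichever status bits are actually set, instead of always saying "write buffer full". It is not the change made in 4a799a3. The helper name `describe_write_status` is hypothetical, and the only flag values used are the two quoted from pcan.h in this thread; any other bit is reported as unknown rather than guessed.

```c
#include <stdio.h>
#include <string.h>

/* Flag values as quoted in this thread from /usr/include/pcan.h.
 * They are redefined here only so the sketch compiles without the PCAN header. */
#define CAN_ERR_OVERRUN  0x0002 /* overrun in receive buffer */
#define CAN_ERR_QXMTFULL 0x0080 /* transmit queue full */

/* Hypothetical helper (not part of OWD or the PCAN API): format a CAN_Write
 * status word, naming each known bit instead of assuming "write buffer full". */
const char *describe_write_status(unsigned status, char *buf, size_t len)
{
    const unsigned known = CAN_ERR_OVERRUN | CAN_ERR_QXMTFULL;

    if (len == 0)
        return buf;

    /* snprintf always NUL-terminates when len > 0, so strlen(buf) <= len - 1. */
    snprintf(buf, len, "CAN_Write failed (0x%02x):", status);

    /* The status word is a bitmask, so more than one flag may be set at once. */
    if (status & CAN_ERR_QXMTFULL)
        strncat(buf, " QXMTFULL", len - strlen(buf) - 1);
    if (status & CAN_ERR_OVERRUN)
        strncat(buf, " OVERRUN", len - strlen(buf) - 1);
    if (status & ~known)
        strncat(buf, " UNKNOWN_BITS", len - strlen(buf) - 1);

    return buf;
}
```

Called with a 128-byte buffer and a status of 0x80, this produces "CAN_Write failed (0x80): QXMTFULL"; a status of 0x02 would instead be labeled OVERRUN.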
Here is an update:
[ INFO] [1422467849.426043494]: Position released by SetStiffness command
[ WARN] [1422467850.240028258]: CANbus::request_property: send failed: CAN_Write failed (0x80): write buffer full?
[ WARN] [1422467850.240106553]: Failed to request THERM from hand pucks: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422467850.240139752]: control_loop: request_hand_state_rt failed
[ERROR] [1422467850.240191985]: Control loop finished
[FATAL] [1422467850.240226960]: Destroying class CANbus
[FATAL] [1422467850.240276748]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5958) exited]
Capturing more to keep a record as we debug:
[ INFO] [1422480267.232514099]: Added trajectory 75c6c33a
[ WARN] [1422480268.581974133]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422480268.582012020]: control_loop: request_positions failed
[ERROR] [1422480268.582041021]: Control loop finished
[FATAL] [1422480268.582077779]: Destroying class CANbus
[FATAL] [1422480268.582118652]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 9090) exited]
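Since we are collecting these logs as a record, a small parser may help tally faults later. The sketch below assumes the line layout seen in the output pasted in this thread (a level tag, a bracketed timestamp, then "CAN_Write failed (0x..)"). The function name `parse_can_write_failure` is hypothetical and not part of OWD.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract the timestamp (seconds) and the CAN_Write
 * status from one OWD log line. Returns 1 on success, 0 if the line does not
 * look like a CAN_Write failure in the format pasted in this thread. */
int parse_can_write_failure(const char *line, double *stamp, unsigned *status)
{
    const char *marker = "CAN_Write failed (";
    const char *hit = strstr(line, marker);
    if (hit == NULL)
        return 0;

    /* The timestamp is the second bracketed field: "[ WARN] [1422480268.58...]". */
    const char *first_close = strchr(line, ']');
    if (first_close == NULL)
        return 0;
    const char *second_open = strchr(first_close, '[');
    if (second_open == NULL || second_open > hit)
        return 0;

    char *end = NULL;
    double parsed_stamp = strtod(second_open + 1, &end);
    if (end == second_open + 1)
        return 0; /* no digits where the timestamp should be */

    /* strtoul with base 16 accepts the leading "0x" printed by OWD. */
    const char *code = hit + strlen(marker);
    unsigned long parsed_status = strtoul(code, &end, 16);
    if (end == code)
        return 0; /* no hex digits inside the parentheses */

    *stamp = parsed_stamp;
    *status = (unsigned)parsed_status;
    return 1;
}
```

Feeding it each line of a saved log and counting the successful parses per status value gives a quick summary of which codes show up across faults.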
I'm pretty certain these errors are tied to hardware failures on the Peak card. Did Mike1 ever replace the Peak card? I think he had been waiting for the WAMs to get back from Barrett. I've always been very suspicious of the first four-port card he was using when he did the big Herb upgrade.
Mike
I switched to the new CAN card a few weeks ago, after we got the right arm back from Barrett. I was careful to avoid ESD damage while installing it. I also covered the exposed contacts on the top of the card with Kapton tape (like Mike1 did with the previous card) since there wasn't much clearance with the top of the case.
This card has only been used with the repaired arm and the head. There is no way that it could have been damaged by the electrical problems with HERB's left arm.
To be clear, this is exactly the same model number as the old card. It's just a drop-in replacement.
[ INFO] [1422491370.019571923]: Trajectory 3352255a has finished; new reference position is [ 3.6800 -1.9000 0.0000 2.2000 0.0000 0.0000 0.0000 ]
[ WARN] [1422491401.437639422]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.437778078]: control_loop: request_positions failed
[ERROR] [1422491401.437857530]: Control loop finished
[FATAL] [1422491401.437927432]: Destroying class CANbus
[FATAL] [1422491401.438042734]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5649) exited]
Here is the log from the head, which faulted shortly after:
[ INFO] [1422490313.057404579]: Trajectory 19495cff has finished; new reference position is [ 0.0000 -0.3000 ]
[ WARN] [1422491401.499991089]: CANbus::request_positions: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.500046098]: control_loop: request_positions failed
[ WARN] [1422491401.500203945]: CANbus::set_property: send failed: CAN_Write failed (0x80): write buffer full?
[FATAL] [1422491401.500234465]: Destroying class CANbus
[FATAL] [1422491401.500283184]: ControlLoop::stop: pthread_join failed.
[Thread 0x7fffee35f700 (LWP 5175) exited]
We just saw the arm die, then had the head die around 700 ms later with the same error. This would be consistent with it being an issue with the CAN card or the PEAK kernel driver.
@vandeweg Do you have any other ideas/suggestions that we can try?
Darn, I was really hoping it was that particular card, but I guess not. We didn’t see these kinds of failures when we were using the Peak PCMCIA card on the Dell laptops, so I always suspected the new card. But the reality is that a lot of things have changed: platform (including BIOS), OS, and device driver version. I don’t really know where you should start. When Mike1 ran the arms for a while with a 2-channel Peak card on a desktop machine he didn’t have any errors, so I guess you could start with the 2-channel card in the onboard computer and see if it makes a difference.
One thing about Peak is they make heavy use of the CPU interrupts to get data out of their cards. That seemed like it was a problem on Chimp, where we eventually abandoned Peak altogether. One thing to consider would be biting the bullet and switching to a CANbus interface from a different manufacturer. We’ve been happy with the IXXAT cards on Chimp, and even though the API is quite different I could probably get permission to share the code.
Mike
Since I have been the bearer of bad news for the better part of the day, I thought I would end with some encouragement. We have run the demo many, many times and have seen only one fault in the last 5 hours.
We have seen this twice today.
Here is the output of cat /proc/pcan