sebi2k1 / node-can

NodeJS SocketCAN extension

messages are not received when can bus gets into warning state #49

Closed kuebk closed 5 years ago

kuebk commented 5 years ago

Hello,

When I send a CAN frame and the bus goes into the warning state (TX error counter = 119), the onMessage callback is not invoked. I can see the TX/RX frames with candump, but I cannot handle them with socketcan.

Can you help me with this issue?

Thanks.

sebi2k1 commented 5 years ago

That sounds odd; socketcan does not consider the bus state of the controller at all. Did you try the candump JavaScript example? Can you provide more information about the setup?

kuebk commented 5 years ago

My setup is node@0.12.15 + socketcan@2.2.2; the hardware is an 8devices USB2CAN running under Ubuntu 12. My script logic is pretty much the same as the one you posted, as I can't develop it any further with this issue.

The same development under Windows works as intended.

Let me know if I can provide any more details.

sebi2k1 commented 5 years ago

Well, actually I think that if you can run into the warning state, there is a serious issue with the topology anyhow (stub lengths, termination issues, interference, or electrical issues on some nodes).

Anyhow, can you confirm that a candump running in parallel still works while socketcan has stopped reporting messages?

I double-checked the documentation and cannot find an explanation why socketcan should stop receiving frames if the controller is in the warning state. Unfortunately, I do not have real hardware available, so I cannot check that behaviour.

juleq commented 5 years ago

@sebi2k1 I have HW here and I just tested the following:

Setup:
- Raspberry + latest Raspbian + PiCAN Duo Iso
- node 10.15.0 + socketcan 2.4.0
- My application that streams live bus data to a web browser

Test:
- Application generates 20% bus load per channel
- Channels connected to each other
- Looking at raw data stream...
- Shorted high and low
- Stream stopped, as expected
- Opened the short again
- Bus does not recover
- Killed application
- Restarted application
- Bus does not recover
- Killed application again
- Did ip link set can0 down
- Did ip link set can0 up
- Did same for can1
- Started application
- Bus did recover, bytes streaming by...

I do not know whether the underlying socketcan Linux driver already has an asynchronous mechanism for receiving bus events. If not, I guess it would be necessary to poll the bus state via ioctl in order to implement notifications for the application level? I also do not know whether there is a less brutal method than down/up to recover from bus off, bus heavy, and the like.

If you want me to try anything with my setup here, please advise.

kuebk commented 5 years ago

> Well, actually I think that if you can run into the warning state, there is a serious issue with the topology anyhow (stub lengths, termination issues, interference, or electrical issues on some nodes).

I'm writing a bootloader that has to be sent over CAN; maybe that is why the bus gets into the warning state. But that doesn't explain why it works under Windows.

> Anyhow, can you confirm that a candump running in parallel still works while socketcan has stopped reporting messages?

candump from the can-utils package shows the CAN frames correctly.

> I double-checked the documentation and cannot find an explanation why socketcan should stop receiving frames if the controller is in the warning state. Unfortunately, I do not have real hardware available, so I cannot check that behaviour.

I didn't have much time to look at the C code, but from basic debugging it seems to be related to the ioloop somehow, as the ioloop callback doesn't get called.

juleq commented 5 years ago

What were the command-line options of your candump invocation?

sebi2k1 commented 5 years ago

> @sebi2k1 I have HW here and I just tested the following: ... I do not know whether the underlying socketcan Linux driver already has an asynchronous mechanism for receiving bus events. If not, I guess it would be necessary to poll the bus state via ioctl in order to implement notifications for the application level? I also do not know whether there is a less brutal method than down/up to recover from bus off, bus heavy, and the like.

By shorting CAN_L and CAN_H you are actually forcing the transmitting controller into bus off pretty quickly, as it doesn't even get bus arbitration. Anyhow, when entering bus off, the controller enters a special state which needs to be handled correctly.

Every failed transmission attempt and every bus state change generates a so-called error frame. These pop up via the onMessage callback, with the message having a parameter "err" set to true. The error frame type and information are encoded in the CAN id and payload.

Snippet from include/linux/can/error.h:

/* error class (mask) in can_id */
#define CAN_ERR_TX_TIMEOUT   0x00000001U /* TX timeout (by netdevice driver) */
#define CAN_ERR_LOSTARB      0x00000002U /* lost arbitration    / data[0]    */
#define CAN_ERR_CRTL         0x00000004U /* controller problems / data[1]    */
#define CAN_ERR_PROT         0x00000008U /* protocol violations / data[2..3] */
#define CAN_ERR_TRX          0x00000010U /* transceiver status  / data[4]    */
#define CAN_ERR_ACK          0x00000020U /* received no ACK on transmission */
#define CAN_ERR_BUSOFF       0x00000040U /* bus off */
#define CAN_ERR_BUSERROR     0x00000080U /* bus error (may flood!) */
#define CAN_ERR_RESTARTED    0x00000100U /* controller restarted */
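
As a rough illustration of how these class bits could be read in JavaScript, here is a minimal sketch. It assumes the message object has the shape `{ id, err, data }`; only the "err" parameter is confirmed in this thread, and the names `ERROR_CLASS_BITS` and `decodeErrorClasses` are made up for the example.

```javascript
// Error-class bits, copied from the linux/can/error.h snippet above.
const ERROR_CLASS_BITS = {
  TX_TIMEOUT: 0x001,
  LOSTARB: 0x002,
  CRTL: 0x004,
  PROT: 0x008,
  TRX: 0x010,
  ACK: 0x020,
  BUSOFF: 0x040,
  BUSERROR: 0x080,
  RESTARTED: 0x100,
};

// Returns the names of all error classes set in the CAN id of an error frame.
// Assumption: msg looks like { id: <number>, err: <boolean>, data: <Buffer> }.
// Non-error frames yield an empty list.
function decodeErrorClasses(msg) {
  if (!msg.err) return [];
  return Object.keys(ERROR_CLASS_BITS).filter(
    (name) => (msg.id & ERROR_CLASS_BITS[name]) !== 0
  );
}
```

Several classes can be set in one frame, which is why the helper returns a list rather than a single name.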

> If you want me to try anything with my setup here, please advise.

There is actually not a lot to be done in node-can itself, as the socket API does not support recovery; that is done via the netlink interface. What we can do is improve the API towards the user to make it easier to get events like bus off.

If you want to recover from the bus off state, you simply need to configure the device to recover automatically via the "restart" parameter provided by the ip utility. Check the help for more info:

$ ip link set can0 type can help

sebi2k1 commented 5 years ago

> > Well, actually I think that if you can run into the warning state, there is a serious issue with the topology anyhow (stub lengths, termination issues, interference, or electrical issues on some nodes).

> I'm writing a bootloader that has to be sent over CAN; maybe that is why the bus gets into the warning state. But that doesn't explain why it works under Windows.

That is not the cause; CAN does not work like that. Please take a multimeter and measure the resistance between CAN_H and CAN_L. The bus should be terminated with 120 ohm at each end, which should result in a resistance of 60 ohm. This way you avoid reflections interfering with the communication. Some boards even have built-in termination resistors, so measure first before adding resistors.
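
The 60 ohm figure is simply two 120 ohm terminators in parallel. A small sketch of that arithmetic, with a hypothetical helper name (`parallelResistance`) and idealized resistors (cable and transceiver resistance ignored):

```javascript
// Equivalent resistance, in ohms, of resistors connected in parallel.
function parallelResistance(resistors) {
  const conductance = resistors.reduce((sum, r) => sum + 1 / r, 0);
  return 1 / conductance;
}

// Two terminators, one at each end of the bus: about 60 ohm.
const correctlyTerminated = parallelResistance([120, 120]);
// A third terminator (for example a built-in one left enabled): about 40 ohm.
const overTerminated = parallelResistance([120, 120, 120]);
```

A reading close to 120 ohm therefore suggests that one terminator is missing, and a reading close to 40 ohm suggests one too many.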

You can also leave candump running in parallel and filter for error frames only; this should give more insight into what goes wrong.

> > Anyhow, can you confirm that a candump running in parallel still works while socketcan has stopped reporting messages?

> candump from the can-utils package shows the CAN frames correctly.

> > I double-checked the documentation and cannot find an explanation why socketcan should stop receiving frames if the controller is in the warning state. Unfortunately, I do not have real hardware available, so I cannot check that behaviour.

> I didn't have much time to look at the C code, but from basic debugging it seems to be related to the ioloop somehow, as the ioloop callback doesn't get called.

That sounds reasonable, as the controller is likely not forwarding any frames anymore; I guess it has even entered bus off already. Once you reach that state, please provide the output of:

$ ip -details -statistics link show can0

juleq commented 5 years ago

Thank you for your insights. Since handling the error states should not be performance-critical, I guess it would be fine to dissect the error frames in JS. I'd see two options for that: leave it to the user, or pull it into socketcan.js and offer an error event. If I find a moment, I will experiment with that.

I guess that to access the counter values I would have to ask for them on the netlink interface myself?

juleq commented 5 years ago

I have set the restart-ms parameter to 100 and dumped the error frames in onMessage. It works exactly as @sebi2k1 pointed out:

- Loads of dumped error frames while shorted
- Automatic bus recovery when the short is lifted

If the multimeter does not deliver an answer, I'd probably check the oscillator configuration and the sample point next.

sebi2k1 commented 5 years ago

> Thank you for your insights. Since handling the error states should not be performance-critical, I guess it would be fine to dissect the error frames in JS. I'd see two options for that: leave it to the user, or pull it into socketcan.js and offer an error event. If I find a moment, I will experiment with that.

Even though the controller can generate a ton of error frames, it is not performance-critical, as it is a very rare condition.

> I guess that to access the counter values I would have to ask for them on the netlink interface myself?

Correct, netlink supports it. The error frames reported via the socket API may also contain controller-specific bits and pieces, but that is not a generic approach.

I think that, for a start, it might be helpful to just indicate the warning, passive, and bus off levels:

/* error class (mask) in can_id */
...
#define CAN_ERR_CRTL         0x00000004U /* controller problems / data[1]    */
...
#define CAN_ERR_BUSOFF       0x00000040U /* bus off */
...
/* error status of CAN-controller / data[1] */
...
#define CAN_ERR_CRTL_RX_WARNING  0x04 /* reached warning level for RX errors */
#define CAN_ERR_CRTL_TX_WARNING  0x08 /* reached warning level for TX errors */
#define CAN_ERR_CRTL_RX_PASSIVE  0x10 /* reached error passive status RX */
#define CAN_ERR_CRTL_TX_PASSIVE  0x20 /* reached error passive status TX */
                      /* (at least one error counter exceeds */
                      /* the protocol-defined level of 127)  */
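
A minimal sketch of how these three levels could be derived from a single error frame in JavaScript. It assumes the message shape `{ id, err, data }` with `data` indexable by byte; the function name `errorLevel` and the level strings are invented for the example and are not part of node-can.

```javascript
// Values copied from the header excerpt above.
const CLASS_CRTL = 0x004;   // controller problems, details in data[1]
const CLASS_BUSOFF = 0x040; // bus off
const CRTL_RX_WARNING = 0x04;
const CRTL_TX_WARNING = 0x08;
const CRTL_RX_PASSIVE = 0x10;
const CRTL_TX_PASSIVE = 0x20;

// Maps one error frame to 'bus-off', 'passive', 'warning', or null when the
// frame carries none of these levels (or is not an error frame at all).
function errorLevel(msg) {
  if (!msg.err) return null;
  if (msg.id & CLASS_BUSOFF) return 'bus-off';
  if (msg.id & CLASS_CRTL) {
    const status = msg.data[1];
    if (status & (CRTL_RX_PASSIVE | CRTL_TX_PASSIVE)) return 'passive';
    if (status & (CRTL_RX_WARNING | CRTL_TX_WARNING)) return 'warning';
  }
  return null;
}
```

The most severe level is checked first, so a frame that reports both a warning and a passive bit is classified as passive.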

juleq commented 5 years ago

I assume that the error frames do not indicate when the bus recovers. My approach would be to latch an error state for, say, one second, so that it vanishes again when the bus is back to normal.

In a production system, these bus errors are very infrequent once the commissioning phase has been concluded.

One of my use cases for socketcan is supporting EOL testing (small series or prototypes) and commissioning, and there it is quite common for somebody to forget the termination or swap H and L. So indicating the respective error and recovering automatically will spare me a load of phone calls :). Thank you.

sebi2k1 commented 5 years ago

Most of the error frames are actually just error interrupts of the underlying CAN controller, so the error frames themselves can be used to detect the transitions warning -> passive -> bus off. You can also use them to detect changes during recovery from passive -> warning -> normal. The only problem I see is bus off, but SocketCAN reports CAN_ERR_RESTARTED to indicate a restart (if triggered automatically), so the controller is back to normal and it starts all over again. So I would say there is no need to poll the bus state. Not even netlink is required, unless you want to explicitly show the RX/TX error counters of the controller.
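
The transitions described here could be tracked with a small state machine fed from the error frames. The following is only a sketch under assumptions: the message shape `{ id, err, data }`, the hypothetical names `createBusStateTracker` and `feed`, and the `CRTL_ACTIVE` bit (0x40, "recovered to error active" in linux/can/error.h), which is not quoted in this thread and should be checked against your kernel headers. Whether a given driver actually emits a frame for every transition is also not guaranteed.

```javascript
const BIT = {
  CLASS_CRTL: 0x004,       // controller problems, details in data[1]
  CLASS_BUSOFF: 0x040,     // bus off
  CLASS_RESTARTED: 0x100,  // controller restarted
  CRTL_WARNING: 0x04 | 0x08, // RX or TX warning level reached
  CRTL_PASSIVE: 0x10 | 0x20, // RX or TX error passive reached
  CRTL_ACTIVE: 0x40,         // assumption: recovered to error active
};

// Returns a function that consumes messages and calls onChange(prev, next)
// whenever the derived bus state changes.
function createBusStateTracker(onChange) {
  let state = 'active';
  return function feed(msg) {
    if (!msg.err) return state; // ordinary data frames do not change the state
    let next = state;
    if (msg.id & BIT.CLASS_BUSOFF) {
      next = 'bus-off';
    } else if (msg.id & BIT.CLASS_RESTARTED) {
      next = 'active';
    } else if (msg.id & BIT.CLASS_CRTL) {
      const status = msg.data[1];
      if (status & BIT.CRTL_PASSIVE) next = 'passive';
      else if (status & BIT.CRTL_WARNING) next = 'warning';
      else if (status & BIT.CRTL_ACTIVE) next = 'active';
    }
    if (next !== state) {
      const prev = state;
      state = next;
      onChange(prev, next);
    }
    return state;
  };
}
```

Because the state only changes when a frame says so, no timer-based latching and no polling is involved, which matches the reasoning above.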

Depending on the controller/transceiver, you may also check the transceiver state (see CAN_ERR_TRX). It can detect short circuits or open wires, which may help during commissioning.

If you just leave the wires open and send a message, the controller should enter error passive, so that is a good test case.

kuebk commented 5 years ago

> What were the command-line options of your candump invocation?

kuebk@czubix: ~/can/can-utils (master) $ ./candump -t A -c -a -d -e -x can0
 (2019-01-22 18:45:42.912147) can0 TX B E 123 [8] 55 55 00 01 9C 01 00 02 'UU......'
 (2019-01-22 18:45:42.916775) can0 RX - - 040 [4] 41 3E 00 01 'A>..'

> That is not the cause; CAN does not work like that. Please take a multimeter and measure the resistance between CAN_H and CAN_L. The bus should be terminated with 120 ohm at each end, which should result in a resistance of 60 ohm. This way you avoid reflections interfering with the communication. Some boards even have built-in termination resistors, so measure first before adding resistors.

If the problem were in my hardware setup, it would cause exactly the same issues no matter whether I work on Windows or Linux.

> That sounds reasonable, as the controller is likely not forwarding any frames anymore; I guess it has even entered bus off already. Once you reach that state, please provide the output of: $ ip -details -statistics link show can0

kuebk@czubix: ~/can/can-utils (master) $ ip -details -statistics link show can0
32: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
    link/can
    can state ERROR-ACTIVE (berr-counter tx 119 rx 0) restart-ms 0
    bitrate 500000 sample-point 0.875
    tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    usb_8dev: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
    clock 32000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          2          0          2          1          0
    RX: bytes  packets  errors  dropped overrun mcast
    52         7        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    16         2        2       0       0       0
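
For reading the berr-counter values in this output, here is a rough sketch that maps the raw counters to the CAN fault-confinement states. The thresholds are the protocol-defined ones (warning from 96, error passive above 127, bus off above 255 on the TX counter); the function name `stateFromCounters` is hypothetical, and a driver may report its own state string differently from what the counters suggest.

```javascript
// Classifies the controller state from the TX/RX error counters alone.
function stateFromCounters(tx, rx) {
  if (tx > 255) return 'bus-off'; // only the TX counter leads to bus off
  if (tx > 127 || rx > 127) return 'error-passive';
  if (tx >= 96 || rx >= 96) return 'error-warning';
  return 'error-active';
}

// Counters from the output above: tx 119, rx 0.
const reported = stateFromCounters(119, 0);
```

With tx 119 the counters sit in the warning range, which is consistent with the "warning state (TX err = 119)" from the original report.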

juleq commented 5 years ago

The stats say that the controller has never been able to transmit either of the two frames to the bus. (Edit: dropped is zero, so that might have been a little quick. Nevertheless, the TX counter has been increased a few times and the controller has transitioned to error active. I do not think that you could achieve that using the socketcan Linux userland API, and that would rule out node-can.)

How many other nodes do you have on the bus? What do they look like (platform, interface)?

If you can rule out electrical bus issues and you have at least one further node (not set to listen only) to acknowledge, it should be able to transmit.

Are you able to transmit reliably using cansend? I would say do a few sends and a few receives without even starting your Node.js application and post the resulting statistics again. If they look funny, that leaves the interface's Linux driver and the bus hardware as the probable offenders, I'd say.

sebi2k1 commented 5 years ago

I agree with @juleq. There is indeed something wrong with your network. You exchanged just a handful of frames, yet the controller changed to error active.

Feel free to share the topology and the placement of the termination if you would like comments on it.

kuebk commented 5 years ago

> How many other nodes do you have on the bus? What do they look like (platform, interface)?

There is only one node on the bus; it's an Engine Control Unit based on the Infineon TriCore TC1796.

> Are you able to transmit reliably using cansend? I would say do a few sends and a few receives without even starting your Node.js application and post the resulting statistics again. If they look funny, that leaves the interface's Linux driver and the bus hardware as the probable offenders, I'd say.

34: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
    link/can
    can state ERROR-ACTIVE (berr-counter tx 119 rx 0) restart-ms 0
    bitrate 500000 sample-point 0.875
    tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
    usb_8dev: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
    clock 32000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          2          0          2          0          0
    RX: bytes  packets  errors  dropped overrun mcast
    23684      3037     0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    80         10       2       0       0       0

> Feel free to share the topology and the placement of the termination if you would like comments on it.

120 ohm at both ends, topology as above. I'm using exactly the same setup for regular CAN transmission and it works fine; the problem occurs when I put the ECU into bootloader mode, and it only happens under Linux. Under Windows I'm using CANAL.dll and the bus behaves exactly the same, but after sending the initial bootloader frame I'm able to receive the response frame in software.

juleq commented 5 years ago

Is it your bootloader? Does it change baud rates or bit timings? It is not unusual for bootloaders to use an immutable baud rate while the application usually uses a configurable one.

Your sample point appears to be 87.5% on Linux. Are you able to find out the sample point of your Windows setup?
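
For reference, the 87.5% can be reproduced from the bit-timing segments in the posted ip output. A minimal sketch, assuming the usual CAN bit layout of one sync-segment quantum followed by prop-seg, phase-seg1 and phase-seg2, with tq given in nanoseconds; the function name `bitTiming` is made up for the example.

```javascript
// Derives sample point and bitrate from the bit-timing values printed by ip.
function bitTiming({ tq, propSeg, phaseSeg1, phaseSeg2 }) {
  const SYNC_SEG = 1; // the sync segment is always one time quantum
  const quantaBeforeSample = SYNC_SEG + propSeg + phaseSeg1;
  const quantaPerBit = quantaBeforeSample + phaseSeg2;
  return {
    samplePoint: quantaBeforeSample / quantaPerBit,
    bitrate: 1e9 / (tq * quantaPerBit), // tq is in nanoseconds
  };
}

// Values from the output above: tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2.
const timing = bitTiming({ tq: 125, propSeg: 6, phaseSeg1: 7, phaseSeg2: 2 });
// timing.samplePoint is 0.875 and timing.bitrate is 500000
```

Comparing these numbers with the segment settings of the Windows driver, if they can be read out, would show whether the two setups really sample at the same position in the bit.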

kuebk commented 5 years ago

This MCU doesn't have a built-in bootloader; it allows you to upload your own bootloader via the BSL (bootstrap loader) over CAN directly into RAM and launch it. The process looks like this:

  1. run the MCU in BSL mode
  2. send the init frame
  3. send the bootloader
  4. run from RAM

The error happens after step 2 (sending the init frame). One more thing: under Windows the CAN bus gets into the error state too, but regardless of that I'm able to receive and send frames.

juleq commented 5 years ago

I would experiment with the sample point next. It is easy to do, and I could imagine that your Windows setup is sampling "on the edge" while your Linux setup is beyond it.

E.g. use something like this (insert your bitrate):

ip link set can0 up type can bitrate 125000 sample-point 0.875

Then move the sample point upwards and downwards and send messages. Observe the TX error counter after each change.

Care to share which MCU/ECU? You don't have to, I'm just curious :).

kuebk commented 5 years ago

It's an Infineon TriCore TC1796. I will try playing with the sample point, thanks. :)

juleq commented 5 years ago

A remark for their own CAN adapter states that this one will not allow High Speed CAN. Maybe others can be wonky, too?

The BSL detects the baud rate automatically. I suggest trying 125k for slower but more robust flashing.

https://www.infineon.com/dgdl/ap3213610_TriCore_AUDO_NG_Bootloader.pdf?fileId=db3a30431ddc9372011e29f5c22c4ca2

fabdrol commented 4 years ago

@sebi2k1 you wrote:

> Every failed transmission attempt and every bus state change generates a so-called error frame. These pop up via the onMessage callback, with the message having a parameter "err" set to true. The error frame type and information are encoded in the CAN id and payload.

How do I do that? I don't see that in the docs or samples? Can you provide an example?
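
A minimal sketch of what such a handler might look like, going only by the description quoted above. It assumes that the channel object offers `addListener('onMessage', fn)` as in the node-can README examples and that error frames arrive with `err` set to true; the helper name `onErrorFrame` is made up, and the wiring in the comment is untested.

```javascript
// Forwards only error frames from a channel to the given callback.
// Assumption: channel.addListener('onMessage', fn) delivers message objects
// with an `err` flag, an `id`, and a `data` payload.
function onErrorFrame(channel, callback) {
  channel.addListener('onMessage', (msg) => {
    if (msg.err) callback(msg);
  });
}

// Hypothetical wiring on a Linux host with the socketcan module installed:
//   const can = require('socketcan');
//   const channel = can.createRawChannel('can0', true);
//   onErrorFrame(channel, (msg) => {
//     console.log('error frame, id 0x' + msg.id.toString(16), msg.data);
//   });
//   channel.start();
```

Inside the callback, the id bits and data[1] can then be compared against the constants from linux/can/error.h quoted earlier in this thread.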