hartkopp / can-isotp

Linux Kernel Module for ISO 15765-2:2016 CAN transport protocol PLEASE NOTE: This module is part of the mainline Linux kernel since version 5.10
Other
239 stars 69 forks source link

Unrecoverable error when using python socket with CAN_ISOTP. #56

Closed tlalexander closed 1 year ago

tlalexander commented 1 year ago

Hello and thank you for your work on this kernel module. I appreciate being able to make a high quality system with it.

I have an issue that is probably a python issue, but I am hoping you may know something about it. I have a full write up at the following link: https://github.com/pylessard/python-can-isotp/issues/80

Basic details: I am using the python-can-isotp library as a helper library to create a connection to the socketcan interface on my raspberry pi CM4 with MCP2515 chip onboard. This kernel module is running on the raspberry pi.

Everything works fine with a short delay between new calls to send, but if I send data too quickly the python socket will eventually enter a condition where it starts exclusively throwing socket.timeout errors on every call to send, never sending another packet again. If I try to close the socket the program locks up. During this error condition I am still able to send frames over that same CAN connection using isotpsend from another terminal window even as the python program repeatedly throws socket.timeout errors.

If I create the socket with the WAIT_TX_DONE flag, then when this error occurs it will lock up my entire python program and I will have to kill the program from another terminal. The timeout is only 100 milliseconds so I don't understand why this would lock up, and if this is a bug or I am simply misusing it.

Since I can use isotpsend from another terminal window during this lockup, this shows me that the actual CAN link is fine. So the problem should be between your kernel module and the python socket. Thus I should probably ask the python-help list for assistance, but I worry that concerns about CAN bus may be too specialized, or that such an email would only get routed to you anyway.

Here is what the system says about the link when my python program is locked up:

acorn@acornv2:~$ ip -details -statistics link show can1
4: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 20000
    link/can  promiscuity 0 minmtu 0 maxmtu 0 
    can state ERROR-ACTIVE restart-ms 0 
      bitrate 500000 sample-point 0.875 
      tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
      mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
      clock 8000000 
      re-started bus-errors arbit-lost error-warn error-pass bus-off
      0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    RX: bytes  packets  errors  dropped missed  mcast   
    1509504    188688   0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    6484689    891876   0       0       0       0   

I would like to find out what is actually happening when it locks up, but I don't know how to debug that. I would also like to know how I can check the socket to confirm it is safe to send without having to hard code a wait, and finally I would like to know how to recover from this condition.

Any and all help is greatly appreciated. If I should take my question to python-help please let me know. Thank you!

hartkopp commented 1 year ago

Hi, can you please tell me the Linux kernel version you are working with (uname -a)? Is there anything visible in the kernel log, e.g. some WARN splat (see with dmesg)?

I have a rough idea what could have gone wrong.

One thing that is vital to the WAIT_TX_DONE flag: When the PDU has a problem to be send then the ISO15765-2 timer usually catches the timeout (e.g. missing flow control frame). This timer is 1000ms. So killing the process before that time or having timeouts < 1100ms would not make sense. But in any case we should not run into a lockup ...

Do you have a log file created with candump -l which either shows the good and the bad case? Maybe your receiving node is not capable to handle the first frame in the no-delay case.

Probably isotpdump can1 -s <can_id1> -d <can_id1> -td -c could also help to identify the missing frames.

tlalexander commented 1 year ago

Hi thank you! Here's what you've requested.

Linux acornv2 5.15.76-v8+ #1597 SMP PREEMPT Fri Nov 4 12:16:41 GMT 2022 aarch64 GNU/Linux

Nothing in dmesg that looks bad to me from initialization of CAN ports until now:

[    6.724382] mcp251x spi0.1 can0: MCP2515 successfully initialized.
[    6.819687] mcp251x spi0.0 can1: MCP2515 successfully initialized.
[    6.843706] spi1.1: ttySC1 at I/O 0x1 (irq = 67, base_baud = 921600) is a SC16IS752
[    7.197605] spi1.0: ttySC3 at I/O 0x1 (irq = 69, base_baud = 921600) is a SC16IS752
[    7.405006] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
[    8.413817] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[    9.629957] bcmgenet fd580000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
[    9.629999] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   12.704746] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   12.711020] Bridge firewalling registered
[   13.060545] Initializing XFRM netlink socket
[   16.093994] vc4-drm gpu: [drm] Cannot find any crtc or sizes
[   33.757859] cam-dummy-reg: disabling
[  369.854926] can: controller area network core
[  369.855038] NET: Registered PF_CAN protocol family
[  369.865952] can: isotp protocol
[19231.256268] can: raw protocol

It is interesting that you say a timeout of 0.1 does not make sense. It's just the default in python-can-isotp, so maybe that should be corrected. As a user, since transmission should succeed in a few milliseconds in my case, a functional 0.1 timeout would be nice if possible. https://github.com/pylessard/python-can-isotp/blob/master/isotp/tpsock/__init__.py#L70

As far as I can tell, candump did not show the problem. It seems like when the problem occurs, there is simply no entry in candump. isotpdump did capture something interesting, an incomplete packet in one instance. however that doesn't always show up.

(0.000371)  can1  456  [8]  [FF] ln: 20   data: 00 00 00 00 00 00 
(0.002614)  can1  123  [8]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(0.000398)  can1  456  [8]  [CF] sn: 1    data: 00 00 00 00 00 00 00 
(0.000372)  can1  456  [8]  [CF] sn: 2    data: 00 00 00 00 00 00 00 
(0.000371)  can1  456  [8]  [FF] ln: 20   data: 00 00 00 00 00 00 
(0.002626)  can1  123  [8]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(0.003558)  can1  456  [8]  [FF] ln: 20   data: 00 00 00 00 00 00 
(0.000677)  can1  123  [8]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(0.000408)  can1  456  [8]  [CF] sn: 1    data: 00 00 00 00 00 00 00 
(0.000372)  can1  456  [8]  [CF] sn: 2    data: 00 00 00 00 00 00 00 
(0.000687)  can1  456  [8]  [FF] ln: 20   data: 00 00 00 00 00 00 
(0.002351)  can1  123  [8]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(0.000410)  can1  456  [8]  [CF] sn: 1    data: 00 00 00 00 00 00 00 
(0.000382)  can1  456  [8]  [CF] sn: 2    data: 00 00 00 00 00 00 00 

Detailed logs attached.

The receiver is an RP2040 CPU running at 133MHz, hooked up to an MCP2515 with a 12Mhz clock (I would prefer 16Mhz but my supplier screwed up.) I had to figure out my own CFG values for the chip due to the 12MHz clock, and I used this online calculator, which gives me the following parameters:

image

I did assume the MCP2510 uses the same values as the MCP2515, since that isn't in the calculator. I used the values from the first row, and included a few more rows in the screenshot in case there are better values I should be using. I don't understand which values are best, though I did skim the datasheet section enough to see what they are doing. This is the calculator: https://www.kvaser.com/support/calculators/bit-timing-calculator/

I don't know if you have any intuition about those receiver specs, but I include them for completeness. The receiver is running a very tight loop with only the isotp receive call and a print statement. It does not change the behavior if I remove the print statement.

Bed time for me now. Thank you so much for taking a look! Can bus is pretty neat. All of our hardware is open source too, designed in Kicad. So any work to fix this will go a long way. Cheers!

candump_normal.log candump-problem.log isotpdump.log isotpdump2.log isotpdump3.log isotpdump_good.log

hartkopp commented 1 year ago

I was able to reproduce the missing transmission of (consecutive) data frames. Either on Linux 5.10 and 6.2-rc2.

(1673025920.176185)  can0  456  [8]  [FF] ln: 20   data: 01 02 03 04 05 06 
(1673025920.178378)  can0  123  [3]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(1673025920.178503)  can0  456  [8]  [CF] sn: 1    data: 07 08 09 0A 0B 0C 0D 
(1673025920.178563)  can0  456  [8]  [CF] sn: 2    data: 0E 0F 10 11 12 13 14 
(1673025920.178597)  can0  456  [8]  [FF] ln: 20   data: 01 02 03 04 05 06 
(1673025920.180812)  can0  123  [3]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms (<- signal CTRL-C)
(1673025920.729768)  can0  456  [8]  [FF] ln: 20   data: 01 02 03 04 05 06 
(1673025920.732406)  can0  123  [3]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(1673025920.732551)  can0  456  [8]  [CF] sn: 1    data: 07 08 09 0A 0B 0C 0D 
(1673025920.732618)  can0  456  [8]  [CF] sn: 2    data: 0E 0F 10 11 12 13 14 
(1673025920.732680)  can0  456  [8]  [FF] ln: 20   data: 01 02 03 04 05 06 
(1673025920.735353)  can0  123  [3]  [FC] FC: 0 = CTS # BS: 0 = off # STmin: 0x00 = 0 ms
(1673025920.735511)  can0  456  [8]  [CF] sn: 1    data: 07 08 09 0A 0B 0C 0D 
(1673025920.735590)  can0  456  [8]  [CF] sn: 2    data: 0E 0F 10 11 12 13 14

Using

on two different laptops connected via CAN USB adapters.

The transmission of the 20 byte PDUs runs successfully until the sending process is terminated with CTRL-C.

This seems to be the only way to stop the PDU transmission after the FF has been sent - no matter whether WAIT_TX_DONE was set (isotpsend option -b) or not.

Is it possible that the timeout handling on userspace/python level sends termination signals into the open isotp socket?

tlalexander commented 1 year ago

Oh gosh, I have discovered that I made an error that explains half of my problem, and unblocks my application development. I used the isotp flag incorrectly. This is incorrect: s = isotp.socket(isotp.socket.flags.WAIT_TX_DONE) WAIT_TX_DONE is a flag which is actually an int. If you pass an int to that constructor, it will use that int as the timeout. So all I was doing was setting a long timeout. The lockup behavior I was experiencing was actually the timeout spinning. It does in fact return after the timeout, but using the flag incorrectly in that way sets the timeout to 1024 seconds. The correct way to use the flag is with set_opts after constructing the socket, for example like this:

s = isotp.socket(timeout=2.0)
s.set_opts(isotp.socket.flags.WAIT_TX_DONE)

I have discovered that if I set that flag properly, I can completely eliminate my sleep between sends and the program does not crash. This unblocks my application development and reminds me that when I am stuck I should sleep on the problem.

However there is still a behavior I don't understand if I am not using the WAIT_TX_DONE flag. Without that flag, if I send data too fast I will hit the timeout, and then all subsequent calls to send will fail. So for this code:

import isotp
import time
import traceback

s = isotp.socket(timeout=2.0)
s.bind("can1", isotp.Address(rxid=0x123, txid=0x456))

ticktime=time.time()
while True:
    try:
        s.send(bytearray(20))
    except Exception as e:
        traceback.print_exc()
        print(time.time()-ticktime)
        ticktime=time.time()
    # If this sleep is 0.003 or less, I get errors.
    time.sleep(0.002)

If I run the above code, I will quickly hit a timeout, and then it will loop and repeatedly hit the timeout, never sending a packet again. You would think that after the two second timeout it would be able to send more data, but it does not. I checked and even if I send a good packet with isotp send from another terminal window, the python program still never sends a good packet, so I presume it is not an issue of bad state with my receiver.

So that behavior seems peculiar to me, and did lead to some confusion, but I can avoid it with proper use of the flag.

It seems slightly unhelpful to me that timeout behaves like a program lockup and cannot be exited with CTRL-C, but that is just setting socket.settimeout(timeout) on the low level python socket, so I am not sure if that has much to do with ISOTP.

As far as your findings hartkopp, it seems that the behavior of my program does not change whether the timeout is 0.1 or if it is 2.0 or more, so the only real issue I have encountered is that after it hits the timeout once, it never recovers. I am curious for your thoughts on that. Thanks for the help and I apologize for my error!

hartkopp commented 1 year ago

Oh gosh, I have discovered that I made an error that explains half of my problem, and unblocks my application development. I used the isotp flag incorrectly. This is incorrect: s = isotp.socket(isotp.socket.flags.WAIT_TX_DONE) WAIT_TX_DONE is a flag which is actually an int. If you pass an int to that constructor, it will use that int as the timeout. So all I was doing was setting a long timeout. The lockup behavior I was experiencing was actually the timeout spinning. It does in fact return after the timeout, but using the flag incorrectly in that way sets the timeout to 1024 seconds. The correct way to use the flag is with set_opts after constructing the socket, for example like this:

s = isotp.socket(timeout=2.0)
s.set_opts(isotp.socket.flags.WAIT_TX_DONE)

Looks better ;-)

I have discovered that if I set that flag properly, I can completely eliminate my sleep between sends and the program does not crash. This unblocks my application development and reminds me that when I am stuck I should sleep on the problem.

I feel you.

However there is still a behavior I don't understand if I am not using the WAIT_TX_DONE flag. Without that flag, if I send data too fast I will hit the timeout, and then all subsequent calls to send will fail. So for this code:

import isotp
import time
import traceback

s = isotp.socket(timeout=2.0)
s.bind("can1", isotp.Address(rxid=0x123, txid=0x456))

ticktime=time.time()
while True:
    try:
        s.send(bytearray(20))
    except Exception as e:
        traceback.print_exc()
        print(time.time()-ticktime)
        ticktime=time.time()
    # If this sleep is 0.003 or less, I get errors.
    time.sleep(0.002)

If I run the above code, I will quickly hit a timeout, and then it will loop and repeatedly hit the timeout, never sending a packet again. You would think that after the two second timeout it would be able to send more data, but it does not. I checked and even if I send a good packet with isotp send from another terminal window, the python program still never sends a good packet, so I presume it is not an issue of bad state with my receiver.

So that behavior seems peculiar to me, and did lead to some confusion, but I can avoid it with proper use of the flag.

It seems slightly unhelpful to me that timeout behaves like a program lockup and cannot be exited with CTRL-C, but that is just setting socket.settimeout(timeout) on the low level python socket, so I am not sure if that has much to do with ISOTP.

I'm not really sure about isotp.socket(timeout=2.0)

When does this timeout trigger?

Without the WAIT_TX_DONE flag the send() syscall directly returns the numbers of bytes to write (here: 20).

It is assumed that the transmission succeeds. When the next send() syscall is started this call is blocked until the first (ongoing) transmission has been finished (which might take some longer time than 2 seconds!).

When there was an error on the first transmission the second syscall returns the error code of that failure. The second PDU is lost.

From this point the socket is ready to get and process the third PDU. So how does the timeout cope with this behavior?

For that reason WAIT_TX_DONE (blocking until the transmission is finished) makes more sense as it directly returns the failure to the specific transmission.

As far as your findings hartkopp, it seems that the behavior of my program does not change whether the timeout is 0.1 or if it is 2.0 or more, so the only real issue I have encountered is that after it hits the timeout once, it never recovers. I am curious for your thoughts on that. Thanks for the help and I apologize for my error!

tlalexander commented 1 year ago

Thanks for the reply.

The timeout triggers when I transmit too quickly without the WAIT_TX_DONE flag. I am not sure exactly what is happening, but if I send data too fast using my loop above, I will send a few packets and then quickly have some kind of issue, and timeout. Then all subsequent calls to send will fail with the same timeout. It feels like that is an error, since the transmissions only take a few milliseconds, and after a two second timeout it should be able to transmit again. I am happy to investigate this further but I am not sure what to check. Maybe there is an issue with the timeout behavior.

hartkopp commented 1 year ago

The timeout triggers when I transmit too quickly without the WAIT_TX_DONE flag.

Ok, good. That fits my expectations.

I am not sure exactly what is happening, but if I send data too fast using my loop above, I will send a few packets and then quickly have some kind of issue, and timeout. Then all subsequent calls to send will fail with the same timeout. It feels like that is an error, since the transmissions only take a few milliseconds, and after a two second timeout it should be able to transmit again.

As described in my last answer the socket provides the first error to the second syscall. So it is not about time (timeout) but about the following send() syscall.

Using the socket without WAIT_TX_DONE was a (too) positive assumption that the transmission will not run into errors and that after an error the programmer is closing the socket. This is some ancient behaviour from the initial implementation 10 years ago. When the driver got into the Linux Mainline kernel I was not sure if we should really block the userspace application when sending the first PDU and I also got no feedback on that from the community. So it stayed in the code.

I am happy to investigate this further but I am not sure what to check. Maybe there is an issue with the timeout behavior.

Thanks but I think, there is nothing more for you to investigate. Your descriptions helped a lot to clear this up. When enabling WAIT_TX_DONE it should just work fine in your setup.

Many thanks, Oliver

tlalexander commented 1 year ago

Ah okay. So if I do not set the flag, once I get the error I would have to close the socket. I think at one point I tried that, and the program locked up, unable to close the socket. It seemed basically impossible to recover from errors like that. But anyway, I will use the flag. Thanks for your help!