pylessard / python-udsoncan

Python implementation of UDS (ISO-14229) standard.
MIT License
575 stars 199 forks source link

Socket timeout when trying to send data on second socket #178

Closed lemeura closed 11 months ago

lemeura commented 11 months ago

Hi,

I'm currently facing a quite systematic issue when using 2 different sockets to send UDS data on the same bus.

The stack trace ends like this:

    IsoTPSocketConnection.send(self, data)
  File "/venv/lib/python3.8/site-packages/udsoncan/connections.py", line 65, in send
    self.specific_send(payload)
  File "/venv/lib/python3.8/site-packages/udsoncan/connections.py", line 321, in specific_send
    self.tpsock.send(payload)
  File "/venv/lib/python3.8/site-packages/isotp/tpsock/__init__.py", line 89, in send
    return self._socket.send(*args, **kwargs)
socket.timeout: timed out

In details, the aim of my project is to simulate UDS conversation then:

The 2 sockets are opened successfully, the first one sends the request, the second one raises an exception when trying to send.

This setup was working well with old kernels but with recent ones, the timeout is very frequent but not systematic.

I tried without success:

Could you help me to identify the component which introduced this "issue"? I'm not familiar with the different layers in action for this topic:

Thanks in advance, best regards,

pylessard commented 11 months ago

how do you initialize your socket?

@hartkopp: What does it means when send() time out?

pylessard commented 11 months ago

@lemeura Can you share a CAN log? Is there data being sent at all ?

lemeura commented 11 months ago

Hello,

my sockets are initialized like this: socket 1: IsoTPSocketConnection(interface="can0", rxid=467595842, txid=467550962, name="F", addressing_mode=1) socket2 : IsoTPSocketConnection(interface="can0", rxid=1901, txid=1869, name="D", addressing_mode=0)

Please find attached the corresponding candump in which the socket timeout occured during the reading of DiDs.

can.log

FI: The timeout does not occur systematically on the same DiD

Actually, I have to keep running an outdated kernel but it's not very safe as none of security updates can be applied.

Best regards,

pylessard commented 11 months ago

Interesting. I wonder if that could be a bug in the kernel module. Can you check if there's a correlation between the payload size and the issue? I noticed that your communication fails after a payload of 13 bytes which perfectly fits 2 can frames.

I will try to check this, but I will be not so available in the upcoming week. Maybe you can raise that issue in the kernel module repo : https://github.com/hartkopp/can-isotp

lumagi commented 11 months ago

I think this might be related to this bug report. It's a race condition in the kernel module that causes the send to time out if there's a second thread listening on the same socket. There's a patch for the bug here. Maybe you want to try this, but it involves recompiling the isotp module.

pylessard commented 11 months ago

If that is the case, @lemeura you can always switch to the pure python implementation until the patch is officially released. Or you can recompile the module like @lumagi proposed

lemeura commented 11 months ago

Hi @pylessard ,

The issue does not occur systematically on the same frame, sometimes it's failing from the first frame, sometimes it works up to the end (8 Did reading).

@lumagi , @pylessard , I will try to compile the module and I will let you know if it can fix the issue.

Thank you very much for your support.

lemeura commented 11 months ago

Hi,

Unfortunately, I didn't succeed in compiling the module, I faced an error mentioning that is useless as the module is part of kernel since 5.10.

FYI, I didn't apply the patch yet because the source code from master seems to not be the one from which the patch has been done.

Please find below, the outputs:

$ git clone https://github.com/hartkopp/can-isotp.git
Cloning into 'can-isotp'...
remote: Enumerating objects: 599, done.
remote: Counting objects: 100% (162/162), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 599 (delta 111), reused 136 (delta 96), pack-reused 437
Receiving objects: 100% (599/599), 120.91 KiB | 2.69 MiB/s, done.
Resolving deltas: 100% (242/242), done.

$ git log
commit 7626d0a0707391970080d493ce69638719938da7 (HEAD -> master, origin/master, origin/HEAD)
Author: Oliver Hartkopp <socketcan@hartkopp.net>
Date:   Fri May 20 08:16:13 2022 +0200

    isotp: bump out-of-tree version date

    Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>

$ uname -r
5.15.0-84-generic

$ make
make -C /lib/modules/5.15.0-84-generic/build M=~/Documents/can-isotp modules
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-84-generic'
  CC [M]  ~/Documents/can-isotp/net/can/isotp.o
~/Documents/can-isotp/net/can/isotp.c:93:2: error: #error No need to compile this out-of-tree driver! ISO-TP is part of Linux Mainline kernel since Linux 5.10.
   93 | #error No need to compile this out-of-tree driver! ISO-TP is part of Linux Mainline kernel since Linux 5.10.
      |  ^~~~~
make[3]: *** [scripts/Makefile.build:297: ~/Documents/can-isotp/net/can/isotp.o] Error 1
make[2]: *** [scripts/Makefile.build:560: ~/Documents/can-isotp/net/can] Error 2
make[1]: *** [Makefile:1909: ~/Documents/can-isotp] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-84-generic'
make: *** [Makefile:21: modules] Error 2

Could you tell me if I'm working with the correct branch ? Did I miss any steps ?

hartkopp commented 11 months ago

how do you initialize your socket?

@hartkopp: What does it means when send() time out?

Usually when there was no FC answer to a FF first frame within 1000ms.

hartkopp commented 11 months ago

You should just read and follow the error messages ;-)

~/Documents/can-isotp/net/can/isotp.c:93:2: error: #error No need to compile this out-of-tree driver! ISO-TP is part of Linux Mainline kernel since Linux 5.10. 93 | #error No need to compile this out-of-tree driver! ISO-TP is part of Linux Mainline kernel since Linux 5.10.

Could you tell me if I'm working with the correct branch ? Did I miss any steps ?

There is no need to clone any can-isotp repository as the feature of CAN ISOTP sockets is "just there" in your 5.15 kernel.

hartkopp commented 11 months ago

I think this might be related to this bug report. It's a race condition in the kernel module that causes the send to time out if there's a second thread listening on the same socket. There's a patch for the bug here. Maybe you want to try this, but it involves recompiling the isotp module.

@pylessard are you using the epoll() or poll() syscalls within udsoncan? If not, this patch and its description does not apply to the problem IMO.

pylessard commented 11 months ago

@hartkopp : I'm using send() and recv(), that's all.

hartkopp commented 11 months ago

@hartkopp : I'm using send() and recv(), that's all.

That's weird. I will take a deeper look into the log file. Did you ever use the CAN_ISOTP_WAIT_TX_DONE flag from the isotp sockopts?

This flag causes a waiting for the PDU to be transmitted entirely (or returning a timeout error).

lumagi commented 11 months ago

Please also consider this. I originally did a bit of digging in the CPython source code because I just assumed Python would pass the send/recv calls directly to libC. But instead, it always configures the sockets to be nonblocking and uses poll to implement blocking semantics for its sockets.

Edit: this is the main loop that does the polling (or rather select) until the socket is ready.

hartkopp commented 11 months ago

Ok @lumagi , than it makes sense that the issue relates to your fix!

But the CPython implementation should not act against the users expectation and exchange syscalls without notice. I assume this is used for multi-threading.

pylessard commented 11 months ago

Oh, I must add that udsoncan.IsoTPSocketConnection defines a timeout for the socket by default. By calling settimeout(), it puts the socket into non-blocking mode. I'm not exactly sure if that make usage of epoll. I just read this morning what epoll was to be honest.

https://github.com/pylessard/python-udsoncan/blob/master/udsoncan/connections.py#L280 https://github.com/pylessard/python-can-isotp/blob/master/isotp/tpsock/__init__.py#L84

PankajJain5889 commented 11 months ago

> I think this might be related to [this bug report](https://lore.kernel.org/linux-can/04fd32bc-b1c4-b9c3-3f8b-7987704a1f85@hartkopp.net/T/#m9615a4ecbdb742749886af73414e476498c93d51). It's a race condition in the kernel module that causes the send to time out if there's a second thread listening on the same socket. There's a [patch for the bug here](https://lore.kernel.org/linux-can/96d31e8c-fa26-4e32-4c36-768981f20a54@hartkopp.net/T/#u). Maybe you want to try this, but it involves recompiling the isotp module. I was facing similar issue of socket timing out when sending frames over CAN when I moved to Ubuntu 20.04 and using 5.15 kernel. I saw that any kernel greater than 5.15.78 was facing the same issue I used this patch on 5.15.133 and 6.6 rc4 and rebuilt the kernel to verify and found that the said issue is resolved but since this is patch work any chance that above mentioned change will come to mainline and will be adopted in Ubuntu (fossa release)

lumagi commented 11 months ago

Thank you to @hartkopp for fast-tracking the patch. The commit that lead to the regression was introduced to the upstream kernel with release 6.3. That means Ubuntu must be backporting the patches to its LTS kernels. I would assume that they will do so as well for the patch that fixes the issue if it is accepted and released in the next upstream kernel version 6.7.

hartkopp commented 11 months ago

@lumagi we are currently in the 6.6-rc cycle. And when the patch hits the 6.6-rc tree it automatically gets tracked for the kernel.org LTS kernels that have the issue. So it's very likely that Ubuntu will also apply that patch to its LTS kernels.

lumagi commented 11 months ago

That's good to know, thanks for the clarification @hartkopp

pylessard commented 11 months ago

@lemeura : does the conversation above helps you?

lemeura commented 11 months ago

Hi all,

I would like to thank you very much for all these clarifications.

The fact that the patch is in the pipe for future integration in the kernel is a very good news, as actually I'm unable to distribute custom kernel on my different setups.

I just have to be patient :) and I will prepare the migration to future Ubuntu LTS 24.04, potentially including directly a fixed kernel.

Best regards,

pylessard commented 11 months ago

I will close that issue then. @lemeura, please reopen if you believe something needs to be addressed

hartkopp commented 10 months ago

Yes. The patch has been applied to Linus' mainline kernel tree - so it will get attention for the stable kernels soon.