Closed philippoo66 closed 11 months ago
Do you have another socket listening on the same CAN IDs? If so, you may be encountering a known race condition in the kernel module. A very similar issue was raised recently; please have a look at https://github.com/pylessard/python-udsoncan/issues/178
A timeout on send should happen if the receiver fails to send a flow control message after receiving a first frame or consecutive frame. If nothing goes out at all, it might indicate a bug.
thank you for the quick reply!
I will discuss in the group if there might be another socket listening, but I don't think so.
The flow control is definitely not the problem since even the first frame isn't 'delivered' onto the bus.
Hey @philippoo66, I did a quick read through the discussion you referenced. The issue sounds very much like the race condition that was discussed in https://github.com/pylessard/python-udsoncan/issues/178. The patch for the issue is on its way to being distributed in the stable kernel releases. As @pylessard suggested in the other thread, you can temporarily switch the UDS client to the all-Python implementation of ISO-TP, i.e. PythonIsoTpConnection. If the issue goes away, it's likely that you're also being affected by this race condition.
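For reference, the switch might look roughly like the sketch below. It assumes python-can and can-isotp are installed; the function name is hypothetical, the channel and CAN IDs are supplied by the caller rather than taken from this issue, and depending on the can-isotp version a different stack class than `isotp.CanStack` may be the better choice.

```python
def make_python_isotp_connection(channel, txid, rxid):
    """Build a udsoncan connection backed by the pure-Python ISO-TP stack.

    channel, txid and rxid are placeholders supplied by the caller.
    """
    # Imported lazily so this module loads even without the CAN packages.
    import can
    import isotp
    from udsoncan.connections import PythonIsoTpConnection

    # SocketCAN is assumed here because the thread is about Linux systems.
    bus = can.Bus(interface="socketcan", channel=channel)
    address = isotp.Address(
        isotp.AddressingMode.Normal_11bits, txid=txid, rxid=rxid
    )
    # ISO-TP segmentation now happens in Python, not in the kernel module.
    stack = isotp.CanStack(bus=bus, address=address)
    return PythonIsoTpConnection(stack)
```

The returned object can be handed to `udsoncan.client.Client` in place of the `IsoTPSocketConnection`. If the timeouts disappear with it, that points towards the kernel race condition.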
The bug was located in the ISO-TP kernel module. It triggers as soon as there are two threads operating on the same ISO-TP kernel socket simultaneously. This condition is always fulfilled if you use IsoTPSocketConnection, which I think is what you're doing judging from the stack trace you printed. The connection class uses a background thread to continuously read input data from the socket. If a send operation occurs on a different thread while it is waiting on input data, the race condition is triggered and can cause the timeout exception to occur. The exception only occurs if the send operation loses the race against the read operation.
hi @lumagi thank you for the input! Please excuse me, I'm not really that deep into these things, so I need to ask some questions...
'We' ourselves / our software are not doing anything else at the same time, but udsoncan/iso-tp certainly might listen to the bus, at least after sending a request, in order to receive the response. But I don't think this is a problem?
Because of the way arbitration works on CAN, attempting to send a frame requires 'listening' to the bus. There are other producers sending messages 'all the time', but I think this kind of listening is done by the CAN controller located on the adapter. This shouldn't affect the ISO-TP kernel module?
Also, we sometimes run candump (certainly in a different thread) to see what happens. Otherwise we wouldn't be able to determine that nothing gets sent onto the bus. But we only started doing so after we experienced the error. Do you think running candump during a read_data_by_identifier() might cause or aggravate the issue?
I will ask the others whether we might switch to the Python-only way in those cases where the issue arises. On the other hand, an error handler and a retry might also do, and if there is a fix in the pipeline... ;-)
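The retry idea could be as small as the sketch below. `call_with_retry` is a hypothetical helper, not part of udsoncan, and which exception to catch depends on how the send timeout surfaces in the application; the built-in `TimeoutError` mentioned afterwards is only an example.

```python
import time


def call_with_retry(operation, retry_on, attempts=3, delay=0.1):
    """Call operation() and retry when it raises one of the retry_on exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            time.sleep(delay)  # short pause before trying again
```

A read would then be wrapped as `call_with_retry(lambda: client.read_data_by_identifier(did), retry_on=TimeoutError)`, where `client` and `did` come from the surrounding application.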
Sure, I will try to answer as best I can. You assumed correctly: IsoTPSocketConnection, which is responsible for sending requests to and receiving responses from the UDS server, is continuously listening for input from the bus. It does so even if it didn't send a request.
The issue is not in the client application; it's a bug in the operating system kernel. This means that the timeout error you're seeing is not actually caused by a timeout or error condition on the physical CAN bus but by an issue in the ISO-TP implementation in the operating system. So your bus and the CAN layer are fine. Running candump at the same time is also not a problem.
The problem occurs when the UDS client is trying to send a request. At this point, the send operation that's trying to send the request is fighting with the receive operation that's waiting for data on the bus. They wouldn't need to be fighting and that's the bug. But they do until one of them times out. If it's the receive operation that times out first, it'll just restart and the send can occur in the meantime. If it's the send operation that times out first, you'll see the error message you reported.
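As a purely illustrative toy model of that "whoever times out first" behaviour (this is not the kernel code, and the lock merely stands in for whatever the two operations contend on), consider:

```python
import threading


def simulate(rx_timeout, tx_timeout):
    """Toy model: a reader blocks the socket for rx_timeout seconds while a
    sender waits at most tx_timeout seconds for its turn."""
    socket_busy = threading.Lock()
    reader_started = threading.Event()

    def reader():
        with socket_busy:
            reader_started.set()
            # Stand-in for a blocking receive that ends with a timeout.
            threading.Event().wait(rx_timeout)

    thread = threading.Thread(target=reader)
    thread.start()
    reader_started.wait()  # make sure the reader got there first

    # The send only goes through if the reader gives up in time.
    if socket_busy.acquire(timeout=tx_timeout):
        socket_busy.release()
        result = "sent"
    else:
        result = "send timed out"
    thread.join()
    return result
```

With a short receive timeout and a long send timeout the send goes through; with the values swapped the send gives up first, which corresponds to the error reported here.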
thank you very much for the explanation! I think I understand so far. What I'm not clear about is who will fix the 'issue in the ISO-TP implementation in the operating system'? You mentioned the kernel - I thought each OS has its own kernel, and since there are so many Linux derivatives, there might be a lot of kernels? And after the fix(es), the OS on every machine running udsoncan/iso-tp needs to get updated? Or is there some "iso-tp 'side kernel'" that comes with the iso-tp installation? (sorry again for my poor knowledge!)
All Linux distributions (Debian, Ubuntu, Raspbian, ...) use the Linux kernel as part of the operating system. The fix for the issue has been accepted into the kernel source code. But the kernel developers are not responsible for updating every distribution. The distribution maintainers must now update their kernel version. Once that's done you'll receive the fixed version with an update.
great, now I got it. Thank you very much for all your explaining!
another question - sorry again! is it possible to say in which kernel version the iso-tp fix is included?
the idea behind it is: for now, install an (old) OS with an (old) kernel version where iso-tp runs fine 24/7, and do not install any update until the OS includes a kernel version with the iso-tp fix?!?
I think it's not possible to change only the kernel within the OS?
Sure, you can try downgrading to a kernel version that's not affected. The patch that causes the regression was introduced in April. If you move to a version that was released before that time you should be fine. But I can't help you with how to do that. It depends on the package management of your distribution and if they still keep the old kernel versions around.
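To find out which kernel a given machine is running, something like the sketch below could be used. The helper names are hypothetical, and since this thread does not name exact affected version numbers, the result still has to be compared by hand against the changelog of the distribution.

```python
import platform
import re


def parse_kernel_release(release):
    """Turn a release string such as '6.1.21-v8+' into a tuple like (6, 1, 21)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def running_kernel():
    """Return the version tuple of the kernel this process runs on (Linux)."""
    return parse_kernel_release(platform.release())
```

Tuples compare element by element in Python, so once the affected range is known, a check such as `first_bad <= running_kernel() < first_fixed` would work, with both bounds filled in from the distribution's release notes.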
@philippoo66 maybe this would also be something for you: @hartkopp created a special release of his out-of-source version of the isotp driver. Check out #93 for a link to the repo. With it, you can manually build the module in a version that already has the fix applied.
thank you very much for the info! sorry again for the questions of an untaught person - as far as I understand, it's not easily possible to include that in our project to temporarily 'bypass' the 'system-isotp'?
You need to be aware that the users of our project are often (similar to me) hardly able to do more than a git clone https://github.com/abnoname/open3e.git and then a pip3 install -r requirements.txt, after which they copy & paste a command line example to 'run the engine'. Anything further they do in their home automation application, where they know what to do.
I'm afraid it's even harder to explain to all of them than to me how to do an out-of-tree build of the isotp driver from a special branch and make the system use it instead of the already installed one?
but don't worry! I'm happy using Buster, and in most other cases the DIDs are polled cyclically, so if a value doesn't get read, it usually will be on the next cycle and everything is fine :-)
Closing this issue. I have added an item to the troubleshooting section of the documentation.
Dear Pier-Yves,
we have a problem occurring (only) from time to time and (only) on some machines/configurations:
Here everything went fine twice, and on the third time _socket.send(*args, **kwargs) failed. What we see is that at this moment nothing gets sent on the CAN bus.
Do you have any idea what is happening and how this can be fixed?
For the configuration of that machine, see https://github.com/abnoname/open3e/issues/10#issuecomment-1767931052 The CAN adapter is an Innomaker USB2CAN.
We have also seen the issue randomly on other systems (Raspi, different OS, different CAN adapter). We also have systems where everything has been running fine 24/7 for months.
If any further information is required, please let us know.
Thank you very much! Phil