linux-can / can-utils

Linux-CAN / SocketCAN user space applications

Candump losing frames on Raspberry Pi #335

Open DmitriyTyp opened 2 years ago

DmitriyTyp commented 2 years ago

Hello, I have two scripts that run in parallel. One sends requests to the ECU with the cansend command at 20 ms intervals; the other runs candump to record the frames from the ECU to a log file. The protocol is simple: request to ECU -> response from ECU. So I expect to send 157 requests and get 157 responses. When the bus carries only the device<->ECU communication, all frames from the ECU are recorded in the log file in 90% of cases. In the other 10%, 1 frame is not recorded. When I simulate additional CAN frames on the bus, the number of unrecorded frames increases.

My hardware is Raspberry Pi 4 with RS485 CAN HAT board. The bitrate is 250kbps.

After the OS loads, I run this command:

ip link set can0 up type can bitrate 250000 sample-point 0.75

This is how I send requests to the ECU:

cansend can0 7F0#F4.04.00.00.52.80.D8.24
sleep 0.02
cansend can0 7F0#F4.04.00.00.52.80.D8.28

This is my candump command:

candump can0, 7F1:7FF -t A -T 5000 > /home/pi/RC/CAN/Logs/log-$(date +%Y-%m-%d_%H:%M:%S).txt

Now I can't find out the problem of losing frames. Whether it is raspberry limitation or other case?

I would appreciate any comment and add any additional info. Thanks.

hartkopp commented 2 years ago

This is how I send requests to ECU: cansend can0 7F0#F4.04.00.00.52.80.D8.24 sleep 0.02 cansend can0 7F0#F4.04.00.00.52.80.D8.28

This sending setup starts three processes, each scheduled by the system, and probably not on time. Consider using cangen, or use some of its code as a starting point. Or even better: create a simple program that uses CAN_BCM sockets! With the BCM you can specify cyclic messages, or a sequence of cyclic messages, to be sent by a high-precision timer inside the Linux kernel.

https://docs.kernel.org/networking/can.html#broadcast-manager-protocol-sockets-sock-dgram
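As a rough illustration of the BCM suggestion, the sketch below packs a single TX_SETUP message that asks the kernel to send a list of payloads in order at a fixed interval. The function names `build_tx_setup` and `send_sequence` are hypothetical, and the struct layouts and constants are transcribed from `<linux/can/bcm.h>` and `<linux/can.h>` as I understand them. The assumption that `count` equal to the number of frames with `ival2 = 0` yields exactly one pass through the sequence should be verified on the target kernel.

```python
import socket
import struct

# Constants from <linux/can/bcm.h>
TX_SETUP = 1         # opcode: create or update a cyclic transmission task
SETTIMER = 0x0001    # flag: take count/ival1/ival2 from this message
STARTTIMER = 0x0002  # flag: start the timer right away

# struct bcm_msg_head: opcode, flags, count, ival1 (sec, usec),
# ival2 (sec, usec), can_id, nframes. The trailing "0q" pads the head to
# the 8-byte alignment of the struct can_frame array that follows it.
BCM_HEAD = struct.Struct("@3I4l2I0q")
# struct can_frame: can_id, length, 3 padding bytes, 8 data bytes
CAN_FRAME = struct.Struct("=IB3x8s")


def build_tx_setup(can_id, payloads, interval_s):
    """Pack a TX_SETUP message that sends each payload in order."""
    sec = int(interval_s)
    usec = int(round((interval_s - sec) * 1_000_000))
    head = BCM_HEAD.pack(
        TX_SETUP, SETTIMER | STARTTIMER,
        len(payloads),  # count: transmissions spaced by ival1 (assumption: one pass)
        sec, usec,      # ival1: spacing between frames
        0, 0,           # ival2 = 0: no further cyclic sending afterwards
        can_id, len(payloads),
    )
    frames = b"".join(
        CAN_FRAME.pack(can_id, len(p), p.ljust(8, b"\x00")) for p in payloads
    )
    return head + frames


def send_sequence(interface, can_id, payloads, interval_s):
    """Hand the sequence to the kernel (Linux only, needs a CAN interface)."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_DGRAM, socket.CAN_BCM)
    sock.connect((interface,))
    sock.send(build_tx_setup(can_id, payloads, interval_s))
    return sock  # keep it open: closing the socket removes the TX task
```

With the two payloads from the original post, a call might look like `send_sequence("can0", 0x7F0, [bytes.fromhex("F40400005280D824"), bytes.fromhex("F40400005280D828")], 0.02)`. The timing then comes from a kernel timer rather than from process scheduling and `sleep`.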

If you want to stay with the can-utils tools, you can also forge a logfile and send/replay it with 'canplayer'.

This is my candump command: candump can0, 7F1:7FF -t A -T 5000 > /home/pi/RC/CAN/Logs/log-$(date +%Y-%m-%d_%H:%M:%S).txt

Please try:

candump can0,7F1:C00007FF -T 5000 -l

This creates a proper logfile in the current directory that can also be replayed with canplayer.

The resulting logfile is the standard compact (but readable) SocketCAN logfile format and can also be converted with log2long or log2asc if you need some preprocessing with other tools.

In your case it might be (cd /home/pi/RC/CAN/Logs/; candump can0,7F1:C00007FF -T 5000 -l)
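The difference between the masks 7FF and C00007FF can be sketched in a few lines. A SocketCAN filter matches when `(received_id & mask) == (filter_id & mask)`, and the two top bits of the mask cover the extended-frame and remote-frame flags that the kernel stores in the upper bits of the received ID. The function name `filter_matches` and the example extended ID 0x18FF07F1 are illustrative only.

```python
CAN_EFF_FLAG = 0x80000000  # set for extended (29-bit) frames
CAN_RTR_FLAG = 0x40000000  # set for remote transmission requests
CAN_SFF_MASK = 0x000007FF  # the 11 identifier bits of a standard frame


def filter_matches(received_id, filter_id, mask):
    """SocketCAN filter rule: compare only the bits selected by the mask."""
    return (received_id & mask) == (filter_id & mask)


LOOSE_MASK = 0x7FF        # as in "7F1:7FF"
STRICT_MASK = 0xC00007FF  # as in "7F1:C00007FF"

examples = [
    ("standard data frame 7F1", 0x7F1),
    ("remote frame 7F1", 0x7F1 | CAN_RTR_FLAG),
    ("extended frame 18FF07F1", 0x18FF07F1 | CAN_EFF_FLAG),
]
for name, rx_id in examples:
    print(name,
          filter_matches(rx_id, 0x7F1, LOOSE_MASK),
          filter_matches(rx_id, 0x7F1, STRICT_MASK))
# standard data frame 7F1 True True
# remote frame 7F1 True False
# extended frame 18FF07F1 True False
```

With the strict mask, only standard data frames with ID 7F1 reach candump, so extended frames from other nodes whose low 11 bits happen to equal 7F1 are excluded.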

Now I can't find out the problem of losing frames. Whether it is raspberry limitation or other case?

I assume the hints above will fix the problems.

DmitriyTyp commented 2 years ago

I used candump to make a log file, logsnd.log, containing the requests, so that I can replay it with canplayer. The log looks like this:

(1643028277.666468) can0 7F0#FF00
(1643028277.697728) can0 7F0#F40400005280D824
(1643028277.724488) can0 7F0#F40400005280D828
(1643028277.752854) can0 7F0#F40400005280D82C
(1643028277.778638) can0 7F0#F40400005280D830

and so on, up to 157 frames
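To see what timing such a log actually encodes, a small parser for the compact log format can compute the gap between consecutive frames. This is a minimal sketch that only handles classic data frames of the form `(seconds.microseconds) interface ID#DATA`; the names `parse_log` and `gaps_us` are hypothetical, and the sample text is the five log lines quoted above.

```python
import re

# (seconds.microseconds) interface ID#DATA, classic data frames only
LOG_LINE = re.compile(
    r"\((\d+)\.(\d{6})\)\s+(\S+)\s+([0-9A-Fa-f]+)#([0-9A-Fa-f]*)"
)


def parse_log(text):
    """Return (timestamp_us, interface, can_id, data) per recognised line."""
    frames = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            sec, usec, iface, can_id, data = m.groups()
            # Integer microseconds avoid float rounding on epoch timestamps.
            stamp = int(sec) * 1_000_000 + int(usec)
            frames.append((stamp, iface, int(can_id, 16), bytes.fromhex(data)))
    return frames


def gaps_us(frames):
    """Time between consecutive frames, in microseconds."""
    return [b[0] - a[0] for a, b in zip(frames, frames[1:])]


SAMPLE = """\
(1643028277.666468) can0 7F0#FF00
(1643028277.697728) can0 7F0#F40400005280D824
(1643028277.724488) can0 7F0#F40400005280D828
(1643028277.752854) can0 7F0#F40400005280D82C
(1643028277.778638) can0 7F0#F40400005280D830
"""

print(gaps_us(parse_log(SAMPLE)))  # [31260, 26760, 28366, 25784]
```

In this excerpt the gaps are roughly 26 to 31 ms rather than the intended 20 ms, so a replay that follows these timestamps would send more slowly than the original script was meant to.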

and put canplayer into the script cansnd.sh:

#!/bin/bash
cd /home/pi/RC/CAN/Config/; canplayer -I logsnd.log

Then I changed the candump command to:

cd /home/pi/RC/CAN/Logs/; candump can0,7F1:C00007FF -T 5000 -l

Results:

1) Without additional CAN frames on the bus, I recorded 18 files, all 18 without data loss.
2) With simulation of the real bus, I recorded 11 files. Only 5 of them are complete. In the other files, up to 7 frames are missing in the worst case.

hartkopp commented 2 years ago

1. Without additional CAN frames on the bus, I recorded 18 out of 18 files without losing of data.

Good!

2. With simulation of the real bus, I recorded 11 files. Only 5 of them are completed. In other files, I see missing of data up to 7 frames in worst case.

The next step would be to increase the TX queue length in the CAN driver, to be more robust against lost arbitration caused by outside traffic:

ip link set can0 up txqueuelen 500 type can bitrate 250000 sample-point 0.75

By default the txqueuelen is 10.

DmitriyTyp commented 2 years ago

With a TX queue length of 500, I recorded 6 files; only 2 of them were without data loss.

I also tried increasing this value to 3000, 5000 and so on. At 8000, 1 file out of 10 had data loss. With a value of 10000, I recorded 15 files and 3 of them had data loss.

Should I continue with increasing this value? Is there any limitation on that? Why did you suggest 500?

hartkopp commented 2 years ago

500 was just a guess to see what happens. The MCP2515 with SPI is not a very performant setup, and depending on the CAN traffic it has to perform many SPI operations. Do you have an idea of your CAN busload (with and without sending the 7F0 frames)?

canbusload can0@250000

candump has a -d option to see whether the user space application (here: candump) is not fast enough to process the CAN traffic from the socket. But with a RasPi4 and 250kBit/s CAN bitrate this is very unlikely.

I assume either the MCP2515/SPI to cause the issue OR the physical settings (bitrate/sample-point/bus-termination). Which is your communication counterpart CAN node? Is the CAN bus terminated correctly with 2 x 120 Ohms?

DmitriyTyp commented 2 years ago

Do you have an idea about your CAN busload (with and without sending the 7F0 frames)?

In the CANalyzer statistics window, the maximum bus load detected was 19.97%. I think the same value should be visible with canbusload can0@250000
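For orientation, the share of that load caused by the request/response traffic itself can be estimated from the frame length. The sketch below counts the bits of a classic CAN data frame (including the 3-bit interframe space) and optionally adds the worst-case number of stuff bits. The function names are hypothetical, and the example assumes every request and every response carries 8 data bytes, which the thread does not confirm for the responses.

```python
def frame_bits(data_len, extended=False, worst_case_stuffing=False):
    """Bits on the wire for one classic CAN data frame."""
    # Fixed fields plus interframe space: 47 bits (11-bit ID) or 67 bits (29-bit ID)
    bits = (67 if extended else 47) + 8 * data_len
    if worst_case_stuffing:
        # Bit stuffing applies from start-of-frame through the CRC sequence.
        stuffable = (54 if extended else 34) + 8 * data_len
        bits += (stuffable - 1) // 4
    return bits


def bus_load_percent(frames_per_second, bitrate, data_len=8, **kwargs):
    """Share of the bit rate consumed by a stream of identical frames."""
    return 100.0 * frames_per_second * frame_bits(data_len, **kwargs) / bitrate


# Assumption: one 8-byte request every 20 ms and one 8-byte response to each,
# i.e. 100 frames per second in total at 250 kbit/s.
print(bus_load_percent(2 * 50, 250_000))                            # 4.44
print(bus_load_percent(2 * 50, 250_000, worst_case_stuffing=True))  # 5.4
```

Under these assumptions the 7F0/7F1 exchange accounts for only about 4 to 5 percent of the bus, so most of the reported 19.97% would come from the simulated vehicle traffic.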

Regarding the physical settings, the 250 kbps bitrate is correct, as this is what is configured inside the ECU. I set the sample point to 0.75 because that is the value I saw in the CANalyzer HW setup for this bus. Can the bit-timing setup also be a reason? I don't know this topic very deeply.
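As background for the bit-timing question: a CAN bit is divided into time quanta (tq), and the sample point is the fraction of the bit that has elapsed when the bus level is sampled. The sketch below shows the arithmetic; the segment values are illustrative only and are not read from this hardware. The values actually programmed into the controller are printed by `ip -details link show can0`.

```python
def bit_time_tq(prop_seg, phase_seg1, phase_seg2):
    """Total time quanta per bit; the sync segment is always 1 tq."""
    return 1 + prop_seg + phase_seg1 + phase_seg2


def sample_point(prop_seg, phase_seg1, phase_seg2):
    """Fraction of the bit time that elapses before the sample is taken."""
    before_sample = 1 + prop_seg + phase_seg1
    return before_sample / bit_time_tq(prop_seg, phase_seg1, phase_seg2)


def bitrate(tq_ns, prop_seg, phase_seg1, phase_seg2):
    """Bit rate in bit/s for a given time quantum length in nanoseconds."""
    return 1e9 / (tq_ns * bit_time_tq(prop_seg, phase_seg1, phase_seg2))


# Illustrative split: 16 tq per bit, sampled after 12 tq
print(sample_point(3, 8, 4))  # 0.75
print(bitrate(250, 3, 8, 4))  # 250000.0
```

Two nodes can run at the same nominal bit rate with different sample points, which is why comparing the configured value with the other nodes on the bus is a reasonable check.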

I assume either the MCP2515/SPI to cause the issue OR the physical settings (bitrate/sample-point/bus-termination). Which is your communication counterpart CAN node? Is the CAN bus terminated correctly with 2 x 120 Ohms?

I'm communicating with a real vehicle ECU. I just measured the resistance between CAN H and CAN L: it is 60.4 Ohms, which is also correct.

Could you please tell me more about the candump -d option? What can I learn from it? Maybe I should try it. But I also believe that the HW setup is the reason :(

hartkopp commented 2 years ago

E.g. the Seeed CAN FD CAN HAT seems to support up to 40MHz SPI clock https://github.com/raspberrypi/linux/blob/rpi-5.4.y/arch/arm/boot/dts/overlays/seeed-can-fd-hat-v2-overlay.dts

While your RS485 CAN HAT https://www.waveshare.com/wiki/RS485_CAN_HAT seems to support 1MHz or 2MHz. That doesn't seem that much. And sending & receiving might then be more tricky than just receiving.

Since you have a CANalyzer: did you check what happens if you remove the ECU, switch the RasPi and the CANalyzer to a 1 MBit/s bitrate, and generate CAN traffic load (from the PC or the RasPi)?

DmitriyTyp commented 2 years ago

E.g. the Seeed CAN FD CAN HAT seems to support up to 40MHz SPI clock https://github.com/raspberrypi/linux/blob/rpi-5.4.y/arch/arm/boot/dts/overlays/seeed-can-fd-hat-v2-overlay.dts

That seems to be a more powerful board, but also a more expensive one. The goal is to build the device from the cheapest components.

When you have a CANalyser: Did you check, what happens if you remove the ECU and switch the RasPi and the CANalyser to 1MBit/s bitrate and create CAN traffic loads (with the PC or RasPi)?

I made a CANalyzer script that sends random messages with ID 0x7F1 every 10 ms, 157 times. The ECU is disconnected and I added a resistor to the CAN bus. The CANalyzer HW is configured for 1 Mbps. Additional frames are present on the bus. Max load is 3.09%.

My RPi setup:

ip link set can0 txqueuelen 10000 type can bitrate 1000000 sample-point 0.75

candump is executed like this:

candump can0,7F1:C00007F1 -T 3000 -l

So I ran candump several times and, in parallel, ran the script on the PC to send the 7F1 frames. In every case, the logs have data missing. I noticed one thing: if I set the transmission interval of the 7F1 frames to 10 ms, I record more data than if I set it to 100 ms.

hartkopp commented 2 years ago

So I run several times candump and in parallel run script on PC to send 7F1 frames. In every case, I get logs with data missing. I noticed one thing, that if I put transmission rate of 7F1 frames to 10 ms, I record more data, than if I put it to 100 ms.

Args! That looks like some IRQ/wakeup/SPI issue :-/

Maybe @marckleinebudde can help with this as he is pretty deep in the MCP2518FD & SPI topic AFAIK ...

marckleinebudde commented 2 years ago

Try running the raspi at full clock speed:

echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

DmitriyTyp commented 2 years ago

Try running the raspi at full clock speed:

When I run this command in a terminal, it looks like some process keeps running and I can't even close it; I can only switch the power off and on. Should I put it in the background with &?

marckleinebudde commented 2 years ago

Run that command before attaching the system to the CAN bus.

Does a simple sudo work for you?

sudo whoami

DmitriyTyp commented 2 years ago

Does a simple sudo work for you?

Yes, sudo works. OK, I will try putting it in the rc.local file before the CAN bus up command.

marckleinebudde commented 2 years ago

rc.local runs as root, so make it only:

echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor

DmitriyTyp commented 2 years ago

Can I verify the current CPU configuration with this command?

sudo cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Or which command would be the proper one? Running this, I get ondemand for each CPU (0, 1, 2, 3).

marckleinebudde commented 2 years ago

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor shows the active governor, and ondemand shows that setting it to performance was not successful.
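The same check can be scripted, for example to confirm the governor right before a logging run starts. This is a minimal sketch that reads the same sysfs files as the cat command; the function names are hypothetical, and the root directory is a parameter only so that the logic can be exercised without a real sysfs tree.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its active cpufreq governor."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    governors = {}
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # e.g. "cpu0"
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())


if __name__ == "__main__":
    found = read_governors()
    print(found)
    print("all performance:", all_performance(found))
```

Running it once after boot and once before logging shows whether something switched the governor back in between.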

DmitriyTyp commented 2 years ago

I found a way to make all CPUs run in performance mode, by executing this in a terminal:

sudo sh -c "echo -n performance> /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor"

Then I check with

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

and get performance for all cpu*. But if I run this command in rc.local, it stays as ondemand.

DmitriyTyp commented 2 years ago

So, with performance mode enabled, I recorded 25 files. Only 13 of them are good; in the other files, 1 or 2 frames are missing.

While logging the 5th, 6th and 7th files, I noticed an error message in the CANalyzer trace. After the 7th file, the error froze and was not updated anymore.

marckleinebudde commented 2 years ago

Is the outcome better or worse? What does the CAN error message say?

DmitriyTyp commented 2 years ago

It seems to have become more stable, as I saw more missing data earlier. The error message looks like this:

     Time          Chn     ID / Name   Name   DLC   Data                                                
 [-] 6750.803735   CAN 1                            ECC: 110000000xxxxx, Bit Error, Bit Position = 41   
       |  ECC          110000000xxxxx                               
       |  Code         Bit Error                                    
       |  Position     41                                           
       |  ID           10100111111110000001000111101b (14FF023Dx)   
       |  DLC          8                                            
       |  Data 00-07   64 00 01 FF FF FF FF FF 

I think it's related to the CANalyzer Tx frame that I generate to simulate the vehicle bus.

marckleinebudde commented 2 years ago

So there's a BIT error on the bus. Is that due to a software problem on the raspi? Don't think so.

DmitriyTyp commented 2 years ago

So there's a BIT error on the bus. Is that due to a software problem on the raspi? Don't think so.

I think not, because the ID line contains the ID of the frame that I generate on the PC. It is also not a major problem, because the error was present only during 3 logging phases.

hartkopp commented 2 years ago

I would play with the "sample point" and the "three samples" option. In some cases I had problems when the Vector CAN hardware was the ONLY CAN node I was talking to from my Linux CAN interface. After adding a second node (making three CAN nodes in total) the problems disappeared. I assumed the second node (non Vector HW) was more robust with sampling and giving a proper ACK in the ACK field. But this is just an assumption ...

DmitriyTyp commented 2 years ago

What is the "three samples" option? Actually, you saw the results. If I leave only the RPi<->ECU connection plus a simple CANalyzer trace, all data is recorded. As soon as I add simulation of the bus, some data is lost.

On the vehicle, I expect around 5-6 nodes. Previously, I implemented the CAN functions with the python-can library. It was tested successfully on the desk with the RPi<->ECU connection, but on the vehicle I ran into problems. That's why I am trying different solutions now.