nxp-archive / openil

OpenIL is an open source project based on Buildroot and designed for embedded industrial solution.

[LS1021ATSN - 802.1AS gPTP] Sync fault when I stress cores of the board with priority 99 #88

Closed diegotxegp closed 3 years ago

diegotxegp commented 3 years ago

My network is an LS1021ATSN platform with two boards. 802.1AS (gPTP) is the standard used.

The LS1021ATSN is the GM and one board is the slave (the other is not important for now). When I run ptp4l on both devices, it works fine until I run another tool to simulate load on the cores (4 threads at 25% load), at which point synchronization fails. These are the results. How can I solve this failure?

Stress tool: stress-ng --cpu 4 --cpu_load 25 --sched fifo --sched-prio 99 --times

sudo ptp4l -i enp3s0 -f /gPTP.cfg --tx_timestamp_timeout 20 --step_threshold 0.0 --first_step_threshold 0.00002 -m
[...]
ptp4l[251.624]: rms 4 max 7 freq +16906 +/- 3 delay 232 +/- 0
ptp4l[252.626]: rms 5 max 9 freq +16897 +/- 6 delay 232 +/- 0
ptp4l[253.627]: rms 2 max 4 freq +16898 +/- 3 delay 232 +/- 0
ptp4l[254.629]: rms 2 max 4 freq +16899 +/- 3 delay 232 +/- 0
ptp4l[255.630]: rms 3 max 5 freq +16904 +/- 3 delay 232 +/- 0
ptp4l[256.631]: rms 3 max 5 freq +16905 +/- 4 delay 232 +/- 0
ptp4l[257.632]: rms 3 max 4 freq +16902 +/- 4 delay 232 +/- 0
ptp4l[258.634]: rms 5 max 9 freq +16903 +/- 6 delay 232 +/- 0
ptp4l[259.635]: rms 3 max 6 freq +16905 +/- 4 delay 232 +/- 0
ptp4l[260.784]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[260.785]: rms 2 max 3 freq +16904 +/- 3 delay 232 +/- 0
ptp4l[260.785]: port 1 (enp3s0): SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
ptp4l[260.785]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[260.785]: port 1 (enp3s0): rogue peer delay response
ptp4l[260.785]: port 1 (enp3s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[277.696]: port 1 (enp3s0): FAULTY to LISTENING on INIT_COMPLETE
ptp4l[282.179]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[282.181]: port 1 (enp3s0): new foreign master 00049f.fffe.ef0808-1
ptp4l[282.182]: port 1 (enp3s0): LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[282.182]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[282.182]: port 1 (enp3s0): assuming the grand master role
ptp4l[282.290]: clockcheck: clock jumped forward or running faster than expected!
ptp4l[283.684]: selected best master clock 00049f.fffe.ef0808
ptp4l[283.684]: port 1 (enp3s0): MASTER to UNCALIBRATED on RS_SLAVE
ptp4l[284.419]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[285.045]: rms 515 max 693 freq +16389 +/- 422 delay 232 +/- 0
ptp4l[286.745]: port 1 (enp3s0): SLAVE to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[286.745]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[286.745]: port 1 (enp3s0): assuming the grand master role
ptp4l[286.757]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[286.921]: clockcheck: clock jumped forward or running faster than expected!
ptp4l[287.684]: selected best master clock 00049f.fffe.ef0808
ptp4l[287.685]: port 1 (enp3s0): MASTER to UNCALIBRATED on RS_SLAVE
ptp4l[287.924]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[288.049]: rms 414 max 677 freq +16715 +/- 448 delay 232 +/- 0
ptp4l[289.050]: rms 313 max 559 freq +17487 +/- 170 delay 235 +/- 0
ptp4l[290.658]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[290.658]: port 1 (enp3s0): SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
ptp4l[290.658]: port 1 (enp3s0): UNCALIBRATED to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[290.658]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[290.658]: port 1 (enp3s0): assuming the grand master role
ptp4l[290.678]: timed out while polling for tx timestamp
ptp4l[290.678]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[290.678]: port 1 (enp3s0): send peer delay response failed
ptp4l[290.678]: port 1 (enp3s0): MASTER to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

It fails in all three cases: when the stress tool has priority 99 and ptp4l does not, when both have priority 99 (chrt -f -p 99 "pid of ptp4l"), and when ptp4l has priority 99 and the stress tool only priority 50.

Thank you.

Kind regards, Diego

diegotxegp commented 3 years ago

SOLVED. The problem was that the interrupts of the network card carrying the PTP traffic were landing on the cores where stress-ng was running. The IRQ affinity must be set so that these interrupts are handled only by the non-isolated cores, as in my case.
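
For anyone trying the same fix, here is a minimal sketch of how the IRQ affinity could be set from a shell. It is not the exact procedure used above. It assumes the NIC's interrupts appear in /proc/interrupts under a name containing the interface name (some drivers use a different label), and the helper names cpu_mask, irqs_for and pin_irqs, as well as the choice of CPU 0, are hypothetical.

```shell
#!/bin/sh
# Sketch only: pin every IRQ of one NIC to a chosen set of CPUs.
# Helper names and example values are assumptions, not from the original post.

# Build a hex affinity mask from a comma-separated CPU list ("0,1" gives 3).
cpu_mask() {
    mask=0
    for cpu in $(printf '%s' "$1" | tr ',' ' '); do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Print the IRQ numbers whose /proc/interrupts line mentions the given name.
# An alternative file can be passed as $2 for a dry run.
irqs_for() {
    awk -v name="$1" 'index($0, name) { sub(":", "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# Write the mask to smp_affinity for each matching IRQ (needs root).
pin_irqs() {
    m=$(cpu_mask "$2")
    for irq in $(irqs_for "$1"); do
        echo "$m" > "/proc/irq/$irq/smp_affinity"
    done
}

# Example with hypothetical values: keep enp3s0 interrupts on CPU 0 only.
# pin_irqs enp3s0 0
```

The result can be checked by watching the per-CPU counters in /proc/interrupts while the load is running. If a daemon such as irqbalance is active, it may change the affinity again, so it may need to be stopped or configured to leave these IRQs alone.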

I hope this is useful for someone.