We see that if a 'certain' process is running, it causes ptp4l to switch into faulty state at some point. After that, phc2sys never syncs back to ptp4l.
This certain process is a proprietary application running as non-root/unprivileged but, at a very high level, it receives UDP packets, transforms the data, sends TCP packets. It does not perform any high frequency scheduling or other unusual activity. The CPU load from this application is 50-70%. We have verified that simply running a CPU stress test with 100% load does not cause ptp4l to go into this faulty state so we don't think CPU load is a contributor by itself.
We have also tried to increase tx_timestamp_timeout to 10ms but same behavior was observed. Network driver is ixgbe on Linux kernel 5.4.
We are looking for any pointers to help resolve this issue and especially figure out what's causing this transition to faulty state.
Log:
May 27 20:28:08 hostname phc2sys[534]: [1061.490] CLOCK_REALTIME phc offset 131 s0 freq +0 delay 1800
May 27 20:28:09 hostname ptp4l[629]: [1062.033] master offset -68 s3 freq -423 path delay 1779
May 27 20:28:09 hostname phc2sys[534]: [1062.490] CLOCK_REALTIME phc offset 59 s0 freq +0 delay 1807
May 27 20:28:10 hostname ptp4l[629]: [1062.666] master offset -26 s3 freq -401 path delay 1779
May 27 20:28:10 hostname ptp4l[629]: [1063.291] master offset 39 s3 freq -344 path delay 1779
May 27 20:28:10 hostname ptp4l[629]: [1063.296] timed out while polling for tx timestamp
May 27 20:28:10 hostname ptp4l[629]: [1063.296] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
May 27 20:28:10 hostname ptp4l[629]: [1063.296] port 1: send peer delay response failed
May 27 20:28:10 hostname ptp4l[629]: [1063.296] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
May 27 20:28:10 hostname phc2sys[534]: [1063.490] port 2046a1.fffe.06da8d-1 changed state
May 27 20:28:10 hostname phc2sys[534]: [1063.490] reconfiguring after port state change
May 27 20:28:10 hostname phc2sys[534]: [1063.490] selecting eth0 for synchronization
May 27 20:28:10 hostname phc2sys[534]: [1063.490] nothing to synchronize
May 27 20:28:26 hostname ptp4l[629]: [1079.456] port 1: FAULTY to LISTENING on INIT_COMPLETE
May 27 20:28:28 hostname ptp4l[629]: [1080.548] timed out while polling for tx timestamp
May 27 20:28:28 hostname ptp4l[629]: [1080.548] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
May 27 20:28:28 hostname ptp4l[629]: [1080.548] port 1: send peer delay response failed
May 27 20:28:28 hostname ptp4l[629]: [1080.548] port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
May 27 20:28:44 hostname ptp4l[629]: [1096.716] port 1: FAULTY to LISTENING on INIT_COMPLETE
May 27 20:28:45 hostname ptp4l[629]: [1097.779] port 1: new foreign master 020021.fffe.120104-28
May 27 20:28:47 hostname ptp4l[629]: [1099.822] selected best master clock 26cc35.fffe.82241b
May 27 20:28:47 hostname ptp4l[629]: [1099.822] running in a temporal vortex
May 27 20:28:47 hostname ptp4l[629]: [1099.822] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
May 27 20:28:47 hostname ptp4l[629]: [1100.322] master offset 232 s3 freq -139 path delay 1779
May 27 20:28:47 hostname phc2sys[534]: [1100.492] port 2046a1.fffe.06da8d-1 changed state
May 27 20:28:47 hostname phc2sys[534]: [1100.492] reconfiguring after port state change
May 27 20:28:47 hostname phc2sys[534]: [1100.492] master clock not ready, waiting...
May 27 20:28:48 hostname ptp4l[629]: [1100.957] master offset -28 s3 freq -330 path delay 1780
May 27 20:28:49 hostname ptp4l[629]: [1101.581] master offset -60 s3 freq -370 path delay 1780
May 27 20:28:49 hostname ptp4l[629]: [1102.208] master offset -112 s3 freq -440 path delay 1780
May 27 20:28:50 hostname ptp4l[629]: [1102.833] master offset -95 s3 freq -457 path delay 1781
We see that if a 'certain' process is running, it causes ptp4l to switch into faulty state at some point. After that, phc2sys never syncs back to ptp4l.
This certain process is a proprietary application running as non-root/unprivileged but, at a very high level, it receives UDP packets, transforms the data, sends TCP packets. It does not perform any high frequency scheduling or other unusual activity. The CPU load from this application is 50-70%. We have verified that simply running a CPU stress test with 100% load does not cause ptp4l to go into this faulty state so we don't think CPU load is a contributor by itself.
We have also tried to increase
tx_timestamp_timeout
to 10ms but same behavior was observed. Network driver isixgbe
on Linux kernel 5.4.We are looking for any pointers to help resolve this issue and especially figure out what's causing this transition to faulty state.
Log:
/etc/linuxptp.cfg
:Process arguments: