timed out while polling for tx timestamp

We see that if a 'certain' process is running, it causes ptp4l to switch into faulty state at some point. After that, phc2sys never syncs back to ptp4l.

This certain process is a proprietary application running as non-root/unprivileged but, at a very high level, it receives UDP packets, transforms the data, sends TCP packets. It does not perform any high frequency scheduling or other unusual activity. The CPU load from this application is 50-70%. We have verified that simply running a CPU stress test with 100% load does not cause ptp4l to go into this faulty state so we don't think CPU load is a contributor by itself.

We have also tried to increase tx_timestamp_timeout to 10ms but same behavior was observed. Network driver is ixgbe on Linux kernel 5.4.

We are looking for any pointers to help resolve this issue and especially figure out what's causing this transition to faulty state.

Log:

May 27 20:28:08 hostname phc2sys[534]: [1061.490] CLOCK_REALTIME phc offset       131 s0 freq      +0 delay   1800
May 27 20:28:09 hostname ptp4l[629]: [1062.033] master offset        -68 s3 freq    -423 path delay      1779
May 27 20:28:09 hostname phc2sys[534]: [1062.490] CLOCK_REALTIME phc offset        59 s0 freq      +0 delay   1807
May 27 20:28:10 hostname ptp4l[629]: [1062.666] master offset        -26 s3 freq    -401 path delay      1779
May 27 20:28:10 hostname ptp4l[629]: [1063.291] master offset         39 s3 freq    -344 path delay      1779
May 27 20:28:10 hostname ptp4l[629]: [1063.296] timed out while polling for tx timestamp
May 27 20:28:10 hostname ptp4l[629]: [1063.296] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
May 27 20:28:10 hostname ptp4l[629]: [1063.296] port 1: send peer delay response failed
May 27 20:28:10 hostname ptp4l[629]: [1063.296] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
May 27 20:28:10 hostname phc2sys[534]: [1063.490] port 2046a1.fffe.06da8d-1 changed state
May 27 20:28:10 hostname phc2sys[534]: [1063.490] reconfiguring after port state change
May 27 20:28:10 hostname phc2sys[534]: [1063.490] selecting eth0 for synchronization
May 27 20:28:10 hostname phc2sys[534]: [1063.490] nothing to synchronize
May 27 20:28:26 hostname ptp4l[629]: [1079.456] port 1: FAULTY to LISTENING on INIT_COMPLETE
May 27 20:28:28 hostname ptp4l[629]: [1080.548] timed out while polling for tx timestamp
May 27 20:28:28 hostname ptp4l[629]: [1080.548] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
May 27 20:28:28 hostname ptp4l[629]: [1080.548] port 1: send peer delay response failed
May 27 20:28:28 hostname ptp4l[629]: [1080.548] port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
May 27 20:28:44 hostname ptp4l[629]: [1096.716] port 1: FAULTY to LISTENING on INIT_COMPLETE
May 27 20:28:45 hostname ptp4l[629]: [1097.779] port 1: new foreign master 020021.fffe.120104-28
May 27 20:28:47 hostname ptp4l[629]: [1099.822] selected best master clock 26cc35.fffe.82241b
May 27 20:28:47 hostname ptp4l[629]: [1099.822] running in a temporal vortex
May 27 20:28:47 hostname ptp4l[629]: [1099.822] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
May 27 20:28:47 hostname ptp4l[629]: [1100.322] master offset        232 s3 freq    -139 path delay      1779
May 27 20:28:47 hostname phc2sys[534]: [1100.492] port 2046a1.fffe.06da8d-1 changed state
May 27 20:28:47 hostname phc2sys[534]: [1100.492] reconfiguring after port state change
May 27 20:28:47 hostname phc2sys[534]: [1100.492] master clock not ready, waiting...
May 27 20:28:48 hostname ptp4l[629]: [1100.957] master offset        -28 s3 freq    -330 path delay      1780
May 27 20:28:49 hostname ptp4l[629]: [1101.581] master offset        -60 s3 freq    -370 path delay      1780
May 27 20:28:49 hostname ptp4l[629]: [1102.208] master offset       -112 s3 freq    -440 path delay      1780
May 27 20:28:50 hostname ptp4l[629]: [1102.833] master offset        -95 s3 freq    -457 path delay      1781

/etc/linuxptp.cfg:

[global]
#
# Options carried over from gPTP.
#
gmCapable       1
priority1       255
priority2       255
logSyncInterval     -3
syncReceiptTimeout  3
neighborPropDelayThresh 800
min_neighbor_prop_delay -20000000
assume_two_step     1
path_trace_enabled  1
follow_up_info      1
transportSpecific   0x1
ptp_dst_mac     01:80:C2:00:00:0E
network_transport   L2
delay_mechanism     P2P
#
# Automotive Profile specific options
#
BMCA            ptp
slaveOnly       1
inhibit_announce    1
asCapable               true
ignore_source_id    1
# Required to quickly correct Time Jumps in master
step_threshold          1
operLogSyncInterval     0
operLogPdelayReqInterval 2
msg_interval_request     1
servo_offset_threshold   30
servo_num_offset_values  10

[eth0]

Process arguments:

/usr/sbin/ptp4l -f /etc/linuxptp.cfg

/usr/sbin/phc2sys -a -r -E ntpshm -M 9 --transportSpecific=1

nxp-archive / openil_linuxptp

timed out while polling for tx timestamp #37