nxp-archive / openil

OpenIL is an open source project based on Buildroot and designed for embedded industrial solution.

LS1021ATSN/gPTP: Sync loss when accessing to clock to read #85

Closed diegotxegp closed 3 years ago

diegotxegp commented 3 years ago

Hello,

My issue is that a sync loss happens when I run some tests on one of the boards in my network.

My network consists of an LS1021ATSN and two boards connected to it using gPTP. One board acts as the master, the switch as a bridge, and the other board as a slave. I am running the ptp4l daemon directly on each device, not ptp4l.service (if ptp4l.service is recommended, let me know).

I synchronize the devices with these commands:

LS1021ATSN:
sudo ptp4l -i swp2 -i swp2 -f /gPTP.cfg --tx_timestamp_timeout 20 -m

NODES:
sudo ptp4l -i swp2 -i swp2 -f /gPTP.cfg --tx_timestamp_timeout 20 -m

It works perfectly, with max offsets of about 7-18 ns, until I run some tests that read the time of the synchronized clock, and suddenly this message appears:

ptp4l[3523.714]: rms 7 max 10 freq +15982 +/- 9 delay 232 +/- 0
ptp4l[3524.715]: rms 6 max 11 freq +15987 +/- 8 delay 233 +/- 0
ptp4l[3525.717]: rms 8 max 17 freq +15981 +/- 10 delay 234 +/- 0
ptp4l[3526.719]: rms 6 max 11 freq +15982 +/- 9 delay 233 +/- 0
ptp4l[3526.872]: timed out while polling for tx timestamp
ptp4l[3526.872]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[3526.872]: port 1 (enp3s0): send peer delay response failed
ptp4l[3526.872]: port 1 (enp3s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[3543.080]: port 1 (enp3s0): FAULTY to LISTENING on INIT_COMPLETE
ptp4l[3546.846]: port 1 (enp3s0): new foreign master 00049f.fffe.ef0808-1
ptp4l[3546.906]: port 1 (enp3s0): LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[3546.906]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[3546.906]: port 1 (enp3s0): assuming the grand master role
ptp4l[3548.846]: selected best master clock aabbcc.fffe.00094a
ptp4l[3548.846]: port 1 (enp3s0): MASTER to UNCALIBRATED on RS_SLAVE
ptp4l[3549.128]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[3549.879]: rms 176 max 297 freq +15694 +/- 128 delay 234 +/- 0
ptp4l[3550.880]: rms 43 max 70 freq +15868 +/- 55 delay 233 +/- 0
ptp4l[3551.882]: rms 63 max 73 freq +15976 +/- 16 delay 232 +/- 0
ptp4l[3552.883]: rms 39 max 58 freq +15994 +/- 10 delay 232 +/- 0
ptp4l[3553.885]: rms 13 max 28 freq +15977 +/- 13 delay 232 +/- 0
ptp4l[3554.886]: rms 7 max 9 freq +15966 +/- 10 delay 232 +/- 0
ptp4l[3555.888]: rms 9 max 15 freq +15959 +/- 10 delay 232 +/- 0
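
For longer captures it can help to reduce a log like the one above to a few numbers. The following is only a sketch: the function name summarize_ptp4l_log is made up for illustration, and it assumes the console format produced by ptp4l -m as shown here, with the max offset in the fifth whitespace-separated field of each "rms" line.

```shell
#!/bin/bash
# Sketch: summarize a ptp4l console log read from stdin.
# Assumed line format (as printed by "ptp4l -m" above):
#   ptp4l[<time>]: rms <n> max <n> freq ... delay ...
summarize_ptp4l_log() {
    awk '
        # Summary lines: field 2 is "rms", field 5 is the max offset in ns.
        $2 == "rms" { n++; if ($5 + 0 > worst + 0) worst = $5 }
        # Any port transition into FAULTY is counted as a sync loss.
        / to FAULTY on / { faults++ }
        END { printf "summaries=%d worst_max_ns=%d faults=%d\n", n, worst, faults }
    '
}

summarize_ptp4l_log <<'EOF'
ptp4l[3526.719]: rms 6 max 11 freq +15982 +/- 9 delay 233 +/- 0
ptp4l[3526.872]: timed out while polling for tx timestamp
ptp4l[3526.872]: port 1 (enp3s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[3549.879]: rms 176 max 297 freq +15694 +/- 128 delay 234 +/- 0
EOF
# prints: summaries=2 worst_max_ns=297 faults=1
```

Piping a saved capture through the same function gives a quick before/after comparison when changing options such as tx_timestamp_timeout.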

Could you help me? I use OpenIL version 1.8 and linuxptp 3.1 on the boards.

vladimiroltean commented 3 years ago

Does the situation improve if you increase the tx_timestamp_timeout to a higher value like 50 ms?

diegotxegp commented 3 years ago

Hello Vladimir,

Does the situation improve if you increase the tx_timestamp_timeout to a higher value like 50 ms?

Not at all. The same text appears. In fact, I tried a huge value, --tx_timestamp_timeout 10000, and this text appears:

ptp4l[1034.934]: rms 6 max 12 freq -285 +/- 8 delay 235 +/- 0
ptp4l[1035.936]: rms 6 max 11 freq -277 +/- 5 delay 235 +/- 0
ptp4l[1036.939]: rms 4 max 7 freq -280 +/- 6
ptp4l[1038.091]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[1038.092]: port 1 (enp3s0): SLAVE to LISTENING on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[1038.092]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[1038.191]: clockcheck: clock jumped forward or running faster than expected!
ptp4l[1039.017]: selected best master clock aabbcc.fffe.00094a
ptp4l[1039.017]: port 1 (enp3s0): LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[1039.932]: port 1 (enp3s0): UNCALIBRATED to LISTENING on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[1039.932]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[1039.932]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[1040.441]: clockcheck: clock jumped forward or running faster than expected!
ptp4l[1040.575]: clockcheck: clock jumped forward or running faster than expected!
ptp4l[1041.017]: selected best master clock aabbcc.fffe.00094a
ptp4l[1041.017]: port 1 (enp3s0): LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[1043.103]: clockcheck: clock jumped backward or running slower than expected!
ptp4l[1043.104]: port 1 (enp3s0): UNCALIBRATED to LISTENING on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[1043.104]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[1043.106]: selected best master clock aabbcc.fffe.00094a
ptp4l[1043.106]: port 1 (enp3s0): LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[1043.106]: port 1 (enp3s0): rogue peer delay response
ptp4l[1043.106]: port 1 (enp3s0): UNCALIBRATED to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[1059.180]: port 1 (enp3s0): FAULTY to LISTENING on INIT_COMPLETE
ptp4l[1062.685]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[1063.021]: port 1 (enp3s0): new foreign master 00049f.fffe.ef0808-1
ptp4l[1065.021]: selected best master clock aabbcc.fffe.00094a
ptp4l[1065.021]: port 1 (enp3s0): LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[1065.392]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[1066.145]: rms 212 max 305 freq -585 +/- 167 delay 236 +/- 1
ptp4l[1067.147]: rms 70 max 102 freq -398 +/- 66 delay 239 +/- 0
ptp4l[1068.150]: rms 92 max 109 freq -262 +/- 14 delay 239 +/- 0
ptp4l[1069.152]: rms 55 max 76 freq -240 +/- 6 delay 238 +/- 0

diegotxegp commented 3 years ago

Apart from the comment just above, I want to ask you some questions.

I have a test network with an LS1021ATSN platform and two boards connected to it on ETH2 and ETH3, according to the labels on the front of the device. These would be swp2 and swp3, as I use in ptp4l.

What I want is to create two different scenarios: one where the LS1021ATSN switch is the GM and the boards are slaves that synchronize their clocks to the GM time, and another where one board is the GM, the LS1021ATSN acts as a bridge, and the other board is a slave receiving the time through the switch. Note that I want to use gPTP (802.1AS), not PTP (IEEE 1588). This detail is important.

My questions are:

1) Is it recommended to use ptp4l.service on the LS1021ATSN switch, and how would that look?
2) Is it better to run the ptp4l daemon on each device (LS1021ATSN, board 1 and board 2) instead of ptp4l.service?
3) What steps would you take if you did that?
4) I want the slave clocks to never go back in time. Never go back. How would I do that? --step_threshold? It is 0.0 by default.
5) I am using OpenIL version 1.8 with its bundled linuxptp version 2.0 on the LS1021ATSN platform, and linuxptp version 3.1 on the boards. Could this cause any bad behaviour?
6) In the latest version of OpenIL, have you fixed errors for the LS1021ATSN platform? Do you recommend that I update my OpenIL OS?

These are the questions I have right now in my mind.

I just want to create the network I described and synchronize the time across the devices to run some tests with a global clock.

Thank you so much for your prompt reply.

vladimiroltean commented 3 years ago

Not at all. The same text appears.

No it doesn't, in the first log you had:

ptp4l[3526.872]: timed out while polling for tx timestamp
ptp4l[3526.872]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug

and in the second log you don't. But now I notice that the logs you're sharing are on some other device, not the LS1021A-TSN board?

It works perfectly, with max offsets of about 7-18 ns, until I run some tests that read the time of the synchronized clock, and suddenly this message appears

What are you doing exactly to break it? Can I see some logs on the actual board?

Is it recommended to use ptp4l.service on the LS1021ATSN switch, and how would that look?

The /usr/lib/systemd/system/ptp4l.service systemd service that is preinstalled uses /etc/linuxptp.cfg which should provide a good starting point for an 802.1AS bridge on swp2-swp5. I don't know of any issues with it. It may not be what you need for your testing though.

Is it better using the ptp4l daemon in each device (LS1021ATSN, board 1 and board 2) instead of the ptp4l.service?

Again, it depends what you need it to do. If by ptp4l daemon you mean starting/stopping ptp4l directly from a script, you could do that too.

What steps would you do if you did that?

I might just happen to have some examples which might give you some ideas. Keep in mind that you'll probably need to adapt them heavily though - just use them as a starting point.

Board A ETH0 <-> Board B SWP2
Board A ETH1 <-> Board B SWP3

 +---------------------------------------------------+
 | LS1021A-TSN board B (bridge)                      |
 |                                                   |
 |   +-----------+   +-----------+   +-----------+   |
 |   |           |   |           |   |           |   |
 |   |           |   |           |   |           |   |
 |   |           |   |     +--------+|           |   |
 |   |    SWP5   |   |    SWP3   |  ||    ETH1   |   |
 |   +-----------+   +-----------+  |+-----------+   |
 |   |           |   |           |  ||           |   |
 |   |           |   |           |  ||           |   |
 |   |           |   |     +------+ ||           |   |
 |   |    SWP4   |   |    SWP2   || ||    ETH0   |   |
 +---------------+---------------+|-|------------+---+
                                  +-----------------------+
                                    |                     |
                                    +------------------+  |
                                                       |  |
 +---------------------------------------------------+ |  |
 | LS1021A-TSN board A (sender & receiver)           | |  |
 |                                                   | |  |
 |   +-----------+   +-----------+   +-----------+   | |  |
 |   |           |   |           |   |           |   | |  |
 |   |           |   |           |   |           |   | |  |
 |   |           |   |           |   |     +-----------+  |
 |   |    SWP5   |   |    SWP3   |   |    ETH1   |   |    |
 |   +-----------+   +-----------+   +-----------+   |    |
 |   |           |   |           |   |           |   |    |
 |   |           |   |           |   |           |   |    |
 |   |           |   |           |   |     +--------------+
 |   |    SWP4   |   |    SWP2   |   |    ETH0   |   |
 +---------------+---------------+---------------+---+

The point of this example is to measure the forwarding latency of the switch when the time-aware scheduler is enabled and some packets are sent in-band with the schedule. OpenIL has a program available called "isochron" which can measure the packet latency by taking MAC-level hardware timestamps of those packets at the sender and at the receiver and comparing them (it assumes that the sender and receiver are synchronized over PTP). There is also the possibility that the isochron sender and receiver run on the same board, and this is precisely what the test does: it sets up an isochron sender on ETH0 of board A and a receiver on ETH1 of the same board, and makes board B a time-aware bridge whose egress port is swp3.

What I did in the attached scripts is simply start/stop PTP on demand, and configure isochron to send 10 packets and measure their latency (I'll spare you the more subtle aspects of the test as they are probably not relevant to the topic which is just PTP).

Create these script files in a folder and transfer that folder to both boards:

check_sync.sh:

#!/bin/bash

set -e -u -o pipefail

scrape_logs_for_phc2sys_offset() {
    local awk_program='/phc2sys/ { print $10; exit; }'

    journalctl -b -n 10 --no-pager > ptp.log

    echo $(cat ptp.log | awk "${awk_program}")
}

scrape_logs_for_ptp4l_offset() {
    local awk_program='/ptp4l/ { print $8; exit; }'

    journalctl -b -n 10 --no-pager > ptp.log

    echo $(cat ptp.log | awk "${awk_program}")
}

check_sync_phc2sys() {
    local threshold_ns=50
    local system_clock_offset=

    while :; do
        sleep 1

        system_clock_offset=$(scrape_logs_for_phc2sys_offset)

        # Got something, is it a number?
        case "${system_clock_offset}" in
        ''|[!\-][!0-9]*)
            if ! pidof phc2sys > /dev/null; then
                echo "Please start the phc2sys service."
                return 1
            else
                echo "No message from phc2sys, trying again..."
                continue
            fi
            ;;
        esac

        if [ "${system_clock_offset}" -lt 0 ]; then
            system_clock_offset=$((-${system_clock_offset}))
        fi
        echo "System clock offset ${system_clock_offset} ns"
        if [ "${system_clock_offset}" -gt "${threshold_ns}" ]; then
            echo "System clock is not yet synchronized..."
            continue
        fi
        # Success
        break
    done
}

check_sync_ptp4l() {
    local threshold_ns=100
    local phc_offset=

    while :; do
        sleep 1

        phc_offset=$(scrape_logs_for_ptp4l_offset)

        # Got something, is it a number?
        case "${phc_offset}" in
        ''|[!\-][!0-9]*)
            if ! pidof ptp4l > /dev/null; then
                echo "Please start the ptp4l service."
                return 1
            else
                echo "No message from ptp4l, trying again..."
                continue
            fi
            ;;
        esac

        echo "Master offset ${phc_offset} ns"
        if [ "${phc_offset}" -lt 0 ]; then
            phc_offset=$((-${phc_offset}))
        fi
        if [ "${phc_offset}" -gt "${threshold_ns}" ]; then
            echo "PTP clock is not yet synchronized..."
            continue
        fi
        # Success
        break
    done
}

check_sync_ptp4l
check_sync_phc2sys
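
As far as I can tell, the case pattern used in check_sync.sh above lets some non-numeric strings through (a single character such as "a", or a value such as "12x"), which then trip the -lt and -gt tests. A stricter check could look like the sketch below; the helper names is_integer and abs_value are made up for illustration and are not part of the original scripts.

```shell
#!/bin/bash
# Sketch: succeed only if $1 is an optionally signed decimal integer.
is_integer() {
    case "$1" in
    ''|-) return 1 ;;           # empty string, or a lone minus sign
    -*) set -- "${1#-}" ;;      # strip one leading minus
    esac
    case "$1" in
    *[!0-9]*) return 1 ;;       # anything other than digits remains
    esac
    return 0
}

# Sketch: print the absolute value of an integer already validated above.
abs_value() {
    echo "${1#-}"
}

if is_integer "-42"; then
    echo "offset magnitude: $(abs_value "-42") ns"   # prints "offset magnitude: 42 ns"
fi
```

In the polling loops, a failed is_integer check would take the place of the case branch that prints "trying again..." and continues.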

ptp_start.sh:

#!/bin/bash

set -e -u -o pipefail

board=${1:-}

# Pin the command to CPU 0 and run it with SCHED_RR priority 20
schedule() {
    taskset $((1 << 0)) chrt -r 20 "$@"
}

restart_ptp() {
    # Stop any running instances; ignore the error if none are running
    killall ptp4l phc2sys || :

    case ${board} in
    1)
        schedule ptp4l -i eth0 -i eth1 -f /etc/ptp4l_cfg/gPTP.cfg \
            --gmCapable 0 --tx_timestamp_timeout 50 \
            --step_threshold 0.00002 --first_step_threshold 0.00002 &
        sleep 1
        schedule phc2sys -a -rr --transportSpecific 0x1 \
            --step_threshold 0.00002 --first_step_threshold 0.00002 &
        ;;
    2)
        # Set initial time somewhere in 2021 and run with that
        phc_ctl CLOCK_REALTIME set 1612270628.000000000 && phc_ctl CLOCK_REALTIME freq 0
        phc_ctl /dev/ptp1 set 1612270628.000000000 && phc_ctl /dev/ptp1 freq 0

        schedule ptp4l -i swp2 -i swp3 -f /etc/ptp4l_cfg/gPTP.cfg \
            --tx_timestamp_timeout 50 \
            --step_threshold 0.00002 --first_step_threshold 0.00002 &
        sleep 1
        schedule phc2sys -a -rr --transportSpecific 0x1 \
            --step_threshold 0.00002 --first_step_threshold 0.00002 &
        ;;
    *)
        echo "Usage: ptp_start.sh 1|2"
        exit 1
        ;;
    esac
}

restart_ptp

test_taprio_2port.sh:

#!/bin/bash

export TOPDIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)

do_cleanup() {
    echo "Cleaning up"
    killall ptp4l phc2sys
    ip link del dev br0
    tc qdisc del dev swp2 clsact
    tc qdisc del dev swp3 clsact
    tc qdisc del dev swp3 root
}
trap do_cleanup EXIT

systemctl disable --now systemd-networkd
systemctl disable --now ptp4l
systemctl disable --now phc2sys
# No NTP
systemctl disable --now systemd-timesyncd

devlink dev param set spi/spi0.1 name best_effort_vlan_filtering value true cmode runtime
ip link del br0

ip link set eth2 up
ip link set swp2 up
ip link set swp3 up
ip link set swp4 up
ip link set swp5 up

ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp3 master br0
ip link set swp4 master br0
ip link set swp5 master br0
ip link set br0 up
bridge vlan add dev swp2 vid 100 master
bridge vlan add dev swp3 vid 100 master
bridge vlan add dev br0 vid 100 self
tc qdisc add dev swp2 clsact
tc qdisc add dev swp3 clsact
tc qdisc add dev swp4 clsact
tc qdisc add dev swp5 clsact
tc qdisc replace dev swp3 parent root handle 100 taprio \
    num_tc 8 \
    map 0 1 2 3 4 5 6 7 \
    queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
    base-time 0 \
    sched-entry S 81 50000000 sched-entry S 82 50000000 \
    sched-entry S 84 50000000 sched-entry S 88 50000000 \
    sched-entry S 90 50000000 sched-entry S a0 50000000 \
    sched-entry S c0 50000000 sched-entry S 80 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    sched-entry S 81 50000000 sched-entry S 81 50000000 \
    flags 0x2
${TOPDIR}/ptp_start.sh 2
echo "Config done, waiting to be killed with Ctrl-C when test is done"
read
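
The hex values after "sched-entry S" in the taprio command above are gate masks, where bit N set means the gate of traffic class N is open during that entry. A small sketch for decoding them follows; the function name open_gates is made up for illustration.

```shell
#!/bin/bash
# Sketch: list the traffic classes whose gate is open for a taprio
# gate mask given in hex (bit N set = traffic class N open).
open_gates() {
    local mask=$((16#$1)) tc out=""
    for tc in 0 1 2 3 4 5 6 7; do
        if (( (mask >> tc) & 1 )); then
            out="${out:+$out }$tc"
        fi
    done
    echo "$out"
}

open_gates a0   # prints "5 7"
open_gates 81   # prints "0 7"
```

With the identity map "0 1 2 3 4 5 6 7" used above, the a0 entry is the one that opens traffic class 5, which would be the class used by a sender running with priority 5 (assuming that priority is carried through unchanged to the switch).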

test_taprio_sender.sh:

#!/bin/bash

export TOPDIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)

# Run the locally built isochron binary pinned to CPU 1
isochron() {
    taskset $((1 << 1)) "${TOPDIR}/isochron" "$@"
}

do_cleanup() {
    echo "Cleaning up"
    killall ptp4l phc2sys
    killall isochron
}
trap do_cleanup EXIT

systemctl disable --now systemd-networkd
systemctl disable --now ptp4l
systemctl disable --now phc2sys
# No NTP
systemctl disable --now systemd-timesyncd
ip link del br0

ip link set eth0 up
ip link set eth1 up

${TOPDIR}/ptp_start.sh 1
${TOPDIR}/check_sync.sh

dmac=$(ip link show eth1 | awk '/link\/ether/ { print $2 }')

isochron rcv \
    --interface eth1 \
    --sched-rr \
    --sched-priority 98 \
    --etype 0xdead &

# base-time offset is aligned with the "a0" gate entry
isochron send \
    --interface eth0 \
    --dmac ${dmac} \
    --priority 5 \
    --base-time    0.300000000 \
    --advance-time 0.000000000 \
    --cycle-time   1.000000000 \
    --num-frames 10 \
    --frame-size 64 \
    --vid 100 \
    --client 127.0.0.1 \
    --etype 0xdead \
    --sched-rr \
    --sched-priority 98

The scripts should be run as follows:

# On board B:
./test_taprio_2port.sh
# On board A:
./test_taprio_sender.sh

By the way, if you do end up actually running this test, you'll have to compile the isochron program from source, from the master branch, since some of the options were newly added: https://github.com/vladimiroltean/tsn-scripts

Just FYI, when packets are sent in-band with the schedule you get the following isochron report:

       Now: 1612270806.516733485
 Base time: 1612270808.300000000
Cycle time: 1.000000000
Collecting receiver stats
Accepted connection from 127.0.0.1
seqid 1 gate 1612270809.300000000 wakeup 1612270808.300037279 tx 1612270808.300118588 rx 1612270808.300122778 arrival 1612270808.300291857
seqid 2 gate 1612270810.300000000 wakeup 1612270809.300035836 tx 1612270809.300098918 rx 1612270809.300103043 arrival 1612270809.300241291
seqid 3 gate 1612270811.300000000 wakeup 1612270810.300043384 tx 1612270810.300106683 rx 1612270810.300110853 arrival 1612270810.300248119
seqid 4 gate 1612270812.300000000 wakeup 1612270811.300037082 tx 1612270811.300099688 rx 1612270811.300103868 arrival 1612270811.300242537
seqid 5 gate 1612270813.300000000 wakeup 1612270812.300042405 tx 1612270812.300104683 rx 1612270812.300108868 arrival 1612270812.300244900
seqid 6 gate 1612270814.300000000 wakeup 1612270813.300036355 tx 1612270813.300098728 rx 1612270813.300102858 arrival 1612270813.300238850
seqid 7 gate 1612270815.300000000 wakeup 1612270814.300041639 tx 1612270814.300104813 rx 1612270814.300108953 arrival 1612270814.300246934
seqid 8 gate 1612270816.300000000 wakeup 1612270815.300036135 tx 1612270815.300099393 rx 1612270815.300103528 arrival 1612270815.300240389
seqid 9 gate 1612270817.300000000 wakeup 1612270816.300035708 tx 1612270816.300098818 rx 1612270816.300103013 arrival 1612270816.300236442
seqid 10 gate 1612270818.300000000 wakeup 1612270817.300037335 tx 1612270817.300099443 rx 1612270817.300103618 arrival 1612270817.300225189

Summary:
Path delay: min 4125 max 4195 mean 4162.500 stddev 25.617, min at seqid 2, max at seqid 9
Wakeup to HW TX timestamp: min 62108 max 81309 mean 64659.700 stddev 5565.463, min at seqid 10, max at seqid 1
Latency budget: min 999877222 max 999897142 mean 999892862.000 stddev 5940.677, min at seqid 1, max at seqid 6
Wakeup latency: min 35708 max 43384 mean 38315.800 stddev 2802.886, min at seqid 9, max at seqid 3
Arrival latency (HW RX timestamp to application): min 121571 max 169079 mean 138512.800 stddev 11244.249, min at seqid 10, max at seqid 1
HW TX deadline misses: 0 (0.000%)
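
The "Path delay" figures in the summary appear to be the hardware RX timestamp minus the hardware TX timestamp of each frame: the reported minimum (4125 ns at seqid 2) and maximum (4195 ns at seqid 9) both match the per-frame lines. A sketch of that subtraction follows; the helper name ts_diff_ns is made up for illustration, and it uses integer arithmetic because a floating-point double cannot hold all 19 significant digits of these timestamps.

```shell
#!/bin/bash
# Sketch: difference in nanoseconds between two "sec.nsec" timestamps
# (first minus second), assuming exactly 9 digits after the dot.
ts_diff_ns() {
    local a_s=${1%.*} a_ns=${1#*.} b_s=${2%.*} b_ns=${2#*.}
    # 10# forces base 10 so that leading zeros are not read as octal.
    echo $(( (a_s - b_s) * 1000000000 + 10#$a_ns - 10#$b_ns ))
}

# rx and tx of seqid 2 in the report above
ts_diff_ns 1612270809.300103043 1612270809.300098918   # prints 4125
```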

The reason why I posted these scripts is because it took me a while to get something going in terms of starting isochron only once the PTP clocks are synchronized. The scripts I came up with are kind of nasty, but they do work - I tested them on OpenIL 1.10 for LS1021A-TSN a few days ago. I also opened this discussion thread about having a better solution to wait until ptp4l and phc2sys are synchronized, but that didn't lead anywhere yet: https://sourceforge.net/p/linuxptp/mailman/linuxptp-users/thread/20210112180601.zcftc7kmjc3m4v5l%40skbuf/#msg37196011

I want the slave clocks to never go back in time. Never go back. How would I do that? --step_threshold? It is 0.0 by default.

The step_threshold option just enables clock stepping (time jumps) which can be either forward or backwards - no way to control it. Clock stepping is performed when the offset between the slave clock and the grandmaster is higher than this threshold. Which brings me to my main point: slave clocks follow the time of the grandmaster. So if you don't want slave clocks to go back in time, just stop making the grandmaster go back in time. If you have some kind of script which resets the time on the GM after each test, then stop doing that. Just set the time on the GM once.
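
To make the threshold concrete: the scripts above pass --step_threshold 0.00002, i.e. 20 microseconds. The sketch below only illustrates the comparison described in the previous paragraph; the function name correction_mode is made up, and it treats a threshold of 0 as "never step", which is how the linuxptp documentation describes the default.

```shell
#!/bin/bash
# Sketch: would an offset be corrected by stepping or by slewing?
#   $1 = offset from the master in ns (may be negative)
#   $2 = step_threshold in seconds (0 = stepping disabled)
correction_mode() {
    awk -v off="$1" -v thr="$2" 'BEGIN {
        if (off < 0) off = -off
        if (thr > 0 && off > thr * 1e9) print "step"; else print "slew"
    }'
}

correction_mode 297 0.00002        # prints "slew" (297 ns is below 20 us)
correction_mode -1000000 0.00002   # prints "step" (1 ms, in either direction)
correction_mode 1000000 0          # prints "slew" (stepping disabled)
```

The second call shows the point made above: the sign of the offset plays no role in the decision, so a step can move the clock in either direction.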

I am using OpenIL version 1.8 with its bundled linuxptp version 2.0 on the LS1021ATSN platform, and linuxptp version 3.1 on the boards. Could this cause any bad behaviour?

There have been some fixes in upstream linuxptp since release tag 2.0, however nothing stands out as obviously relevant to me right now. By the way, OpenIL 1.10 integrates linuxptp 3.1.

In the latest version of OpenIL, have you fixed errors for the LS1021ATSN platform? Do you recommend that I update my OpenIL OS?

Here is the list of kernel patches between release tags OpenIL-v1.8-linux-202005 and OpenIL-v1.10-linux-202012, sorted in reverse chronological order (with some irrelevant/trivial patches removed):

0bd5363abc7c net: dsa: sja1105: poll for extts events from a timer
02d9d4085ff5 net: dsa: sja1105: fix tc-gate schedule with single element
ab9e89b5f93e net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
cc38999af796 net: dsa: sja1105: unconditionally free old gating config
8ff5cd39f001 net: dsa: sja1105: fix checks for VLAN state in gate action
4f24e2421cff net: dsa: sja1105: fix checks for VLAN state in redirect action
ac7f4b81bacc net: dsa: tag_8021q: stop restoring VLANs from bridge
4565567af0c2 net: dsa: sja1105: enable internal pull-down for RX_DV/CRS_DV/RX_CTL and RX_ER
0dad6b43afd8 net: dsa: sja1105: fix port mirroring for P/Q/R/S
adc93ad9ad0e net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
9b51bf97c1af net: dsa: sja1105: offload the Credit-Based Shaper qdisc
da1075c01423 net: dsa: sja1105: request promiscuous mode for master
4976214678b3 net: dsa: allow drivers to request promiscuous mode on master
359e3150df30 net: dsa: sja1105: avoid invalid state in sja1105_vlan_filtering
2eb54d78875a docs: net: dsa: sja1105: document the best_effort_vlan_filtering option
ab3838b00166 net: dsa: sja1105: implement VLAN retagging for dsa_8021q sub-VLANs
841daf26e3f7 net: dsa: sja1105: implement a common frame memory partitioning function
a6058bad75ff net: dsa: sja1105: add packing ops for the Retagging Table
7441ca14aec9 net: dsa: sja1105: add a new best_effort_vlan_filtering devlink parameter
b7be3a8cd9a0 net: dsa: tag_sja1105: implement sub-VLAN decoding
4e14879736f3 net: dsa: tag_8021q: support up to 8 VLANs per port using sub-VLANs
9feda4edb6bb net: dsa: sja1105: prepare tagger for handling DSA tags and VLAN simultaneously
fcfe288ffc79 net: dsa: sja1105: exit sja1105_vlan_filtering when called multiple times
2e304da8f934 net: dsa: sja1105: save/restore VLANs using a delta commit method
3bd72c75802c net: dsa: sja1105: deny alterations of dsa_8021q VLANs from the bridge
3382628c4d2f net: dsa: sja1105: keep the VLAN awareness state in a driver variable
a0713e3365d0 net: dsa: tag_8021q: introduce a vid_is_dsa_8021q helper
aac931ec928b net: dsa: sja1105: Revert pre-mainline implementation of best_effort_vlan_filtering

So yes, there are some changes. In terms of bug fixes this is probably the most relevant; it first appeared in OpenIL 1.9:

adc93ad9ad0e net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles

diegotxegp commented 3 years ago

Not at all. The same text appears.

No it doesn't, in the first log you had:

ptp4l[3526.872]: timed out while polling for tx timestamp
ptp4l[3526.872]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug

and in the second log you don't. But now I notice that the logs you're sharing are on some other device, not the LS1021A-TSN board?

The log is from one of the boards connected to the LS1021ATSN platform, specifically the board that acts as master in the network I described: master (node 1) -> bridge (LS1021ATSN) -> slave (node 2). In fact, the log of the LS1021ATSN is not affected when the log of node 1 is.

It works perfectly, with max offsets of about 7-18 ns, until I run some tests that read the time of the synchronized clock, and suddenly this message appears

What are you doing exactly to break it? Can I see some logs on the actual board?

What I do is run a test that reads CLOCK_REALTIME 1 million times; the PHC time is mapped onto it with phc2sys. Node 1 then loses sync and recovers about 10 seconds later.
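
For reference, this is only a sketch of the kind of check such a test can make, not my actual test program: it counts how often consecutive readings of the clock go backwards. The function name count_backward_steps is made up for illustration, and the sampling loop in the comment assumes GNU date with %N support and is far slower than calling clock_gettime() directly.

```shell
#!/bin/bash
# Sketch: count how often consecutive clock readings go backwards.
# Reads one integer nanosecond timestamp per line from stdin.
count_backward_steps() {
    local prev="" now count=0
    while read -r now; do
        if [ -n "$prev" ] && [ "$now" -lt "$prev" ]; then
            count=$((count + 1))
        fi
        prev=$now
    done
    echo "$count"
}

# Example sampling loop (GNU date assumed):
#   for i in $(seq 1000); do date +%s%N; done | count_backward_steps

printf '%s\n' 100 200 150 300 | count_backward_steps   # prints 1
```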

Is it recommended to use ptp4l.service on the LS1021ATSN switch, and how would that look?

The /usr/lib/systemd/system/ptp4l.service systemd service that is preinstalled uses /etc/linuxptp.cfg which should provide a good starting point for an 802.1AS bridge on swp2-swp5. I don't know of any issues with it. It may not be what you need for your testing though.

For master (LS1021ATSN) --> slave (any node), I ran the ptp4l daemon on each device, as I described in the first message:

LS1021ATSN (gPTP.cfg with gmCapable 0):
sudo ptp4l -i swp2 -i swp2 -f /gPTP.cfg --tx_timestamp_timeout 20 -m

NODES:
sudo ptp4l -i enp3s0 -f /gPTP.cfg --tx_timestamp_timeout 20 -m

I did this because, when using ptp4l.service for this scenario, the LS1021ATSN delivered a different time to the PHC of the node than the time I forced on it using phc_ctl /dev/ptp1 set 1234432534.32432523 (for example). Some months ago, you told me in another conversation that I must change the time of the LS1021ATSN because there was an error if I kept the epoch time (1970). In addition, the offsets are higher when using ptp4l.service. Conversely, for the other case, master (node 1) --> bridge (LS1021ATSN) --> slave (node 2), I used ptp4l.service on the LS1021ATSN because it took the time from the master and delivered it to the slave with good offsets, and likewise when I use the ptp4l daemon.

When using /etc/linuxptp.cfg, does the LS1021ATSN platform know that I am using gPTP.cfg on the other nodes, or should I use /etc/gPTP.cfg on the LS1021ATSN platform? In the beginning I used linuxptp.cfg to enable and disable the gmCapable option on the LS1021ATSN platform, but after these sync loss problems I started using /etc/gPTP.cfg to change the gmCapable option, in case the previous way was wrong. Please shed some light on this. Remember that I want to use gPTP (802.1AS).

Is it better using the ptp4l daemon in each device (LS1021ATSN, board 1 and board 2) instead of the ptp4l.service?

Again, it depends what you need it to do. If by ptp4l daemon you mean starting/stopping ptp4l directly from a script, you could do that too.

OK. I would prefer to use the same mechanism for both scenarios: [1] slave (node 1) <-- master (LS1021ATSN) --> slave (node 2) and [2] master (node 1) --> bridge (LS1021ATSN) --> slave (node 2).

What steps would you do if you did that?

I might just happen to have some examples which might give you some ideas. Keep in mind that you'll probably need to adapt them heavily though - just use them as a starting point.

Board A ETH0 <-> Board B SWP2
Board A ETH1 <-> Board B SWP3

 +---------------------------------------------------+
 | LS1021A-TSN board B (bridge)                      |
 |                                                   |
 |   +-----------+   +-----------+   +-----------+   |
 |   |           |   |           |   |           |   |
 |   |           |   |           |   |           |   |
 |   |           |   |     +--------+|           |   |
 |   |    SWP5   |   |    SWP3   |  ||    ETH1   |   |
 |   +-----------+   +-----------+  |+-----------+   |
 |   |           |   |           |  ||           |   |
 |   |           |   |           |  ||           |   |
 |   |           |   |     +------+ ||           |   |
 |   |    SWP4   |   |    SWP2   || ||    ETH0   |   |
 +---------------+---------------+|-|------------+---+
                                  +-----------------------+
                                    |                     |
                                    +------------------+  |
                                                       |  |
 +---------------------------------------------------+ |  |
 | LS1021A-TSN board A (sender & receiver)           | |  |
 |                                                   | |  |
 |   +-----------+   +-----------+   +-----------+   | |  |
 |   |           |   |           |   |           |   | |  |
 |   |           |   |           |   |           |   | |  |
 |   |           |   |           |   |     +-----------+  |
 |   |    SWP5   |   |    SWP3   |   |    ETH1   |   |    |
 |   +-----------+   +-----------+   +-----------+   |    |
 |   |           |   |           |   |           |   |    |
 |   |           |   |           |   |           |   |    |
 |   |           |   |           |   |     +--------------+
 |   |    SWP4   |   |    SWP2   |   |    ETH0   |   |
 +---------------+---------------+---------------+---+

The point of this example is to measure the forwarding latency of the switch when the time-aware scheduler is enabled and some packets are sent in-band with the schedule. OpenIL has a program available called "isochron" which can measure the packet latency by taking MAC-level hardware timestamps of those packets at the sender and at the receiver and comparing them (it assumes that the sender and receiver are synchronized over PTP). There is also the possibility that the isochron sender and receiver run on the same board, and this is precisely what the test does: it sets up an isochron sender on ETH0 of board A and a receiver on ETH1 of the same board, and makes board B a time-aware bridge whose egress port is swp3.

What I did in the attached scripts is simply start/stop PTP on demand, and configure isochron to send 10 packets and measure their latency (I'll spare you the more subtle aspects of the test as they are probably not relevant to the topic which is just PTP).

Create these script files in a folder and transfer that folder to both boards:

check_sync.sh:

#!/bin/bash

set -e -u -o pipefail

scrape_logs_for_phc2sys_offset() {
  local awk_program='/phc2sys/ { print $10; exit; }'

  journalctl -b -n 10 --no-pager > ptp.log

  echo $(cat ptp.log | awk "${awk_program}")
}

scrape_logs_for_ptp4l_offset() {
  local awk_program='/ptp4l/ { print $8; exit; }'

  journalctl -b -n 10 --no-pager > ptp.log

  echo $(cat ptp.log | awk "${awk_program}")
}

check_sync_phc2sys() {
  local threshold_ns=50
  local system_clock_offset=

  while :; do
      sleep 1

      system_clock_offset=$(scrape_logs_for_phc2sys_offset)

      # Got something, is it a number?
      case "${system_clock_offset}" in
      ''|[!\-][!0-9]*)
          if ! pidof phc2sys > /dev/null; then
              echo "Please start the phc2sys service."
              return 1
          else
              echo "No message from phc2sys, trying again..."
              continue
          fi
          ;;
      esac

      if [ "${system_clock_offset}" -lt 0 ]; then
          system_clock_offset=$((-${system_clock_offset}))
      fi
      echo "System clock offset ${system_clock_offset} ns"
      if [ "${system_clock_offset}" -gt "${threshold_ns}" ]; then
          echo "System clock is not yet synchronized..."
          continue
      fi
      # Success
      break
  done
}

check_sync_ptp4l() {
  local threshold_ns=100
  local phc_offset=

  while :; do
      sleep 1

      phc_offset=$(scrape_logs_for_ptp4l_offset)

      # Got something, is it a number?
      case "${phc_offset}" in
      ''|[!\-][!0-9]*)
          if ! pidof ptp4l > /dev/null; then
              echo "Please start the ptp4l service."
              return 1
          else
              echo "No message from ptp4l, trying again..."
              continue
          fi
          ;;
      esac

      echo "Master offset ${phc_offset} ns"
      if [ "${phc_offset}" -lt 0 ]; then
          phc_offset=$((-${phc_offset}))
      fi
      if [ "${phc_offset}" -gt "${threshold_ns}" ]; then
          echo "PTP clock is not yet synchronized..."
          continue
      fi
      # Success
      break
  done
}

check_sync_ptp4l
check_sync_phc2sys

ptp_start.sh:

#!/bin/bash

set -e -u -o pipefail

board=$@

schedule() {
  taskset $((1 << 0)) chrt -r 20 $@
}

restart_ptp() {
  # Start PTP time from 2021 and reset frequency
  killall ptp4l phc2sys || :

  case ${board} in
  1)
      schedule ptp4l -i eth0 -i eth1 -f /etc/ptp4l_cfg/gPTP.cfg \
          --gmCapable 0 --tx_timestamp_timeout 50 \
          --step_threshold 0.00002 --first_step_threshold 0.00002 &
      sleep 1
      schedule phc2sys -a -rr --transportSpecific 0x1 \
          --step_threshold 0.00002 --first_step_threshold 0.00002 &
      ;;
  2)
      # Set initial time somewhere in 2021 and run with that
      phc_ctl CLOCK_REALTIME set 1612270628.000000000 && phc_ctl CLOCK_REALTIME freq 0
      phc_ctl /dev/ptp1 set 1612270628.000000000 && phc_ctl /dev/ptp1 freq 0

      schedule ptp4l -i swp2 -i swp3 -f /etc/ptp4l_cfg/gPTP.cfg \
          --tx_timestamp_timeout 50 \
          --step_threshold 0.00002 --first_step_threshold 0.00002 &
      sleep 1
      schedule phc2sys -a -rr --transportSpecific 0x1 \
          --step_threshold 0.00002 --first_step_threshold 0.00002 &
      ;;
  *)
      echo "Usage: ptp_start.sh 1|2"
      exit 1
  esac
}

restart_ptp
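
The initial time that ptp_start.sh writes with phc_ctl is a UNIX epoch value. To see which calendar date it corresponds to, something like the following could be used. This is only a sketch that assumes GNU date (the `-d @SECONDS` form); `epoch_to_utc` is a name chosen here for illustration and is not part of the scripts.

```shell
#!/bin/bash

# Convert "seconds.nanoseconds" (as passed to phc_ctl set) to a UTC date.
# The fractional part is dropped because date -d @N expects whole seconds here.
epoch_to_utc() {
  date -u -d "@${1%.*}" '+%Y-%m-%dT%H:%M:%SZ'
}

epoch_to_utc 1612270628.000000000   # prints 2021-02-02T12:57:08Z
```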

test_taprio_2port.sh:

#!/bin/bash

export TOPDIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)

do_cleanup() {
  echo "Cleaning up"
  killall ptp4l phc2sys
  ip link del dev br0
  tc qdisc del dev swp2 clsact
  tc qdisc del dev swp3 clsact
  tc qdisc del dev swp3 root
}
trap do_cleanup EXIT

systemctl disable --now systemd-networkd
systemctl disable --now ptp4l
systemctl disable --now phc2sys
# No NTP
systemctl disable --now systemd-timesyncd

devlink dev param set spi/spi0.1 name best_effort_vlan_filtering value true cmode runtime
ip link del br0

ip link set eth2 up
ip link set swp2 up
ip link set swp3 up
ip link set swp4 up
ip link set swp5 up

ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp3 master br0
ip link set swp4 master br0
ip link set swp5 master br0
ip link set br0 up
bridge vlan add dev swp2 vid 100 master
bridge vlan add dev swp3 vid 100 master
bridge vlan add dev br0 vid 100 self
tc qdisc add dev swp2 clsact
tc qdisc add dev swp3 clsact
tc qdisc add dev swp4 clsact
tc qdisc add dev swp5 clsact
tc qdisc replace dev swp3 parent root handle 100 taprio \
  num_tc 8 \
  map 0 1 2 3 4 5 6 7 \
  queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
  base-time 0 \
  sched-entry S 81 50000000 sched-entry S 82 50000000 \
  sched-entry S 84 50000000 sched-entry S 88 50000000 \
  sched-entry S 90 50000000 sched-entry S a0 50000000 \
  sched-entry S c0 50000000 sched-entry S 80 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  sched-entry S 81 50000000 sched-entry S 81 50000000 \
  flags 0x2
${TOPDIR}/ptp_start.sh 2
echo "Config done, waiting to be killed with Ctrl-C when test is done"
read

test_taprio_sender.sh:

#!/bin/bash

export TOPDIR=$(cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)

isochron() {
  taskset $((1 << 1)) ${TOPDIR}/isochron $@
}

do_cleanup() {
  echo "Cleaning up"
  killall ptp4l phc2sys
  killall isochron
}
trap do_cleanup EXIT

systemctl disable --now systemd-networkd
systemctl disable --now ptp4l
systemctl disable --now phc2sys
# No NTP
systemctl disable --now systemd-timesyncd
ip link del br0

ip link set eth0 up
ip link set eth1 up

${TOPDIR}/ptp_start.sh 1
${TOPDIR}/check_sync.sh

dmac=$(ip link show eth1 | awk '/link\/ether/ { print $2 }')

isochron rcv \
  --interface eth1 \
  --sched-rr \
  --sched-priority 98 \
  --etype 0xdead &

# base-time offset is aligned with the "a0" gate entry
isochron send \
  --interface eth0 \
  --dmac ${dmac} \
  --priority 5 \
  --base-time    0.300000000 \
  --advance-time 0.000000000 \
  --cycle-time   1.000000000 \
  --num-frames 10 \
  --frame-size 64 \
  --vid 100 \
  --client 127.0.0.1 \
  --etype 0xdead \
  --sched-rr \
  --sched-priority 98
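
For reference, each taprio sched-entry gate mask is a hexadecimal bitmask in which bit N stands for traffic class N. The helper below is a small sketch for decoding the masks used in test_taprio_2port.sh; `open_tcs` is a hypothetical name, not something the scripts provide.

```shell
#!/bin/bash

# Print the traffic classes whose gates are open for a hex gate mask.
open_tcs() {
  local mask=$((16#$1)) tc out=""
  for tc in 0 1 2 3 4 5 6 7; do
    if (( mask & (1 << tc) )); then
      out+="${out:+ }${tc}"
    fi
  done
  echo "${out}"
}

open_tcs a0   # prints "5 7"
open_tcs 81   # prints "0 7"
```

The "a0" entry therefore opens traffic classes 5 and 7, which matches the `--priority 5` that the isochron sender uses.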

The scripts should be run as follows:

# On board B:
./test_taprio_2port.sh
# On board A:
./test_taprio_sender.sh

By the way, if you do end up actually running this test, you'll have to compile the isochron program from source, from the master branch, since some of the options were newly added: https://github.com/vladimiroltean/tsn-scripts

Just FYI, when packets are sent in-band with the schedule you get the following isochron report:

       Now: 1612270806.516733485
 Base time: 1612270808.300000000
Cycle time: 1.000000000
Collecting receiver stats
Accepted connection from 127.0.0.1
seqid 1 gate 1612270809.300000000 wakeup 1612270808.300037279 tx 1612270808.300118588 rx 1612270808.300122778 arrival 1612270808.300291857
seqid 2 gate 1612270810.300000000 wakeup 1612270809.300035836 tx 1612270809.300098918 rx 1612270809.300103043 arrival 1612270809.300241291
seqid 3 gate 1612270811.300000000 wakeup 1612270810.300043384 tx 1612270810.300106683 rx 1612270810.300110853 arrival 1612270810.300248119
seqid 4 gate 1612270812.300000000 wakeup 1612270811.300037082 tx 1612270811.300099688 rx 1612270811.300103868 arrival 1612270811.300242537
seqid 5 gate 1612270813.300000000 wakeup 1612270812.300042405 tx 1612270812.300104683 rx 1612270812.300108868 arrival 1612270812.300244900
seqid 6 gate 1612270814.300000000 wakeup 1612270813.300036355 tx 1612270813.300098728 rx 1612270813.300102858 arrival 1612270813.300238850
seqid 7 gate 1612270815.300000000 wakeup 1612270814.300041639 tx 1612270814.300104813 rx 1612270814.300108953 arrival 1612270814.300246934
seqid 8 gate 1612270816.300000000 wakeup 1612270815.300036135 tx 1612270815.300099393 rx 1612270815.300103528 arrival 1612270815.300240389
seqid 9 gate 1612270817.300000000 wakeup 1612270816.300035708 tx 1612270816.300098818 rx 1612270816.300103013 arrival 1612270816.300236442
seqid 10 gate 1612270818.300000000 wakeup 1612270817.300037335 tx 1612270817.300099443 rx 1612270817.300103618 arrival 1612270817.300225189

Summary:
Path delay: min 4125 max 4195 mean 4162.500 stddev 25.617, min at seqid 2, max at seqid 9
Wakeup to HW TX timestamp: min 62108 max 81309 mean 64659.700 stddev 5565.463, min at seqid 10, max at seqid 1
Latency budget: min 999877222 max 999897142 mean 999892862.000 stddev 5940.677, min at seqid 1, max at seqid 6
Wakeup latency: min 35708 max 43384 mean 38315.800 stddev 2802.886, min at seqid 9, max at seqid 3
Arrival latency (HW RX timestamp to application): min 121571 max 169079 mean 138512.800 stddev 11244.249, min at seqid 10, max at seqid 1
HW TX deadline misses: 0 (0.000%)
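
The "Path delay" figures in the summary appear to be the hardware RX timestamp minus the hardware TX timestamp of each packet; that at least reproduces the reported min and max for seqid 2 and seqid 9. The sketch below redoes that arithmetic in bash. It converts the timestamps to integer nanoseconds because a double cannot hold all 19 digits; `to_ns` and `path_delay_ns` are names invented for this illustration.

```shell
#!/bin/bash

# Convert "seconds.nanoseconds" (9 fractional digits) to integer nanoseconds.
# 10# forces base 10 so a fraction with leading zeros is not read as octal.
to_ns() {
  local sec=${1%.*} frac=${1#*.}
  echo $(( sec * 1000000000 + 10#${frac} ))
}

# path_delay_ns TX RX: difference between the two hardware timestamps.
path_delay_ns() {
  echo $(( $(to_ns "$2") - $(to_ns "$1") ))
}

path_delay_ns 1612270809.300098918 1612270809.300103043   # seqid 2: 4125
path_delay_ns 1612270816.300098818 1612270816.300103013   # seqid 9: 4195
```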

The reason why I posted these scripts is because it took me a while to get something going in terms of starting isochron only once the PTP clocks are synchronized. The scripts I came up with are kind of nasty, but they do work - I tested them on OpenIL 1.10 for LS1021A-TSN a few days ago. I also opened this discussion thread about having a better solution to wait until ptp4l and phc2sys are synchronized, but that didn't lead anywhere yet: https://sourceforge.net/p/linuxptp/mailman/linuxptp-users/thread/20210112180601.zcftc7kmjc3m4v5l%40skbuf/#msg37196011
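
As an illustration of the same "wait until synchronized" idea, here is a sketch of a polling loop that gives up after a bounded number of tries and rejects anything that is not a plain integer. It is not part of the posted scripts: `is_integer`, `wait_for_offset` and the `POLL_INTERVAL` variable are hypothetical names, and the offset is expected to come from any command that prints a signed nanosecond value (for example one of the scrape_logs_* functions from check_sync.sh).

```shell
#!/bin/bash

# Succeed only for an optionally negative decimal integer.
is_integer() {
  case "${1#-}" in
  ''|*[!0-9]*) return 1 ;;
  *) return 0 ;;
  esac
}

# wait_for_offset THRESHOLD_NS MAX_TRIES CMD [ARGS...]
# Runs CMD until it prints an offset whose magnitude is within the
# threshold. Returns 0 on success, 1 if MAX_TRIES is exhausted.
wait_for_offset() {
  local threshold_ns=$1 max_tries=$2
  shift 2
  local offset try

  for ((try = 1; try <= max_tries; try++)); do
    offset=$("$@" 2>/dev/null) || offset=
    if is_integer "${offset}"; then
      offset=${offset#-}   # absolute value
      if [ "${offset}" -le "${threshold_ns}" ]; then
        return 0
      fi
    fi
    sleep "${POLL_INTERVAL:-1}"
  done
  return 1
}
```

A caller could then write, for instance, `wait_for_offset 100 60 scrape_logs_for_ptp4l_offset || exit 1` before starting isochron.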

Thank you for the tests. First of all, I am going to check these commands:

schedule ptp4l -i eth0 -i eth1 -f /etc/ptp4l_cfg/gPTP.cfg \
    --gmCapable 0 --tx_timestamp_timeout 50 \
    --step_threshold 0.00002 --first_step_threshold 0.00002 &
sleep 1
schedule phc2sys -a -rr --transportSpecific 0x1 \
    --step_threshold 0.00002 --first_step_threshold 0.00002 &

Question!

phc_ctl CLOCK_REALTIME set 1612270628.000000000 && phc_ctl CLOCK_REALTIME freq 0
phc_ctl /dev/ptp1 set 1612270628.000000000 && phc_ctl /dev/ptp1 freq 0

And why do you set the same time to CLOCK_REALTIME and /dev/ptp1 if they will be the same when you use phc2sys a little below?

What does "freq 0" mean?

BTW, when we watch the log while the network is synchronized, it shows rms, max offset, freq... What does freq mean? I can see that each device has different values. Is it not the frequency at which the offset is checked/synced/polled? Should it not be the same for everyone?

I want the slave clocks to never go back in time. Never. How would I do that? With --step_threshold? It is 0.0 by default.

The step_threshold option just enables clock stepping (time jumps) which can be either forward or backwards - no way to control it. Clock stepping is performed when the offset between the slave clock and the grandmaster is higher than this threshold. Which brings me to my main point: slave clocks follow the time of the grandmaster. So if you don't want slave clocks to go back in time, just stop making the grandmaster go back in time. If you have some kind of script which resets the time on the GM after each test, then stop doing that. Just set the time on the GM once.
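
To make the threshold behaviour concrete, here is a deliberately simplified model of the decision, not linuxptp's actual servo code. It assumes offsets and thresholds expressed in nanoseconds (so 0.00002 s becomes 20000) and treats a threshold of 0 as "never step", which is how the linuxptp documentation describes the default outside of the first update. `servo_action` is a made-up name.

```shell
#!/bin/bash

# servo_action OFFSET_NS THRESHOLD_NS
# Prints "step" when the magnitude of the offset exceeds a non-zero
# threshold, otherwise "slew" (correct by adjusting the frequency).
servo_action() {
  local offset_ns=$1 threshold_ns=$2
  local magnitude=${offset_ns#-}   # the sign does not matter for the decision

  if [ "${threshold_ns}" -gt 0 ] && [ "${magnitude}" -gt "${threshold_ns}" ]; then
    echo step
  else
    echo slew
  fi
}

servo_action 2000000000 20000    # 2 s off, 20 us threshold: step
servo_action -2000000000 20000   # same magnitude, other direction: step
servo_action 15 20000            # 15 ns off: slew
servo_action 2000000000 0        # stepping disabled: slew
```

The sign of the offset never enters the decision, which is why the option cannot be used to allow forward jumps while forbidding backward ones.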

What I meant was that if at some moment the slave is ahead of the master (e.g. master 16:00:00, slave 16:00:02), the slave should wait until the master reaches it and then synchronize as usual, not jump backward.

I am using OpenIL version 1.8 with its bundled linuxptp version 2.0 on the LS1021ATSN platform, and linuxptp version 3.1 on the boards. Could this mismatch cause some bad behaviour?

There have been some fixes in upstream linuxptp since release tag 2.0, however nothing stands out as obviously relevant to me right now. By the way, OpenIL 1.10 integrates linuxptp 3.1.

In the latest version of OpenIL, have you fixed bugs for the LS1021ATSN platform? Do you recommend that I update my OpenIL OS?

Here is the list of kernel patches between release tags OpenIL-v1.8-linux-202005 and OpenIL-v1.10-linux-202012, sorted in reverse chronological order (with some irrelevant/trivial patches removed):

0bd5363abc7c net: dsa: sja1105: poll for extts events from a timer
02d9d4085ff5 net: dsa: sja1105: fix tc-gate schedule with single element
ab9e89b5f93e net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
cc38999af796 net: dsa: sja1105: unconditionally free old gating config
8ff5cd39f001 net: dsa: sja1105: fix checks for VLAN state in gate action
4f24e2421cff net: dsa: sja1105: fix checks for VLAN state in redirect action
ac7f4b81bacc net: dsa: tag_8021q: stop restoring VLANs from bridge
4565567af0c2 net: dsa: sja1105: enable internal pull-down for RX_DV/CRS_DV/RX_CTL and RX_ER
0dad6b43afd8 net: dsa: sja1105: fix port mirroring for P/Q/R/S
adc93ad9ad0e net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
9b51bf97c1af net: dsa: sja1105: offload the Credit-Based Shaper qdisc
da1075c01423 net: dsa: sja1105: request promiscuous mode for master
4976214678b3 net: dsa: allow drivers to request promiscuous mode on master
359e3150df30 net: dsa: sja1105: avoid invalid state in sja1105_vlan_filtering
2eb54d78875a docs: net: dsa: sja1105: document the best_effort_vlan_filtering option
ab3838b00166 net: dsa: sja1105: implement VLAN retagging for dsa_8021q sub-VLANs
841daf26e3f7 net: dsa: sja1105: implement a common frame memory partitioning function
a6058bad75ff net: dsa: sja1105: add packing ops for the Retagging Table
7441ca14aec9 net: dsa: sja1105: add a new best_effort_vlan_filtering devlink parameter
b7be3a8cd9a0 net: dsa: tag_sja1105: implement sub-VLAN decoding
4e14879736f3 net: dsa: tag_8021q: support up to 8 VLANs per port using sub-VLANs
9feda4edb6bb net: dsa: sja1105: prepare tagger for handling DSA tags and VLAN simultaneously
fcfe288ffc79 net: dsa: sja1105: exit sja1105_vlan_filtering when called multiple times
2e304da8f934 net: dsa: sja1105: save/restore VLANs using a delta commit method
3bd72c75802c net: dsa: sja1105: deny alterations of dsa_8021q VLANs from the bridge
3382628c4d2f net: dsa: sja1105: keep the VLAN awareness state in a driver variable
a0713e3365d0 net: dsa: tag_8021q: introduce a vid_is_dsa_8021q helper
aac931ec928b net: dsa: sja1105: Revert pre-mainline implementation of best_effort_vlan_filtering

So yes, there are some changes. In terms of bug fixes this is probably the most relevant; it first appeared in OpenIL 1.9:

adc93ad9ad0e net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles

BTW, could you tell me the difference between the configs (nxp_ls1021atsn_defconfig, nxp_ls1021atsn_ubuntu_defconfig and nxp_ls1021atsn_ubuntu_full_defconfig) for the LS1021ATSN when building the sdcard.img file? I ask because I want to compile the OS again, and in the manual these configs are new and their description is confusing.

Finally, I can tell you that if I synchronize the LS1021ATSN platform as a master and one board as a slave, the sync is not lost when I run my tests. It seems to happen only when the LS1021ATSN is a bridge. I have to check with the parameters you suggested (the code below). Previously, I only used "--tx_timestamp_timeout 20".

--gmCapable 0 --tx_timestamp_timeout 50 \
    --step_threshold 0.00002 --first_step_threshold 0.00002

diegotxegp commented 3 years ago

Using the commands I could see in your code:

[NODE 2]

schedule ptp4l -i enp3s0 -f /etc/ptp4l_cfg/gPTP.cfg \
    --gmCapable 1 --tx_timestamp_timeout 50 \
    --step_threshold 0.00002 --first_step_threshold 0.00002

[LS1021ATSN]

phc_ctl CLOCK_REALTIME set 1612270628.000000000 && phc_ctl CLOCK_REALTIME freq 0
phc_ctl /dev/ptp1 set 1612270628.000000000 && phc_ctl /dev/ptp1 freq 0

schedule ptp4l -i swp2 -i swp3 -f /etc/ptp4l_cfg/gPTP.cfg \
    --gmCapable 0 --tx_timestamp_timeout 50 \
    --step_threshold 0.00002 --first_step_threshold 0.00002

[NODE 1]

schedule ptp4l -i enp3s0 -f /etc/ptp4l_cfg/gPTP.cfg \
    --gmCapable 0 --tx_timestamp_timeout 50 \
    --step_threshold 0.00002 --first_step_threshold 0.00002

And trying to change the values a little...

--tx_timestamp_timeout 1000 --step_threshold 1.0 --first_step_threshold 1.0

Still the problem happens. Everything works fine until I run my test where I do 1 million accesses to the CLOCK_REALTIME. When I change --tx_timestamp_timeout to 10000 instead of 50, one message appears with something like "Clockcheck: clock jumped backward or running slower than expected!"

ptp4l[1409.536]: rms 6 max 13 freq +18913 +/- 8 delay 234 +/- 0
ptp4l[1410.538]: rms 6 max 10 freq +18901 +/- 5 delay 234 +/- 0
ptp4l[1410.636]: timed out while polling for tx timestamp
ptp4l[1410.637]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[1410.637]: port 1 (enp3s0): send peer delay request failed
ptp4l[1410.637]: port 1 (enp3s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[1426.860]: port 1 (enp3s0): FAULTY to LISTENING on INIT_COMPLETE
ptp4l[1429.963]: port 1 (enp3s0): new foreign master 00049f.fffe.ef0808-1
ptp4l[1430.226]: port 1 (enp3s0): LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[1430.226]: selected local clock aabbcc.fffe.00094e as best master
ptp4l[1430.226]: port 1 (enp3s0): assuming the grand master role
ptp4l[1431.964]: selected best master clock aabbcc.fffe.00094a
ptp4l[1431.964]: port 1 (enp3s0): MASTER to UNCALIBRATED on RS_SLAVE
ptp4l[1432.953]: port 1 (enp3s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[1433.830]: rms 38 max 65 freq +18837 +/- 19 delay 235 +/- 0
ptp4l[1434.833]: rms 8 max 15 freq +18869 +/- 11 delay 235 +/- 0

Thank you for your help, Vladimir.

diegotxegp commented 3 years ago

Excuse me, Vladimir. Do you have any published paper evaluating the LS1021ATSN?

vladimiroltean commented 3 years ago

Still the problem happens. Everything works fine until I run my test where I do 1 million accesses to the CLOCK_REALTIME. When I change --tx_timestamp_timeout to 10000 instead of 50, one message appears with something like "Clockcheck: clock jumped backward or running slower than expected!"

Could you show me how to reproduce the 1 million accesses to CLOCK_REALTIME?

Do you have any published paper evaluating the LS1021ATSN?

Nope, the documentation in the OpenIL user manual at https://openil.org/guide_list.html is all there is. If there's any analysis in particular that you would like to see, I guess you could ask that it gets included in next versions.

diegotxegp commented 3 years ago

Still the problem happens. Everything works fine until I run my test where I do 1 million accesses to the CLOCK_REALTIME. When I change --tx_timestamp_timeout to 10000 instead of 50, one message appears with something like "Clockcheck: clock jumped backward or running slower than expected!"

Could you show me how to reproduce the 1 million accesses to CLOCK_REALTIME?

I discovered that most of the fault lies in the print-to-file step. After filling an array with one million readings of CLOCK_MONOTONIC and CLOCK_REALTIME, the ptp4l communication fails when I write all that information to a file. I am trying to rewrite the code to open and close the file more properly.

Do you have any published paper evaluating the LS1021ATSN?

Nope, the documentation in the OpenIL user manual at https://openil.org/guide_list.html is all there is. If there's any analysis in particular that you would like to see, I guess you could ask that it gets included in next versions.

Can I ask for it? To whom?

diegotxegp commented 3 years ago

Ah. And what is the difference among the versions of OpenIL: ls1021atsn_defconfig, ls1021atsn_ubuntu_defconfig and ls1021atsn_ubuntu_full_defconfig?

vladimiroltean commented 3 years ago

Can I ask for it? To whom?

Just ask here.

what is the difference among the versions of OpenIL: ls1021atsn_defconfig, ls1021atsn_ubuntu_defconfig and ls1021atsn_ubuntu_full_defconfig?

nxp_ls1021atsn_defconfig uses packages compiled from source using the native Buildroot makefiles.

nxp_ls1021atsn_ubuntu_defconfig and nxp_ls1021atsn_ubuntu_full_defconfig use a minimal rootfs skeleton with just the essential packages compiled by Buildroot, assembled together with an Ubuntu 18.04 rootfs for armv7 which includes the apt package manager. The full defconfig contains more apt packages preinstalled than the normal one.

In general I would recommend using the default nxp_ls1021atsn_defconfig build and using menuconfig there to build all the packages that you need. The building for the Ubuntu filesystems has been a bit flaky in the past.
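
Since OpenIL is based on Buildroot, that workflow would presumably follow the usual Buildroot sequence of loading a defconfig, adjusting it in menuconfig and building. The wrapper below is only a sketch of that sequence, to be run from the top of an OpenIL source tree; `build_openil` is a hypothetical helper, and the `MAKE` override exists only so the steps can be previewed without building anything.

```shell
#!/bin/bash

# build_openil [DEFCONFIG]
# Runs the standard Buildroot steps; set MAKE=echo to preview them.
build_openil() {
  local defconfig=${1:-nxp_ls1021atsn_defconfig}
  local make_cmd=${MAKE:-make}

  # Load the board configuration.
  "${make_cmd}" "${defconfig}" || return
  # Interactively enable any extra packages that are needed.
  "${make_cmd}" menuconfig || return
  # Build the toolchain, packages and images.
  "${make_cmd}"
}
```

Running `MAKE=echo build_openil` prints the make targets in order, while a plain `build_openil` performs the actual build. With a stock Buildroot layout the resulting images normally end up under output/images/.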

diegotxegp commented 3 years ago

Thank you so much for your help, Vladimir.