jclark / rpi-cm4-ptp-guide

Guide to using the hardware PTP support in the Raspberry Pi CM4
MIT License

Wrong time after cable unplugged #3

Open jclark opened 2 years ago

jclark commented 2 years ago

If chrony with a PHC refclock and ts2phc are both running and the cable is unplugged for a time, the PHC will end up with the wrong time (many seconds out).

I have observed this with ts2phc -s generic. I am not sure if it happens with ts2phc -s nmea.

Attached is a log showing what happens.

ptp-unplug.log

jclark commented 2 years ago

The kernel gets into a state where it delivers 4 extts events in a second:

Nov  7 20:09:45 ricotta ts2phc: [1911220.240] eth0 extts index 0 at 1667826586.175084690 corr 0 src 1667826622.93314787 diff -35824915310
Nov  7 20:09:45 ricotta ts2phc: [1911220.240] eth0 master offset -35824915310 s2 freq -100000000
Nov  7 20:09:45 ricotta ts2phc: [1911220.492] eth0 extts index 0 at 1667826586.175084690 corr 0 src 1667826622.345345651 diff -35824915310
Nov  7 20:09:45 ricotta ts2phc: [1911220.492] eth0 master offset -35824915310 s2 freq -100000000
Nov  7 20:09:45 ricotta ts2phc: [1911220.744] eth0 extts index 0 at 1667826586.175084690 corr 0 src 1667826623.597359645 diff -36824915310
Nov  7 20:09:45 ricotta ts2phc: [1911220.744] eth0 master offset -36824915310 s2 freq -100000000
Nov  7 20:09:45 ricotta ts2phc: [1911220.996] eth0 extts index 0 at 1667826586.175084690 corr 0 src 1667826623.849334398 diff -36824915310
Nov  7 20:09:45 ricotta ts2phc: [1911220.996] eth0 master offset -36824915310 s2 freq -100000000

Not surprisingly, this confuses ts2phc.
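One way to spot this state in a longer log is to look for the same extts timestamp being reported more than once. Below is a minimal Python sketch. It assumes the ts2phc log line format shown above, and the regex and the `repeated_extts` helper are hypothetical, not part of linuxptp.

```python
import re
from collections import Counter

# Matches ts2phc lines like the ones above, e.g.
# "... eth0 extts index 0 at 1667826586.175084690 corr 0 src 1667826622.93314787 diff -35824915310"
EXTTS_RE = re.compile(
    r"(?P<iface>\S+) extts index (?P<index>\d+) at (?P<ts>\d+\.\d+) "
    r"corr (?P<corr>-?\d+) src (?P<src>\d+\.\d+) diff (?P<diff>-?\d+)"
)


def repeated_extts(lines):
    """Return {(iface, index, timestamp): count} for extts timestamps seen more than once."""
    counts = Counter()
    for line in lines:
        m = EXTTS_RE.search(line)
        if m:  # "master offset" lines and other output do not match
            counts[(m["iface"], int(m["index"]), m["ts"])] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Fed the eight log lines above, it returns `{('eth0', 0, '1667826586.175084690'): 4}`. A healthy 1PPS input should never produce a count above 1.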

My hypothesis is that timestamps that were explicitly requested in order to read the hardware clock are being confused with extts timestamps. The driver schedules work 4 times a second to check whether there is an extts timestamp to report. It finds one every time, because explicitly requested timestamps are left over: they did not arrive in time (because there was no carrier).

It should be possible to use bpftrace to verify this hypothesis.
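A toy model may make the hypothesis more concrete. This is not the driver's code. The single shared latch, the `pending` flag, the assumption that a clock read simply gives up when there is no carrier, and the assumption that a clock read (e.g. by chrony) happens between every poll are all illustrative assumptions. The model only reproduces the event rate (one spurious event per 250 ms poll), not the repeated timestamp value seen in the log.

```python
from dataclasses import dataclass
from typing import List, Optional

POLL_HZ = 4                       # the driver checks for extts events 4 times a second
TICK_NS = 1_000_000_000 // POLL_HZ


@dataclass
class ToyTimestamper:
    """Hypothetical model: one latched timestamp plus one 'pending' flag,
    shared by explicit clock reads and external (1PPS) captures."""
    latched_ns: Optional[int] = None
    pending: bool = False

    def capture(self, t_ns: int) -> None:
        # Hardware latches a timestamp; the flag does not record why.
        self.latched_ns = t_ns
        self.pending = True

    def read_clock(self, t_ns: int, carrier: bool) -> Optional[int]:
        # Explicit clock read: request a capture and consume it if it arrives
        # in time. Without carrier it is assumed to arrive too late, so the
        # read gives up and the capture is left pending.
        self.capture(t_ns)
        if carrier:
            self.pending = False
            return self.latched_ns
        return None

    def poll_extts(self) -> Optional[int]:
        # Periodic extts check: anything pending is reported as an extts event.
        if self.pending:
            self.pending = False
            return self.latched_ns
        return None


def simulate(seconds: int, carrier: bool) -> List[int]:
    """Return the timestamps that the poll reports as extts events."""
    ts = ToyTimestamper()
    events: List[int] = []
    for tick in range(seconds * POLL_HZ):
        t = tick * TICK_NS
        if tick % POLL_HZ == 0:
            ts.capture(t)              # genuine 1PPS edge, once per second
        ev = ts.poll_extts()
        if ev is not None:
            events.append(ev)
        ts.read_clock(t + 1, carrier)  # clock read between polls
    return events
```

With carrier, `simulate(3, carrier=True)` reports one event per second. Without carrier, `simulate(3, carrier=False)` reports 12 events, i.e. 4 per second, which is the rate seen in the log.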

jclark commented 2 years ago

This is exacerbated by the fact that the default value for step_threshold is 0, which means that ts2phc won't step the clock to correct the bad time.

We can alleviate this by setting something like step_threshold 0.9, which allows ts2phc to recover quickly if something goes badly wrong.
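The effect of the setting can be sketched as follows. This is a simplified model of the step-vs-slew decision, assuming (as with the linuxptp PI servo) that `step_threshold` is given in seconds and that 0 disables stepping. It ignores `first_step_threshold` and the servo's other states, and the helper names are hypothetical.

```python
NS_PER_S = 1_000_000_000


def servo_action(offset_ns: float, step_threshold_s: float) -> str:
    """Decide whether a locked servo steps or slews (simplified sketch)."""
    # A threshold of 0 disables stepping entirely, so every offset is slewed.
    if step_threshold_s > 0 and abs(offset_ns) > step_threshold_s * NS_PER_S:
        return "step"
    return "slew"


def min_slew_seconds(offset_ns: float, freq_ppb: float) -> float:
    """Lower bound on the time needed to slew away an offset.

    Slewing at freq_ppb parts per billion removes freq_ppb ns per second.
    """
    return abs(offset_ns) / abs(freq_ppb)
```

For the -35824915310 ns offset in the log, `servo_action(-35824915310, 0)` gives `"slew"`, while a threshold of 0.9 gives `"step"`. Even at the -100000000 ppb shown in the log, `min_slew_seconds(-35824915310, -100000000)` is about 358 s, roughly six minutes, and that only starts once the extts timestamps are sane again.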