Improve clock retransmissions

Jookia commented 1 year ago

Currently with my terrible UART I've managed to hit this situation twice when the clock tries to sync.

Sent 39 4698.500472 4698.499430 6: seq: 1b, get_clock <- gets a response
Sent 49 4699.484563 4699.483270 6: seq: 15, get_clock <- no response
Sent 69 4700.488047 4700.485170 6: seq: 19, get_clock <- no response
Sent 79 4701.495409 4701.491140 6: seq: 13, get_clock <- no response
Sent 90 4702.493967 4702.492063 6: seq: 1e, get_clock <- no response
Sent 98 4703.477554 4703.476512 6: seq: 16, get_clock <- response, but it's too late

Losing 4 separate packets in a row is strange behaviour. But for something critical like clock synchronization we should probably try harder.

Increasing the clock synchronizations per second would work, but ruin debug logs as the backlog queue would just be clock garbage.

Maybe try 10 times a second unless we've had a response in the past 1 second?

Jookia commented 1 year ago

I did some thinking about this while trying to sleep.

Using this gambling calculator here: http://www.beatingbonuses.com/calc_streak.htm

It took 4700 or so seconds to lose 4 clocks. A random packet loss of 15% would give me a 100% probability of this happening in the 1.5 hours I was running the printer. The actual loss is a lot smaller as this accumulates over multiple runs. Over a 12 hour period you need a random packet loss of 10% to hit this issue.

Over a year you would need a packet loss of 4% to hit this issue.

Looking at other retry commands we see that RetryAsyncCommand tries ~10 times, SerialRetryCommand tries ~6 times. We can plug these numbers in to get similar results.

With a 20% packet loss and 10 retries it would take a year to have a 10% chance of losing all retry attempts, and with 10% packet loss it would give a 0.01% chance.

Looking through the code:

Clock retry takes a second between retries for a total of 5 seconds.
RetryAsyncCommand takes 0.5 seconds between retries for a total of 5 seconds.
SerialRetryCommand takes 0.01 seconds between retries for a total of 0.5 seconds.

All of this assumes the packet loss is randomly distributed. I have no reason to believe this, so the next step should be to gather some kind of DATA on this.

Ideally I would like to know:

How often I have to retry a packet
What times these retries happen
Other activity around the packets?

All of this information comes from the host so I will probably need to instrument/log:

Packet send time
Whether the packet is a resend due to no ACK
Whether the packet is a resend due to no response in RPC

Then probably plot it or something.

Jookia commented 1 year ago

I replaced Klipper's serial handling with code that just echos back what's sent. I've been running a program that sends random data of random lengths to the printer and checks what it gets back. So far I haven't had ANY errors.

Even when scoping the printer as it printed I didn't get significant noise that could even come close to causing a mis-read bit. As far as I'm concerned this has killed my transmission noise theory.

So the next question is: Is the MCU dropping the get_clock command sent to it, or is it dropping the clock response it sends?

We can tell the first by instrumenting the Klipper daemon process to log sends, receives, and re-sends, and also log re-tries of RPC.

Jookia commented 1 year ago

I've been debugging this on and off for a few days now. I've done a lot of instrumentation with some very basic analysis. This week I will decide on a fairly long test print and test the following three variables conditionally to see what helps the most:

Faster baud
Bigger TX and RX buffers on the MCU
Less tmctrigorilla retries

My working theory of what's happening here is that a low baud of 115200 or 57600 is bottlenecking the printer's responsiveness to TMC UART commands, and we send an excessive amount in the tmctrigorilla driver. This could fill up the command buffers in the device.

Jookia commented 1 year ago

After a lot of research I've found that the problem here was the baud rate. I had to set it to 380400 to fix this issue.

printers-for-people / catboat

Improve clock retransmissions #20