Closed Jookia closed 1 year ago
I did some thinking about this while trying to sleep.
Using this gambling calculator here: http://www.beatingbonuses.com/calc_streak.htm
It took 4700 or so seconds to lose 4 clocks. A random packet loss of 15% would give me a 100% probability of this happening in the 1.5 hours I was running the printer. The actual loss is a lot smaller as this accumulates over multiple runs. Over a 12 hour period you need a random packet loss of 10% to hit this issue.
Over a year you would need a packet loss of 4% to hit this issue.
Looking at other retry commands we see that RetryAsyncCommand tries ~10 times, SerialRetryCommand tries ~6 times. We can plug these numbers in to get similar results.
With a 20% packet loss and 10 retries it would take a year to have a 10% chance of losing all retry attempts, and with 10% packet loss it would give a 0.01% chance.
Looking through the code:
All of this assumes the packet loss is randomly distributed. I have no reason to believe this, so the next step should be to gather some kind of DATA on this.
Ideally I would like to know:
All of this information comes from the host so I will probably need to instrument/log:
Then probably plot it or something.
I replaced Klipper's serial handling with code that just echos back what's sent. I've been running a program that sends random data of random lengths to the printer and checks what it gets back. So far I haven't had ANY errors.
Even when scoping the printer as it printed I didn't get significant noise that could even come close to causing a mis-read bit. As far as I'm concerned this has killed my transmission noise theory.
So the next question is: Is the MCU dropping the get_clock command sent to it, or is it dropping the clock response it sends?
We can tell the first by instrumenting the Klipper daemon process to log sends, receives, and re-sends, and also log re-tries of RPC.
I've been debugging this on and off for a few days now. I've done a lot of instrumentation with some very basic analysis. This week I will decide on a fairly long test print and test the following three variables conditionally to see what helps the most:
My working theory of what's happening here is that a low baud of 115200 or 57600 is bottlenecking the printer's responsiveness to TMC UART commands, and we send an excessive amount in the tmctrigorilla driver. This could fill up the command buffers in the device.
After a lot of research I've found that the problem here was the baud rate. I had to set it to 380400 to fix this issue.
Currently with my terrible UART I've managed to hit this situation twice when the clock tries to sync.
Losing 4 separate packets in a row is strange behaviour. But for something critical like clock synchronization we should probably try harder.
Increasing the clock synchronizations per second would work, but ruin debug logs as the backlog queue would just be clock garbage.
Maybe try 10 times a second unless we've had a response in the past 1 second?