Robot lurches while trying to balance #101

va3wam commented 3 years ago

Implement error counter logic as follows:

Create a reporting control variable with three settings: -1 = report each counter as it increments, in real time; 0 = never report counter increments unless specifically asked; any other value = the number of seconds between timed updates to the MQTT broker. Also, add an MQTT command that causes all health counters to be sent to the broker at once.
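
A minimal sketch of the reporting control described above, written as plain C++ so it can be tried off the robot. The names `HealthReporter`, `REPORT_REALTIME`, `REPORT_NEVER` and the three `reportOn...` methods are assumptions made for illustration; only the meaning of -1, 0 and "seconds between updates" comes from this issue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the issue only defines the meaning of the values.
constexpr int32_t REPORT_REALTIME = -1; // report every increment as it happens
constexpr int32_t REPORT_NEVER    = 0;  // report only when explicitly asked

struct HealthReporter {
    int32_t  mode = REPORT_NEVER; // -1, 0, or seconds between timed reports
    uint32_t lastReportMs = 0;    // time of the most recent timed report

    // Call right after a health counter increments.
    bool reportOnIncrement() const { return mode == REPORT_REALTIME; }

    // Call from the main loop with the current time in milliseconds.
    // Returns true when a timed report to the broker is due.
    bool reportOnTimer(uint32_t nowMs) {
        assert(mode >= REPORT_REALTIME); // values below -1 are not defined
        if (mode <= 0) return false;     // timed reporting is disabled
        const uint32_t intervalMs = static_cast<uint32_t>(mode) * 1000u;
        if (nowMs - lastReportMs < intervalMs) return false; // wrap-safe
        lastReportMs = nowMs;
        return true;
    }

    // The "send all health counters" MQTT command reports in every mode.
    bool reportOnCommand() const { return true; }
};
```

On the robot, `nowMs` would presumably come from `millis()`, and a `true` result would trigger the actual MQTT publish.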

nerdoug commented 3 years ago

I've added code to display the cumulative counters for fault interrupts from the left and right DRV8825 controllers. On my TWIPe clone, the right DRV is generating a fault about once a second (which is also the OLED refresh rate). The other DRV hasn't generated any faults yet. Right side faults are generated even if the bot isn't trying to balance. If you reset it while it's lying on its back, bring it up to 30 degrees from vertical where it clenches its wheels, then set it on its back again, the right side fault counter continues to climb.

Next steps:
- swap left and right DRV's and see what happens
- recheck Vref setting on both DRV's (should be 0.85 V)
- put a new DRV in to replace the one that's generating faults
- investigate what s/w is doing to controller while bot is lying on its back
- consider possible physical pressure by CPU console cable on DRV right below it

nerdoug commented 3 years ago

The right side DRV fault counter seems to increment whenever there is activity on either motor. I've verified Vref is 0.85 on both DRV's, swapped them and get same result: faults being counted for the right side DRV when either motor is activated. Need to verify the associations between DRV 1 & 2, left and right, and physical and GPIO pin numbers used for fault interrupts.

From circuit board silkscreen info, and continuity testing on the board:

DRV2 is the one closest to a corner of the circuit board
DRV2 connects to the motor on TWIPe's left (same as amber pushbutton)
(above are correctly documented in SB7D-stepper-wiring.odg)
DRV2's fault pin connects to GPIO pin 32, physical pin 20 on the CPU
DRV1's fault pin connects to GPIO pin 13, physical pin 25 on the CPU
(above is correct in huzzah32_pins.h)
(I've added more info to sb7D-pinouts-CPU.ods)

Reviewing the source code:

the gp_DRV2_FAULT GPIO pin is attached to the leftDRV8825fault ISR, which is correct
that ISR correctly increments health.leftDRVfault
the same holds for DRV1 / right as well
the right eye display routine (actually in updateLED()) outputs 3 numbers on the 4th line, separated by stars: percent time in MQTT routines, health.leftDRVfault, health.rightDRVfault
it's the last number that's incrementing, i.e. the one for DRV1, right side, GPIO pin 13, physical pin 25
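
The chain reviewed above (fault pin, then ISR, then counter, then display line) can be sketched in plain C++ as follows. This is a stand-in for checking the associations, not the project's code: `gp_DRV1_FAULT`, `rightDRV8825fault`, `isrForPin` and `faultLine` are assumed names, and on the robot the ISRs would be registered on the pins (for example with Arduino's `attachInterrupt`) rather than looked up and called by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pin numbers are from this thread. gp_DRV1_FAULT is an assumed name,
// chosen by analogy with gp_DRV2_FAULT.
constexpr int gp_DRV1_FAULT = 13; // right DRV8825, physical pin 25
constexpr int gp_DRV2_FAULT = 32; // left DRV8825, physical pin 20

struct Health {
    volatile uint32_t leftDRVfault  = 0;
    volatile uint32_t rightDRVfault = 0;
};
Health health;

// Stand-ins for the fault ISRs (rightDRV8825fault is an assumed name).
void leftDRV8825fault()  { health.leftDRVfault  = health.leftDRVfault  + 1; }
void rightDRV8825fault() { health.rightDRVfault = health.rightDRVfault + 1; }

// Which ISR belongs to which fault pin; nullptr for any other pin.
using Isr = void (*)();
Isr isrForPin(int gpio) {
    if (gpio == gp_DRV2_FAULT) return leftDRV8825fault;
    if (gpio == gp_DRV1_FAULT) return rightDRV8825fault;
    return nullptr;
}

// Builds the star-separated 4th display line:
// percent time in MQTT routines * left fault count * right fault count.
std::string faultLine(int mqttPercent) {
    assert(mqttPercent >= 0 && mqttPercent <= 100);
    const uint32_t left  = health.leftDRVfault;  // copy the volatile values once
    const uint32_t right = health.rightDRVfault;
    return std::to_string(mqttPercent) + "*" + std::to_string(left) + "*" +
           std::to_string(right);
}
```

With this mapping, an interrupt on GPIO 13 can only change the last number on the line, which matches what the display shows.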

So, the faults don't follow the physical DRV chip, but are always reported for DRV1. I don't see any software bugs that would cause incorrect fault counting. Thus I turn to hardware causes, and I see notations in several places that using GPIO 13 may conflict in some way with the onboard LED. I think 13 was used due to circuit board layout constraints before we started using a double-sided layout.

I'll investigate possible board mods to move the DRV1 fault line to a different CPU pin. Candidates are:
GPIO 26, physical 5
GPIO 34, physical 7 (input only pin, but that's OK)
GPIO 36, physical 9 (input only pin, but that's OK)
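
A small sketch for comparing the candidate pins, assuming the standard ESP32 rule that GPIO 34 through 39 are input-only and have no internal pull-up or pull-down resistors. Since the DRV8825 fault output is open-drain, an input-only pin would need an external pull-up if the board does not already provide one. The names `PinCandidate`, `kCandidates`, `isInputOnly`, `needsExternalPullup` and `candidateAt` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct PinCandidate {
    int gpio;        // ESP32 GPIO number
    int physicalPin; // physical pin number on the CPU board
};

// The three candidates listed in this comment.
constexpr std::array<PinCandidate, 3> kCandidates{{{26, 5}, {34, 7}, {36, 9}}};

// ESP32 GPIO 34..39 can only be used as inputs.
constexpr bool isInputOnly(int gpio) { return gpio >= 34 && gpio <= 39; }

// Input-only pins have no internal pull resistors, so an open-drain
// fault line wired to one of them relies on an external pull-up.
constexpr bool needsExternalPullup(int gpio) { return isInputOnly(gpio); }

// Bounds-checked access to the candidate list.
const PinCandidate& candidateAt(std::size_t i) {
    assert(i < kCandidates.size());
    return kCandidates[i];
}
```

By this check, GPIO 26 could use the internal pull-up, while GPIO 34 and GPIO 36 could not.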

va3wam commented 3 years ago

Interesting. Here are my test results:

Reset robot:
Right display loop: varies 65, 66 or 67 (so loop execution at rest is pretty consistent)
Other: varies between 8 and 9
MD: 0|0|0| (never increments)

Go to at least 30 degrees and hold the robot there, then return it to its back:
LD: varies between 2 and 3
loop: varies 63, 64, 65
Other: varies between 6 and 7
MD: 0|0|x| where x keeps going up

So I guess I see the same thing you do. I agree with your instinct that 13 may be an issue. I think 13 is also used for serial stuff when loading code but I may be wrong there.


va3wam commented 3 years ago

Mods from Doug have been applied to both robots. Initial testing looks good (at least the error counters do not climb any more). Need to be sure that error counts do not climb during an extended PID tuning session. The fix was to not use GPIO13. The Huzzah32 onboard LED circuitry may have been messing with electron routing.

nerdoug commented 3 years ago

I tried to reproduce lurching behaviour after fixing DRV1 fault / CPU LED conflict, and didn't really come to a conclusion. However, I came across another problem that could cause lurching behaviour, which is described in this email:

From: Doug Elliott canoe.eh@gmail.com
Date: Mon, Jan 18, 2021 at 11:04 PM
Subject: Twipe Performance
To: Andrew Mitchell va3wam@gmail.com
Cc: Doug Elliott canoe.eh@gmail.com

I was starting to look at TWIPe's tendency to lurch occasionally, and captured the telemetry data from a balance run to a copy of the spreadsheet. The data shows a couple of times where the time between readIMU calls was much bigger than the expected 12 msec. I decoded the runbit info, and each time, updateLED, which is runbit 24, had just executed. When I looked at the timestamps, they were indeed a second apart, except for one similar case which was off by a half second. This vaguely rang a bell, and sure enough, in loop(), there's a routine call every half second to update the network info in the left eye, in parallel with calling updateLED().
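
The check described above can be sketched as a small offline analysis. The telemetry layout here is an assumption: each sample is taken to carry a millisecond timestamp and a runbit mask in which bit N is set when routine N ran just before the sample. `Sample`, `runbitSet` and `lateSamples` are hypothetical names; the 12 msec period and runbit 24 for updateLED are from this email.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kExpectedPeriodMs = 12; // expected time between readIMU calls
constexpr int      kUpdateLedRunbit  = 24; // runbit assigned to updateLED

// Assumed telemetry record: timestamp plus a bitmask of routines that ran.
struct Sample {
    uint32_t timeMs;
    uint32_t runbits;
};

// True if the given runbit is set in the mask.
bool runbitSet(uint32_t runbits, int bit) {
    assert(bit >= 0 && bit < 32);
    return ((runbits >> bit) & 1u) != 0;
}

// Indices of samples that arrived more than maxGapMs after the previous one.
std::vector<std::size_t> lateSamples(const std::vector<Sample>& log,
                                     uint32_t maxGapMs) {
    std::vector<std::size_t> late;
    for (std::size_t i = 1; i < log.size(); ++i) {
        if (log[i].timeMs - log[i - 1].timeMs > maxGapMs) late.push_back(i);
    }
    return late;
}
```

Calling `lateSamples(log, 2 * kExpectedPeriodMs)` and then testing `runbitSet(log[i].runbits, kUpdateLedRunbit)` for each returned index would show whether the long gaps line up with display updates.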

This might be a factor in our lurching behaviour, but I seem to remember that we saw lurching before I added the CPU usage display. Maybe the network display was enough to cause trouble? Anyway, I have some ideas on how to fix this, and will try to put some code together tomorrow. I'll attach the spreadsheet in case you want to poke around in it. The CPU display updates are in rows 20, 102, 184 (exactly 82 rows apart)

The netinfo displays happen at each of the above, plus at rows 60 and 142 (also exactly 82 rows apart).

recording ideas before I forget them:

have netinfo and CPU display run in alternate half seconds
reduce complexity of both displays
- would constant width font reduce overhead to build the display?
see if there's a way to overwrite part of the OLED rather than complete rewrite
- if so, have smaller sequential display tasks, allowing IMU to be serviced between them
use FreeRTOS to give IMU routine priority with pre-emption
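
A rough sketch of the first idea in the list above, alternating the two displays between half-second slots so that a single pass through loop() never redraws both. `DisplayTask` and `DisplayScheduler` are hypothetical names, and which display takes the even slots is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>

enum class DisplayTask { None, NetInfo, CpuUsage };

// Hands out at most one display task per time slot, alternating between
// the CPU usage display (even slots) and the network info display (odd slots).
struct DisplayScheduler {
    uint32_t slotMs;                 // slot length in milliseconds
    uint32_t lastSlot = 0xFFFFFFFFu; // sentinel: no slot has been served yet

    explicit DisplayScheduler(uint32_t slotLengthMs = 500) : slotMs(slotLengthMs) {
        assert(slotMs > 0);
    }

    // Call on every pass through the main loop with the current time.
    DisplayTask poll(uint32_t nowMs) {
        const uint32_t slot = nowMs / slotMs;
        if (slot == lastSlot) return DisplayTask::None; // already served
        lastSlot = slot;
        return (slot % 2 == 0) ? DisplayTask::CpuUsage : DisplayTask::NetInfo;
    }
};
```

With 500 ms slots, each display still refreshes once per second, but the longest single stall between IMU reads is one display update instead of two.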

bal-0118-2208.zip

Cheers,

