Motor2040 encoders intermittently return infinite speed readings in micropython

pimoroni / pimoroni-pico

Libraries and examples to support Pimoroni Pico add-ons in C++ and MicroPython.

https://shop.pimoroni.com/collections/pico

MIT License

1.33k stars 500 forks source link

Motor2040 encoders intermittently return infinite speed readings in micropython #579

Open tomoinn opened 2 years ago

tomoinn commented 2 years ago

In micropython, calling the capture() method on an Encoder can sometimes return an infinite speed in the radians_per_second field. This is obviously a little inconvenient when using it to drive the input of a PID controller. Most readings are sensible, but after normally around 30s of use I get one that comes back as infinity.

I'd suggest it would be better to either return None and document that this might happen, or return the previous valid value. As it is I've done the latter in my code, but it took a long time to track down!

tomoinn commented 2 years ago

This is using a motor2040 board with the latest build of micropython and a 110 ratio micro metal gear motor, if that makes any difference. The result of a capture when this fails is:

(count=-5857, delta=-1, frequency=-inf, revolutions=-1.10928, degrees=-399.3409, radians=-6.969814, revolutions_delta=-0.0001893939, degrees_delta=-0.06818181, radians_delta=-0.001189997, revolutions_per_second=-inf, revolutions_per_minute=-inf, degrees_per_second=-inf, radians_per_second=-inf)

ZodiusInfuser commented 2 years ago

Hi @tomoinn. Apologies for the late response. I've only just has this issue raised with me.

This is puzzling, as the capture should never return an infinity, and I never encountered one when writing the PID examples for motor2040 and inventor2040. Any chance you can share your code with me that produced this issue?

Below is the internal C++ code that creates the capture. As you can see, there is a check that is meant to handle a divide by zero. That wouldn't necessarily handle an infinity though, but I can't see why an infinity would be generated in the first place. Perhaps your fresh eyes can reveal something?

Encoder::Capture Encoder::capture() {
  // Take a capture of the current values
  int32_t count = enc_count;
  int32_t cumulative_time = enc_cumulative_time;
  enc_cumulative_time = 0;

  // Determine the change in counts since the last capture was taken
  int32_t change = count - last_capture_count;
  last_capture_count = count;

  // Calculate the average frequency of steps
  float frequency = 0.0f;
  if(change != 0 && cumulative_time != INT32_MAX) {
    frequency = (clocks_per_time * (float)change) / (float)cumulative_time;
  }

  return Capture(count, change, frequency, enc_counts_per_rev);
}

From: https://github.com/pimoroni/pimoroni-pico/blob/main/drivers/encoder/encoder.cpp

tomoinn commented 2 years ago

Hey - so, I'm not sure it's actually possible for cumulative_time to be zero, but your check isn't preventing a divide by zero, it's doing a/b and checking whether a is zero, it's checking whether b is INT32_MAX. So I guess cumulative_time is coming up as zero for some reason?

The code is very simple, it is using asyncio although I can't imagine that's making much of a difference, but the bug only manifests after a few minutes of runtime.

The relevant bit of code is as follows, it's slightly complicated by having the option to set either speed in radians/s or power, enabling or disabling the PID loop accordingly, but as you can see it's not doing anything very surprising when it comes to calling the capture method:

  def _update_current_speed(self):
      # Get current speed information from the encoder
      capture = self.encoder.capture()
      new_current_speed = -capture.radians_per_second if self.inverted else capture.radians_per_second
      if math.isinf(new_current_speed) or math.isnan(new_current_speed):
          print(f'warning, infinite or nan speed reported from encoder, capture was {capture}')
      else:
          self.current_speed = new_current_speed

  def tick(self):
      """
      Call repeatedly to update speeds on the motor based on the PID model, if active. If a target speed is not set
      but power is then the value of power will be used to drive the motor directly, bypassing the PID system.
      """
      if self._speed is None:
          if self._power is not None:
              # Directly drive the
              self.motor.speed(self._power)
              self._update_current_speed()
          # Short circuit the tick if we've got an explicit power set
          return
      if self.enabled:
          # Get current speed information from the encoder
          self._update_current_speed()
          # Configure the PID setpoint from our target speed value
          self.pid.setpoint = self._speed
          # Update the PID controller
          updated, accel = self.pid(self.current_speed)
          if updated:
              current_motor = self.motor.speed()
              if self.logging:
                  print(f'power={current_motor}, capture={self.current_speed}, setpoint={self.pid.setpoint}, ',
                        f'coefs={self.pid.components}, accel={accel}')
              self.motor.speed(current_motor + accel)
      else:
          self.motor.speed(0)

This code effectively wraps a Motor and Encoder along with a PID control for angular velocity, with the check to avoid accidentally passing an inf into the PID loop it works beautifully, so I'm not too worried right now. It is an odd bug though!

tomoinn commented 2 years ago

The tick method above is called on each of four motor / encoder pairs, within an asyncio loop. I don't imagine this makes any difference though!

ZodiusInfuser commented 2 years ago

Ha! I just spotted that bit about the check not explicitly dealing with cumulative_time being zero.

I can understand why only A is checked, as if there is no count change there will be no cumulative_time. That also means that a zero cumulative_time would have a zero count change too, so the check could be changed without any impact. I'm not in a position to test this on hardware currently, and was going to suggest you try, but it seems our Github actions (that generate our micropython builds) is borked (https://github.com/pimoroni/pimoroni-pico/actions/runs/3594561239/jobs/6052982497), so we'll have to try another day.

It could well be something to do with asyncio. I am not too familiar with how that module works (whether it's interrupt based or second core). If the latter, then it's entirely possible that there's the random chance that only one of the two encoder parameters has been updated when the capture is requested. Maybe, i'm clutching at straws though.

Also, how frequently is your ticks function getting called?

tomoinn commented 2 years ago

So, I've not looked at the rp2040 micropython implementation of asyncio, but I don't believe it would be using the second core - the programming model is a single thread that schedules tasks in a cooperative fashion, and the call to create the loop blocks. It shouldn't need any interrupt logic either, there are explicit await statements in the code that return control back to the scheduler.

The ticks are called reasonably fast (2000 per second in theory for each encoder, I'm using all four). That's the potential maximum rate, as there's an 'await asyncio.sleep(1/2000)' in the loop, but any execution time for the actual tick would reduce that. It's not super precise as the PID model has its own 'only update if time delta > x' logic.

One thing I've noticed is that some patterns of activity trigger the error faster than others. In 'normal' use just driving the thing around (this is a four motor chassis) it triggers after about ten minutes, but if I run a specific PID tuning code that sets maximum target speeds and mess up the PID tune such that it self oscillates (so going from max speed to min and back as fast as mechanically possible) it triggers it almost immediately most times. That shouldn't be changing anything in the software, but it will change the actual readouts from the encoder. I've no idea why that would make a difference, but I wonder if there's a zero crossing issue somewhere? That impulse test is causing the motor to change directions very fast.

tomoinn commented 2 years ago

(I'm being cagey about showing you the entire code because it's for a contract and under NDA, but I'm hoping to get them to open source at least this part of it, at least as an example of how to do a motor / encoder / PID in a robust fashion)

ZodiusInfuser commented 2 years ago

Thanks for the explanation of asyncio, and understandable about the contract thing.

I'll admit calling capture 2000 times per second is faster than I ever tested, and well beyond what I imagine to be practical for 110:1 micro metal gear motors. This is fine for getting your count and delta, but with just 1 or 2 ticks the calculated frequency/speed estimate will be quite noisy.

The only thing I can think is that calling capture that regularly, with the many interrupts of 4x motors oscillating, means there's a higher chance of a condition where enc_cumulative_time is zero when enc_count has changed.

Encoder::Capture Encoder::capture() {
  // Take a capture of the current values
  int32_t count = enc_count;
  int32_t cumulative_time = enc_cumulative_time;
  enc_cumulative_time = 0;
  ...

There are a number of instructions gap between enc_count being incremented and enc_cumulative_time being updated when the interrupt is called, so it could be possible.

void Encoder::microstep(int32_t time, bool up) {
  if(up) {
    enc_count++;
    if(++enc_step >= (int16_t)enc_counts_per_rev) {
      enc_step -= (int16_t)enc_counts_per_rev;
      enc_turn++;
    }
  }
  else {
    enc_count--;
    if(--enc_step < 0) {
      enc_step += (int16_t)enc_counts_per_rev;
      enc_turn--;
    }
  }
  microstep_time = 0;

  if(time + enc_cumulative_time < time)  // Check to avoid integer overflow
    enc_cumulative_time = INT32_MAX;
  else
    enc_cumulative_time += time;
}

Otherwise, under normal operation there is no way for cumulative_time to be zero if the count has changed. Even with the fastest detectable pulse the value would be greater than zero because of the debounce in the PIO. int32_t time_received = (received & TIME_MASK) + ENC_DEBOUNCE_TIME;

Assuming this theory is correct, what would the correct value be for the frequency/speed etc? There was a count change since the last capture, so zero would be as wrong as infinity, but would it be more stable for any code trying to use it?

Alternatively, I could make capture check for enc_cumulative_time being zero and try reading the value a few more times? Only falling back to zero frequency if that fails? I wouldn't put an open-ended loop in there.

What are your thoughts?

tomoinn commented 2 years ago

So if this is a race condition where I'm reading everything inbetween updates to count and cumulative_time it might make sense to re-order those in the code such that cumulative time is updated first? I'd also be fine with returning a 'go away and wait' value of some kind under those conditions, but crucially not resetting counts in that case. So sometimes you'd call it and it would come back with 'not enough information to give you a frequency', and you'd just have to deal with that (use previous value, or wait and ask again)

tomoinn commented 2 years ago

In effect I'm treating the inf as that 'go away I'm not ready' value in my current code, and that works. NaN would maybe be more explainable though.

tomoinn commented 2 years ago

I don't think returning a valid, but incorrect, value is the way to go