i2c::I2c methods are unreliable in 'release' builds

LeonardMH commented 5 years ago

Environment

I am using a Texas Instruments TIVA-C LaunchPad (EK-TM4C123GXL). I have used the cortex-m-quickstart project to bootstrap.

This is my entire main.rs:

#![no_std]
#![no_main]

// you can put a breakpoint on `rust_begin_unwind` to catch panics
extern crate panic_halt;

use tm4c123x_hal::prelude::*;
use tm4c123x_hal::delay::Delay;
use cortex_m_rt::entry;

use tm4c123x_hal::i2c::I2c;

#[entry]
fn main() -> ! {
    let cpu_periph = cortex_m::Peripherals::take().unwrap();
    let dev_periph = tm4c123x::Peripherals::take().unwrap();

    // set up a delay controller based on systick
    let system_control = dev_periph.SYSCTL.constrain();
    let clocks = system_control.clock_setup.freeze();
    let mut delay = Delay::new(cpu_periph.SYST, &clocks);

    // Set up GPIOs that we're going to use
    let pf = dev_periph.GPIO_PORTF.split(&system_control.power_control);
    let mut red_led = pf.pf1.into_push_pull_output();
    let mut blu_led = pf.pf2.into_push_pull_output();
    let mut grn_led = pf.pf3.into_push_pull_output();

    // gpio banks needed for I2C
    let mut pa = dev_periph.GPIO_PORTA.split(&system_control.power_control);
    let mut pe = dev_periph.GPIO_PORTE.split(&system_control.power_control);

    // set up I2C1
    let scl1 = pa.pa6.into_af_push_pull(&mut pa.control);
    let sda1 = pa.pa7.into_af_open_drain(&mut pa.control);
    let mut i2c1 = I2c::i2c1(
        dev_periph.I2C1,
        (scl1, sda1),
        400.khz(),
        &clocks,
        &system_control.power_control,
    );

    // set up I2C2
    let scl2 = pe.pe4.into_af_push_pull(&mut pe.control);
    let sda2 = pe.pe5.into_af_open_drain(&mut pe.control);
    let mut i2c2 = I2c::i2c2(
        dev_periph.I2C2,
        (scl2, sda2),
        400.khz(),
        &clocks,
        &system_control.power_control,
    );

    loop {
        // do an i2c transaction forever
        match i2c2.write(0x3C, &[0, 0, 1, 2, 3]) {
            Ok(_) => { red_led.set_low(); }
            Err(_) => { red_led.set_high(); }
        };
    }
}

This is my .cargo/config:

[target.'cfg(all(target_arch = "arm", target_os = "none"))']
runner = "arm-none-eabi-gdb-py -q -x openocd.gdb"

rustflags = [
  # LLD (shipped with the Rust toolchain) is used as the default linker
  "-C", "link-arg=-Tlink.x",
]

[build]
target = "thumbv7em-none-eabihf" # Cortex-M4F and Cortex-M7F (with FPU)

This is the relevant section of my Cargo.toml:

[dependencies]
embedded-hal = "0.2.1"
cortex-m = "0.5.7"
cortex-m-rt = "0.6.3"
cortex-m-semihosting = "0.3.1"
panic-halt = "0.2.0"
tm4c123x = "0.7.0"
tm4c123x-hal = "0.6.0"

# this lets you use `cargo fix`!
[[bin]]
name = <removed>
test = false
bench = false

[profile.release]
opt-level = "s"
codegen-units = 1
debug = true # symbols are nice and they don't increase the size on Flash
lto = true

Problem

With the proper I2C lines instrumented, when I perform cargo clean && cargo run --release I see the following waveform for a single transaction (expect: S + ADDR + 0, 0, 1, 2, 3 + P):

screen shot 2018-11-02 at 18 53 54

It looks great, performance-wise, but it is missing the first byte! When I perform cargo clean && cargo run I see the following waveform for a single transaction:

screen shot 2018-11-02 at 19 08 05

It is poor from a performance standpoint, but gets all of the bytes in there.

Suggestions

I don't yet know for sure what is going on here. I'm relatively new to rust and llvm (this feels like maybe something llvm is doing). The I2C code in tm4c123x_hal::i2c appears to be correct based on comparison with other C-based Launchpad projects which I know are working.

The only lead I have right now will be to start looking at the I2C peripheral. I have seen issues in the past on this peripheral where checking the busy bit too soon after setting the run bit would result in the busy bit not being set by the time that I checked it. I've never been able to get an answer on how many cycles are actually required between setting the run bit and checking the busy bit, but right now I suspect that to be the issue.

I will be playing around with this some more this weekend and if I come across anything I will update, in the meantime if anyone has suggestions, they would be very much appreciated.

LeonardMH commented 5 years ago

I think I have proven that the issue is due to the busy bit being checked too quickly. I have vendored this library so I can edit it and made the following change:

diff --git a/vendor/tm4c123x-hal/src/i2c.rs b/vendor/tm4c123x-hal/src/i2c.rs
index 6772aa3..d39ec3a 100644
--- a/vendor/tm4c123x-hal/src/i2c.rs
+++ b/vendor/tm4c123x-hal/src/i2c.rs
@@ -1,5 +1,7 @@
 //! Inter-Integrated Circuit (I2C) bus

+use cortex_m::asm::delay;
+
 use tm4c123x::{I2C0, I2C1, I2C2, I2C3};

 use crate::gpio::gpioa::{PA6, PA7};
@@ -147,6 +149,8 @@ macro_rules! hal {
                             .run().set_bit()
                     });

+                    delay(5000);
+
                     for (i,byte) in (&bytes[1..]).iter().enumerate() {
                         busy_wait!(self.i2c, busy, bit_is_clear);

With this change, this is what I see on the wire:

Obviously, that delay is ridiculous, but I was still seeing bad results with lower delay values like 10, or 100. Searching for a lower bound now.

LeonardMH commented 5 years ago

It appears that a single nop is sufficient here (although maybe that is dependent on the slave device). My previous issues with using small delay values were just due to me not correctly reloading the binary between test runs. Current diff has changed to:

diff --git a/vendor/tm4c123x-hal/src/i2c.rs b/vendor/tm4c123x-hal/src/i2c.rs
index 6772aa3..c002347 100644
--- a/vendor/tm4c123x-hal/src/i2c.rs
+++ b/vendor/tm4c123x-hal/src/i2c.rs
@@ -1,5 +1,7 @@
 //! Inter-Integrated Circuit (I2C) bus

+use cortex_m::asm::nop;
+
 use tm4c123x::{I2C0, I2C1, I2C2, I2C3};

 use crate::gpio::gpioa::{PA6, PA7};
@@ -147,6 +149,8 @@ macro_rules! hal {
                             .run().set_bit()
                     });

+                    nop();
+
                     for (i,byte) in (&bytes[1..]).iter().enumerate() {
                         busy_wait!(self.i2c, busy, bit_is_clear);

I have also changed my write buffer to [0x55, 0, 1, 2, 3] to better distinguish the first byte from the second. This results in the following trace:

In this trace there is a ~50-100us delay between the ADDR + DATA0 phase and the DATA1 byte that didn't exist in my original 'release' trace; I think that may just be clock stretching by the device that I am talking to (I have changed slave devices since my original report).

If I remove the nop instruction I lose the first data byte, and the second data byte is delivered in its place, as you can see here:

The delay between ADDR + DATA0 and DATA1 is smaller in this case, but again, I think that is just a function of the device I am talking to responding differently to the DATA0 value.

So, @thejpster from a rust/embedded_hal perspective what is the right way to handle this?

LeonardMH commented 5 years ago

This appears to affect all of the i2c methods in one way or another. I have moved the nop instruction to be the top line of the busy_wait macro and that appears to give me much better results. I'm not seeing any performance hit from this, but I think my slave device is the limiting factor right now. I will have to test with a more performant device to fully characterize the impact.

thejpster commented 5 years ago

Thank you for the very detail diagnosis. Very impressive!

It sounds like adding a no-op to the busy wait routine is the cleanest fix.

LeonardMH commented 5 years ago

Thanks, I know the pain of vague bug reports and try not to inflict that on others 😃.

I agree that moving the nop into busy_wait is the best choice. I'll open a PR later this week when I get a chance to do some more testing with a faster slave device.

I also wonder how/if the delay required would be affected by higher I2C clock speeds. I'm trying to run at 400kHz but seeing something more like to 180kHz on the wire I think due to the TPR calculation being very conservative, I may try hand tweaking that and retesting. There's also always the possibility that someone is using the "fast plus" mode (1MHz) or the "high speed" mode (3.4Mhz), but I don't have any slave devices that operate at those speeds... I don't expect that the I2C clock speed is actually a factor anyways.

On another note, where is a good place to ask questions related to this crate (+ the tm4c123x crate)? I have a few:

Documentation contribution

I'd like to contribute some documentation, specifically around:

Configuring GPIOs: All of the existing documentation just points to the STMicro crate as an example of how to do this, but the APIs are different enough that this isn't really all that helpful.
Configuring I2C: I can't find documentation or examples of how to do this anywhere with any of the embedded_hal crates. I eventually stumbled my way through the HAL sources while referring to some existing C code I had and ended up with what I believe is the 'right' way to do it. Ultimately this probably ended up being a useful exercise, but it was also frustrating and I was fortunately feeling obstinate or would have given up before realizing the simplicity of the way it is supposed to be done.

What would be the best way to contribute this? Add to the README? Expand the rustdoc comments in the sources?

Interrupt handlers

Is it at all possible to use the interrupt! macro to mark a function as an interrupt handler? I tried every method I could think of to import this from the tm4c123x crate but rust couldn't find it. I also can't find this in the docs, I had to refer to the source code.

I2C clock recovery

I'll file another issue for this, since I'm starting to think this is an actual crate issue, and the correct fix will require some discussion.

In my testing, if I power down the slave device while the TIVA is in the middle of a transaction the SCL line is held low, restoring power to the slave device doesn't help anything. I've confirmed that the TIVA is the one holding the line and that it is stuck attempting to start a new transaction but continuously erroring out. I have a few potential solutions, but nothing clean yet.

Generic resource sharing (a.k.a generic I2C/GPIO types)

Ehh, this is kind of a big question and needs code samples so I'll open another issue for this one.

EDIT: Got this figured out at least for I2C/SPI, GPIOs should be similar I'm sure, so probably nothing to do here. I just needed a little bit better understanding of rust Traits/Generics/Macros.

thejpster commented 5 years ago

where is a good place to ask questions related to this crate?

Feel free to open issues, on this crate or on https://github.com/m-labs/dslite2svd.

Is it at all possible to use the interrupt! macro to mark a function as an interrupt handler?

Yes, see https://github.com/thejpster/monotron for an example. Look for TIMER1A and TIMER1B.

LeonardMH commented 5 years ago

Yes, see https://github.com/thejpster/monotron for an example. Look for TIMER1A and TIMER1B.

Awesome, thanks! I was hoping to get SPI and UART working this weekend as well so there is a treasure trove of information there.

I still haven't been able to solve this question on how to use generic GPIOs, and referring to the monotron source I see you have the same problem, for example:

struct Joystick {
    up: hal::gpio::gpioc::PC6<hal::gpio::Input<hal::gpio::PullUp>>,
    down: hal::gpio::gpioc::PC7<hal::gpio::Input<hal::gpio::PullUp>>,
    left: hal::gpio::gpiod::PD6<hal::gpio::Input<hal::gpio::PullUp>>,
    right: hal::gpio::gpiod::PD7<hal::gpio::Input<hal::gpio::PullUp>>,
    fire: hal::gpio::gpiof::PF4<hal::gpio::Input<hal::gpio::PullUp>>,
}

This is what I have been doing as well, but it feels suboptimal. I want my driver (in this case Joystick) to be completely generic on the type over a certain GPIO configuration. The first syntax I tried (and what I would consider "nice") was something like:

use hal::gpio;
type PullUpInput = gpio::Input<gpio::PullUp>;

struct Joystick {
    up: PullUpInput,
    down: PullUpInput,
    ...
}

impl Joystick {
    fn new(up: PullUpInput, down: PullUpInput, ...) -> Joystick {
        Joystick { up, down }
    }
}

And then let's say you would use this with:

let pe = dp.GPIO_PORTE.split(&sysctl.power_control);
let mut joy = Joystick::new(pe.pe4.into_pull_up_input(), pe.pe5.into_pull_up_input();

But if you wanted to, you could also do this instead without any need to change the definition of Joystick:

let pa = pa.GPIO_PORTA.split(&sysctl.power_control);
let mut joy = Joystick::new(pa.pa5.into_pull_up_input(), pa.pa6.into_pull_up_input();

But the compiler doesn't let me do this, it complains that gpio::Input<gpio::PullUp> doesn't implement the Sized trait. I also tried implemented this completely generically with something like (sorry going off of memory):

use embedded_hal::digital;
struct Joystick<UP, DN> {
    up: UP,
    down: DN,
    ...
}

impl <UP, DN> Joystick<UP, DN> 
    where UP: digital::InputPin,
          DN: digital::InputPin,
{
    ...
}

But this also didn't work (I can't remember why right now, a similar mechanism works for I2C and SPI just fine), it was also starting to quickly get unwieldy so I gave up.

I'm essentially trying to use the TIVA to interface to two different pieces of hardware that have the same top-level behavior, but use different pinouts for I2C, SPI, and some GPIOs. Being able to specify the hardware layout at one level and then never think about it again would be really nice, but I haven't been able to figure out how to do this for GPIOs so I've resorted to compile time options. Is it possible to do what I'm wanting to do?

thejpster commented 5 years ago

Yes, the traits are unsized - you can only use trait references, or concrete type T that implements some trait. I think your last example is the right way to go, but InputPin needs a <MODE> or something, with an appropriate qualifier on MODE.

thejpster commented 5 years ago

I'll open a PR later this week.

Just to follow up on this, do you still plan to submit a PR, or shall I open one?

LeonardMH commented 5 years ago

I still plan to submit a PR. We’re seeing this exact issue on a different TI part and I’m working on getting a more official recommendation on how much delay is needed between setting the RUN bit and checking the BUSY bit.

EDIT: With that said, if you’re trying to bundle fixes for a release then you can probably just go ahead with the single nop as it seems to work fine in my testing so far.

LeonardMH commented 5 years ago

@thejpster, I'm told that 8 clock cycles between setting the RUN bit and checking the BUSY bit should be sufficient. Working on a PR now.

LeonardMH commented 5 years ago

PR https://github.com/thejpster/tm4c-hal/pull/19 opened to resolve this issue.

thejpster commented 5 years ago

PR #19 merged

rust-embedded-community / tm4c-hal