[i2c] NAK detection in Fast-mode Plus

engdoreis commented 1 year ago

Description

Introduction

I2C host is erroneously detecting NAK when communicating to the temperature sensor HDC1080 in Fast-Speed Plus mode (1MHz). On the image below, the host sends the address to the device (0x40) with read bit and receives an ACK. However, the NAK IRQ goes high (INTR_STATE.nak) indicating that a NAK was detected.

This could be related to the fact that the i2c is configured to 1Mhz but the SCL line is actually at 277kHz as shown in the image below, this is described in the issue #18492

How to reproduce:

Checkout the PR
Prepare an FPGA CW310 with a BoB connected to the PMOD.

Run the test i2c_host_hdc1080_humidity_temp_test with the command:

bazel test --define DISABLE_VERILATOR_BUILD=true --define bitstream=gcp_splice --test_tag_filters=-verilator,-dv,-broken --build_tests_only //sw/device/tests:i2c_host_hdc1080_humidity_temp_test --cache_test_results=no

Output


I00002 test_rom.c:243] Test ROM complete, jumping to flash (addr: 20000480)!
I00000 ottf_main.c:110] Running sw/device/tests/i2c_host_hdc1080_humidity_temp_test.c
I00001 i2c_testutils.c:270] Setting i2c to Standard (100kHz) mode.
I00002 i2c_testutils.c:302] period:400 nanos, cycles={fall=1, rise=1, hi=11, lo=12}
I00003 i2c_host_hdc1080_humidity_temp_test.c:131] Starting test read_manufacture_id...
I00004 i2c_host_hdc1080_humidity_temp_test.c:131] Successfully finished test read_manufacture_id in 7374 cycles or 737 us @ 10 MHz.
I00005 i2c_host_hdc1080_humidity_temp_test.c:132] Starting test read_product_id...
I00006 i2c_host_hdc1080_humidity_temp_test.c:132] Successfully finished test read_product_id in 7282 cycles or 728 us @ 10 MHz.
I00007 i2c_host_hdc1080_humidity_temp_test.c:133] Starting test read_temperature...
I00008 i2c_host_hdc1080_humidity_temp_test.c:133] Successfully finished test read_temperature in 136313 cycles or 13631 us @ 10 MHz.
I00009 i2c_host_hdc1080_humidity_temp_test.c:134] Starting test read_humidity...
I00010 i2c_host_hdc1080_humidity_temp_test.c:134] Successfully finished test read_humidity in 79810 cycles or 7981 us @ 10 MHz.
I00011 i2c_testutils.c:274] Setting i2c to Fast (400kHz) mode.
I00012 i2c_testutils.c:302] period:400 nanos, cycles={fall=1, rise=1, hi=2, lo=4}
I00013 i2c_host_hdc1080_humidity_temp_test.c:131] Starting test read_manufacture_id...
I00014 i2c_host_hdc1080_humidity_temp_test.c:131] Successfully finished test read_manufacture_id in 5470 cycles or 547 us @ 10 MHz.
I00015 i2c_host_hdc1080_humidity_temp_test.c:132] Starting test read_product_id...
I00016 i2c_host_hdc1080_humidity_temp_test.c:132] Successfully finished test read_product_id in 5370 cycles or 537 us @ 10 MHz.
I00017 i2c_host_hdc1080_humidity_temp_test.c:133] Starting test read_temperature...
I00018 i2c_host_hdc1080_humidity_temp_test.c:133] Successfully finished test read_temperature in 130797 cycles or 13079 us @ 10 MHz.
I00019 i2c_host_hdc1080_humidity_temp_test.c:134] Starting test read_humidity...
I00020 i2c_host_hdc1080_humidity_temp_test.c:134] Successfully finished test read_humidity in 74606 cycles or 7460 us @ 10 MHz.
I00021 i2c_testutils.c:278] Setting i2c to FastPlus (1000kHz) mode.
I00022 i2c_testutils.c:302] period:400 nanos, cycles={fall=1, rise=1, hi=1, lo=2}
I00023 i2c_host_hdc1080_humidity_temp_test.c:131] Starting test read_manufacture_id...
-- Test timed out at 2023-05-30 15:05:01 UTC --

TIMEOUT: //sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom (Summary) /home/doreis/.cache/bazel/_bazel_doreis/71b02480bfcfe3a0741719bf45ac42d1/execroot/lowrisc_opentitan/bazel-out/k8-fastbuild/testlogs/sw/device/tests/i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom/test.log Aspect @rules_rust//rust/private:clippy.bzl%rust_clippy_aspect of //sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom up-to-date (nothing to build) INFO: Elapsed time: 60.488s, Critical Path: 60.18s INFO: 2 processes: 2 local. INFO: Build completed, 1 test FAILED, 2 total actions //sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom TIMEOUT in 60.1s /home/doreis/.cache/bazel/_bazel_doreis/71b02480bfcfe3a0741719bf45ac42d1/execroot/lowrisc_opentitan/bazel-out/k8-fastbuild/testlogs/sw/device/tests/i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom/test.log

INFO: Build completed, 1 test FAILED, 2 total actions

engdoreis commented 1 year ago

@johngt

a-will commented 1 year ago

Since this is a read transaction, it's the I2C host that is creating a NAK (which it must do at the end of its read transaction). That waveform looks like a perfectly good 2-byte read. Was that the intention?

engdoreis commented 1 year ago

I updated the description to make it clearer. So the waves are correct, the problem is that the nak bit in the INTR_STATE goes high after this read transaction, meaning that the i2c detected a NAK from the device during the read transaction, which shouldn't'd happen as the last NAK shown in the image is actually created by the host as you mentioned and this doesn't happen in standard and high-speed mode.

a-will commented 1 year ago

This is peculiar on a couple of levels, hehe.

Despite the IP registering a NAK, the transaction continues normally, when it should really stop and wait for software to clear the NAK condition. It turns out that there is nothing in the i2c FSM to prevent continuing with the next entry in the FIFO. Depending on how you look at it, that's a bug (alternatively, a significant performance impediment).

In addition, we might be seeing bad effects for especially small t_low and t_high times. There is latency in the data path, and I'm not sure we've established the minimum values to avoid conflicts. We might be seeing t_low=2 be too small to accommodate that latency reliably.

Edit: Note also that the NAK event (for INTR_STATE) only occurs when the host side is driving the data bits. That means this happened during the command/address byte, and the IP failed to properly sample the low ACK bit following the transition from the high RNW bit.

a-will commented 1 year ago

With t_low=2, t_hd=1, we are in the HoldBit and ClockLowAck states for a single cycle each. The outputs are registered, so even though the HoldBit state nominally begins to drive SCL low, that doesn't take effect until the cycle after, when we hit the ClockLowAck state.

At the moment we enter the ClockLowAck state (+ board and pad delays), that is when the device may start driving an ACK. However, we also have a 2-flop synchronizer on the input side. So the ACK could only arrive 2 cycles after entering ClockLowAck at minimum. Since we're in ClockLowAck for only one cycle, we still see SDA high in ClockPulseAck, causing the NAK event to fire.

Yup, t_low=2 is unsupported. t_low=3 might provide enough room for our setup time, but that assumes I've accounted for everything up there.

engdoreis commented 1 year ago

If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?

Regarding the fact that the transaction continues after the NAK, I noticed that and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is an inconvenient behaviour, I think it should be fixed if possible.

a-will commented 1 year ago

If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?

The problem is most accurately described by the number of clock cycles that make up each phase, but given a specific target frequency, yes, a high enough input frequency will satisfy the minimum divisor value. In this case, we haven't established whether t_high=2 never would surface an error, and that would be a requirement for a 5 MHz input to be enough for a 1 MHz output. For a nominal 50% duty cycle, 6 MHz would be the minimum (t_high == t_low == 3).

On the side, target mode's requirements would be even higher, since there needs to be tolerance for the host's timing not falling in convenient places for sampling. If 6 MHz is the minimum for host mode and target mode operates similarly, most likely something like 8 MHz would be the minimum for fast-plus speeds in target mode.

GregAC commented 1 year ago

The key thing to establish/convince ourselves of here: Will this work correct for fast mode plus when we are running on the 25 MHz peripheral clock? From what I'm reading above I believe the answer is yes.

Could be worth trying to do a one off FPGA build that's just Ibex and an I2C instance so we can run at a higher clock speed and trial the fast mode plus.

msfschaffner commented 1 year ago

@msfschaffner

a-will commented 1 year ago

You can use my fpga-24mhz-experiment branch as a base for your i2c code + the files for a "bitstreams cache entry" attached here.

Drop the 03af3772c directory into a standalone bazel cache (say, at ${HOME}/bitstream-experiments/cache/03af3772c), then use this bitstream by setting BAZEL_BITSTREAMS_CACHE and BITSTREAM:

BAZEL_BITSTREAMS_CACHE=${HOME}/bitstream-experiments/ BITSTREAM="--offline 03af3772c" bazel test ...

bazel will pick up that entry like any other cached bitstream and splice the test ROM or ROM as needed.

Note that this bitstream does not meet timing in one specific way: IO_CLK is too fast for SPI_HOST0 to meet the setup time at full blast. Everything else is good. So this bitstream is fine to use for I2C experiments with a 6 MHz peripheral clock, but don't use it for running SPI_HOST0 at 12 MHz. ;)

03af3772c.tar.gz

msfschaffner commented 1 year ago

So it looks like two issues surfaced here:

1) minimum supported t_low / t_high times. we can probably address this by updating the docs and update the FPGA mapping so that the I2C runs at a higher peripheral frequency. 2) the NAK detection behavior, which seems problematic. it sounds like there is a SW workaround (just issue one read transaction at a time), but that could have a noticeable performance impact.

For 2) I think it would be good to analyze what the potential design impact would be, so that we can triage this as an ECO (depending on the impact, we may not be able to take it). @GregAC is that something you could look into?

a-will commented 1 year ago

Aye, for (2), the first workaround that came to mind was to treat the fmt fifo like it is only 1-deep. (though the read data phase could technically permit more entries, since the host supplies ACK / NAK)

GregAC commented 1 year ago

If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?

Regarding the fact that the transaction continues after the NAK, I noticed that and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is an inconvenient behaviour, I think it should be fixed if possible.

Just to clarify here @engdoreis is this showing the NAK that's been detected and the erroneous continued transaction isn't on the trace?

crteja commented 1 year ago

So it looks like two issues surfaced here:

minimum supported t_low / t_high times. we can probably address this by updating the docs and update the FPGA mapping so that the I2C runs at a higher peripheral frequency.

the NAK detection behavior, which seems problematic. it sounds like there is a SW workaround (just issue one read transaction at a time), but that could have a noticeable performance impact.

For 2) I think it would be good to analyze what the potential design impact would be, so that we can triage this as an ECO (depending on the impact, we may not be able to take it). @GregAC is that something you could look into?

I have tried to reproduce the issue using the i2c_host_perf_seq.sv. In all the three modes supported, there is no NAK interrupt. In Host mode, IP sends a NACK and then STOP transaction similar to the waves in the issue, except there is no NAK interrupt. in RTL, NAK interrupt is asserted only in one state( which is used to detect ACK/NACK). https://github.com/lowRISC/opentitan/blob/cebbe1809bc409850f2bf563fab8c3dd96ccfc0c/hw/ip/i2c/rtl/i2c_fsm.sv#L531-L540

Regarding NAK behaviour, this is the way I2C FSM is implemented. IP doesn't issue stop if there is a NAK from device, NAK interrupt is raised but IP will still send transactions out based on FMT FIFO. It is the host's responsibility to clear FIFO in case of a NAK and restart the transaction

Waves:

note from Spec:

GregAC commented 1 year ago

This looks to be more a specification/programming model bug than an actual RTL bug.

Perhaps the author and some reviewers felt it was implicit/obvious that the NAK interrupt should cause the FSM to stop processing further items from the FIFO, but that's not explicitly stated in the spec.

Producing an ECO to alter this behaviour could also be complex. Whilst doing something that causes the FSM to stop processing following the NAK may be easy there's a question of what starts it up again? Perhaps we'd raise the NAK interrupt and clear the FIFO? Might be feasible with a few gates depending on whether there's an identifiable FIFO clear signal in the netlist that can be tapped into.

a-will commented 1 year ago

If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)? Regarding the fact that the transaction continues after the NAK, I noticed that and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is an inconvenient behaviour, I think it should be fixed if possible.

Just to clarify here @engdoreis is this showing the NAK that's been detected and the erroneous continued transaction isn't on the trace?

No, two separate things.

First, the IP thought a NAK occurred when its FSM checked the response to the address / command byte. Due to latency in the path, it checked a delayed sample of its RNW bit, not the actual ACK that was on the bus, working its way through a prim_flop_2sync.

Then, the IP behaved improperly relative to its own internal notion of the state. It thought it received a NAK, so the transaction ought not to continue, but it just kept going. Thus, the trace doesn't show any indication something went wrong internally.

GregAC commented 1 year ago

Thanks for the clarification @a-will though looking at the trace from @engdoreis's screen shot that doesn't show an instance of a falsely identified NAK? It's an actual NAK (which is the interpretation from the logic analyser as well as my reading of the trace) followed by more transactions (the I2C just clocking in expectation of bytes to read that don't appear).

What's your view on what the I2C documentation says vs the RTL behaviour, the details are here https://opentitan.org/book/hw/ip/i2c/doc/theory_of_operation.html#byte-formatted-programming-mode?

With regards to

It thought it received a NAK, so the transaction ought not to continue, but it just kept going.

I agree, in that the NAK should stop the transaction and raise the interrupt so the software can decide what to do but the documented behaviour doesn't specify anything around NAK behaviour and a fix here would involve updating the documentation and programming model.

johngt commented 1 year ago

@moidx - could you speak with customer side to see if the case of NACKs being reported is going to be an issue. There are SW workarounds that will impact performance.

engdoreis commented 1 year ago

@GregAC, I posted two traces: In the first (in the issue description) it shows the i2c host receiving an ACK, but it erroneously flagged a NAK in the INTR_STATE.nak register, if you execute the test, you'll see software retiring until timeout.

The second trace here shows the i2c host receiving a NAK, in this case it correctly flagged in the INTR_STATE.nak register. But instead of stopping at the addressing stage, it kept going as pointed out by @a-will.

a-will commented 1 year ago

Thanks for the clarification @a-will though looking at the trace from @engdoreis's screen shot that doesn't show an instance of a falsely identified NAK? It's an actual NAK (which is the interpretation from the logic analyser as well as my reading of the trace) followed by more transactions (the I2C just clocking in expectation of bytes to read that don't appear).

Oops, I didn't pay attention to which trace you quoted, haha. Sorry about that! I blurted out an explanation of the first trace, but you grabbed the second.

What's your view on what the I2C documentation says vs the RTL behaviour, the details are here https://opentitan.org/book/hw/ip/i2c/doc/theory_of_operation.html#byte-formatted-programming-mode?

With regards to

It thought it received a NAK, so the transaction ought not to continue, but it just kept going.

I agree, in that the NAK should stop the transaction and raise the interrupt so the software can decide what to do but the documented behaviour doesn't specify anything around NAK behaviour and a fix here would involve updating the documentation and programming model.

The documentation shows we didn't totally think through how this FIFO would interact with responses outside the happy path. This has been a common refrain for the i2c IP, and when I last looked at the FSM while reviewing updates for target mode, I missed this myself. If the FSM proceeds past all negative responses, there is little point in having a FIFO in the first place, hehe.

So agreed, we'll need to fix the documentation as well. :)

marnovandermaas commented 1 year ago

Ok, so from my understanding, we need to update the documentation to say two extra things:

That t_low and t_high must be larger than 2
That the host must explicitly clear the FMT FIFO if a NAK interrupt is raised, otherwise the IP will continue the transaction as if the NAK is not received.

I think it makes sense to do these updates as part of the V2 process: https://github.com/lowRISC/opentitan/issues/18741

GregAC commented 1 year ago

That the host must explicitly clear the FMT FIFO if a NAK interrupt is raised, otherwise the IP will continue the transaction as if the NAK is not received.

The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.

a-will commented 1 year ago

@crteja To reproduce the issue, you'll have to override the calculations of timing variables. Once you get the right thigh and tlow, it'll appear:

engdoreis commented 1 year ago

The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.

I tested this workarroud and it does work. Here's what I've done:

Issue a write operation by writing the device address with read bit to FDATA.FBYTE and FDATA.START=1;
Check for a nak, if nak go back to 1;
Else, issue a read operation by writing FDATA.FBYTE=3, FDATA.STOP=1 and FDATA.READ=1.

This makes the i2c_host repeatdly write the device address in the bus. When the device finally respond an ACK, we see an NAK after the first byte read(0x74) and then the next 2 bytes are garbage as in the image below.

As referece this a successfull read using the current approach:

marnovandermaas commented 1 year ago

I had a look through the FSM, and I think it should only NAK if it has read FBYTE amount of bytes: https://github.com/lowRISC/opentitan/blob/d6409a6bb535e817a1372e19612794cf5d943af2/hw/ip/i2c/rtl/i2c_fsm.sv#L578

I'm not sure why it is ignoring the FBYTE value here (which should be 3), it would be useful to know the internal state of the FSM, namely the value of byte_index, byte_num and state_q.

a-will commented 1 year ago

The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.

I tested this workarroud and it does work. Here's what I've done:

Issue a write operation by writing the device address with read bit to FDATA.FBYTE and FDATA.START=1;

Check for a nak, if nak go back to 1;

Else, issue a read operation by writing FDATA.FBYTE=3, FDATA.STOP=1 and FDATA.READ=1.

This makes the i2c_host repeatdly write the device address in the bus. When the device finally respond an ACK, we see an NAK after the first byte read(0x74) and then the next 2 bytes are garbage as in the image below.

Uh oh, there's a glitch on the clock line in there, just before beginning the read phase. Time to have another look at the FSM...

The first byte actually came out of the device as expected, but that extra clock caused the device to see a NAK, making it think the read is done.

a-will commented 1 year ago

Argh, HostHoldBitAck -> PopFmtFifo -> Idle is problematic.

HostHoldBitAck brings the clock low, with the assumption that after an ACK, we'll always have something in the FIFO ready, but if there's nothing in the FIFO, we go to Idle, which releases SCL and causes a posedge. That's also a bug.

Probably, the decision with scl in "Idle" will depend on whether we have an ongoing transaction (from the trans_started flop).

@msfschaffner This means we don't have a software workaround.

msfschaffner commented 1 year ago

I see - that is problematic. Is there a way we could implement a minimal change that at least enables the SW workaround above? Or do we run into the same problem again that we would have to change the next state logic, which could potentially affect a large logic cone?

a-will commented 1 year ago

So I'll test this out, but most likely, this would work:

diff --git a/hw/ip/i2c/rtl/i2c_fsm.sv b/hw/ip/i2c/rtl/i2c_fsm.sv
index 9dea267e5..b0d28f621 100644
--- a/hw/ip/i2c/rtl/i2c_fsm.sv
+++ b/hw/ip/i2c/rtl/i2c_fsm.sv
@@ -473,11 +473,18 @@ module i2c_fsm import i2c_pkg::*;
     stretch_en = 1'b0;
     expect_stop = 1'b0;
     unique case (state_q)
-      // Idle: initial state, SDA and SCL are released (high)
+      // Idle: initial state, SDA is released (high), SCL is released if the
+      // bus is idle. Otherwise, if no STOP condition has been sent yet,
+      // continue pulling SCL low in host mode.
       Idle : begin
-        host_idle_o = 1'b1;
         sda_d = 1'b1;
-        scl_d = 1'b1;
+        if (host_enable_i && trans_started) begin
+          host_idle_o = 1'b0;
+          scl_d = 1'b0;
+        end else begin
+          host_idle_o = 1'b1;
+          scl_d = 1'b1;
+        end
       end
       // SetupStart: SDA and SCL are released
       SetupStart : begin

This would avoid modifying the next state logic. However, that may not be the ideal RTL change: From a code readability perspective, it would probably be better for these cases to proceed to Active instead of Idle (and for the Active state to wait for the next FMT FIFO entry).

msfschaffner commented 1 year ago

Thanks @a-will. It would be acceptable to make an intermediate fix for now, create an issue to track that and come back later to make a "nice" fix.

engdoreis commented 1 year ago

I tested the bitstream generated in the PR above using the same code used here which does:

Issue a write operation by writing the device address with read bit to FDATA.FBYTE and FDATA.START=1;
Check if host is idle (STATUS.HOSTIDLE).
Check for a NAK (INTR_STATE.NAK), if NAK go back to 1;
Else, issue a read operation by writing FDATA.FBYTE=3, FDATA.STOP=1 and FDATA.READ=1.

The fix works but with a small remark. After the step 1, the STATUS.HOSTIDLE doesn't go to 1 (I think this is an intentional change), however the step 2 can't be done any more, so I needed another way to check when it is safe to assume that INTR_STATE.NAK=0 means an ACK. So I used the STATUS.FMT_FIFO_EMPTY instead which worked.

So this behaviour should be documented.

a-will commented 1 year ago

@engdoreis I did intentionally change STATUS.HOSTIDLE to reflect that the i2c transaction was still in progress, and the IP is still holding SCL low with an empty FMT FIFO.

The change to wait for FMT FIFO empty reflects the better signal to observe for the workaround, since the doc explicitly shows an entry is removed only after it has been completely transmitted (i.e. in the FSM diagram). 😄

johngt commented 1 year ago

@engdoreis - could you prepare the documentation for this as a PR based on the comments above so that we can close this issue out? @luismarques + @mundaym for visibility.

engdoreis commented 1 year ago

Yes, I'll do it.

engdoreis commented 1 year ago

Documentation added in this PR, please review whenever you get a chance.

lowRISC / opentitan