Closed engdoreis closed 1 year ago
@johngt
Since this is a read transaction, it's the I2C host that is creating a NAK (which it must do at the end of its read transaction). That waveform looks like a perfectly good 2-byte read. Was that the intention?
I updated the description to make it clearer. So the waves are correct; the problem is that the nak bit in INTR_STATE goes high after this read transaction, meaning that the i2c detected a NAK from the device during the read transaction. That shouldn't happen, as the last NAK shown in the image is actually created by the host, as you mentioned, and this doesn't happen in standard and high-speed mode.
This is peculiar on a couple of levels, hehe.
Despite the IP registering a NAK, the transaction continues normally, when it should really stop and wait for software to clear the NAK condition. It turns out that there is nothing in the i2c FSM to prevent continuing with the next entry in the FIFO. Depending on how you look at it, that's a bug (alternatively, a significant performance impediment).
In addition, we might be seeing bad effects for especially small t_low and t_high times. There is latency in the data path, and I'm not sure we've established the minimum values to avoid conflicts. t_low=2 might be too small to accommodate that latency reliably.
Edit: Note also that the NAK event (for INTR_STATE) only occurs when the host side is driving the data bits. That means this happened during the command/address byte, and the IP failed to properly sample the low ACK bit following the transition from the high RNW bit.
With t_low=2, t_hd=1, we are in the HoldBit and ClockLowAck states for a single cycle each. The outputs are registered, so even though the HoldBit state nominally begins to drive SCL low, that doesn't take effect until the cycle after, when we hit the ClockLowAck state.
At the moment we enter the ClockLowAck state (+ board and pad delays), that is when the device may start driving an ACK. However, we also have a 2-flop synchronizer on the input side. So the ACK could only arrive 2 cycles after entering ClockLowAck at minimum. Since we're in ClockLowAck for only one cycle, we still see SDA high in ClockPulseAck, causing the NAK event to fire.
Yup, t_low=2 is unsupported. t_low=3 might provide enough room for our setup time, but that assumes I've accounted for everything up there.
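The cycle counting above can be sketched as a toy model. It is only an illustration under stated assumptions, not a description of the RTL: it assumes ClockLowAck lasts t_low - t_hd cycles (which matches the one-cycle figure given for t_low=2, t_hd=1), that the only input latency is the 2-flop synchronizer, and that board and pad delays can be folded into a hypothetical `extra_delay_cycles` parameter.

```python
# Toy model of the ACK sampling window. All names here are illustrative.
SYNC_STAGES = 2  # 2-flop synchronizer on the SDA input, as described in the thread


def clock_low_ack_cycles(t_low: int, t_hd: int) -> int:
    """Cycles spent in ClockLowAck (assumption: t_low minus the HoldBit cycles)."""
    return t_low - t_hd


def ack_visible_in_time(t_low: int, t_hd: int, extra_delay_cycles: int = 0) -> bool:
    """True if the device's ACK can reach the FSM before ClockPulseAck samples SDA."""
    needed = SYNC_STAGES + extra_delay_cycles
    return clock_low_ack_cycles(t_low, t_hd) >= needed


if __name__ == "__main__":
    for t_low in (2, 3, 4):
        print(f"t_low={t_low}, t_hd=1 -> ACK seen in time: {ack_visible_in_time(t_low, 1)}")
```

Under these assumptions t_low=2 fails and t_low=3 just passes, with no margin left for board or pad delay.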
If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks (>= 5 MHz)?
Regarding the fact that the transaction continues after the NAK: I noticed that too, and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is inconvenient behaviour; I think it should be fixed if possible.
> If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?
The problem is most accurately described by the number of clock cycles that make up each phase, but given a specific target frequency, yes, a high enough input frequency will satisfy the minimum divisor value. In this case, we haven't established whether t_high=2 would never surface an error, and that would be a requirement for a 5 MHz input to be enough for a 1 MHz output. For a nominal 50% duty cycle, 6 MHz would be the minimum (t_high == t_low == 3).
On the side, target mode's requirements would be even higher, since there needs to be tolerance for the host's timing not falling in convenient places for sampling. If 6 MHz is the minimum for host mode and target mode operates similarly, most likely something like 8 MHz would be the minimum for fast-plus speeds in target mode.
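The frequency figures above follow from simple arithmetic: one SCL period needs t_high + t_low input clock cycles. A minimal sketch, assuming rise/fall times and clock stretching are ignored (the function names are made up for illustration):

```python
# Minimum input clock for a target SCL frequency, given per-phase cycle counts.
FAST_MODE_PLUS_HZ = 1_000_000


def min_input_clock_hz(scl_hz: int, t_high_cycles: int, t_low_cycles: int) -> int:
    """Input clock needed so one SCL period spans t_high + t_low cycles."""
    return scl_hz * (t_high_cycles + t_low_cycles)


def cycles_per_scl_period(input_clock_hz: int, scl_hz: int) -> int:
    """Whole input clock cycles available per SCL period."""
    return input_clock_hz // scl_hz


if __name__ == "__main__":
    print(min_input_clock_hz(FAST_MODE_PLUS_HZ, 2, 3))  # 5000000: needs t_high=2 to be safe
    print(min_input_clock_hz(FAST_MODE_PLUS_HZ, 3, 3))  # 6000000: 50% duty cycle case
    print(cycles_per_scl_period(25_000_000, FAST_MODE_PLUS_HZ))  # 25
```

With a 25 MHz peripheral clock there are 25 cycles per 1 MHz SCL period, comfortably above the 6-cycle minimum discussed here.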
The key thing to establish/convince ourselves of here: will this work correctly for fast mode plus when we are running on the 25 MHz peripheral clock? From what I'm reading above, I believe the answer is yes.
Could be worth trying to do a one-off FPGA build that's just Ibex and an I2C instance so we can run at a higher clock speed and trial the fast mode plus.
@msfschaffner
You can use my fpga-24mhz-experiment branch as a base for your i2c code + the files for a "bitstreams cache entry" attached here.
Drop the 03af3772c directory into a standalone bazel cache (say, at ${HOME}/bitstream-experiments/cache/03af3772c), then use this bitstream by setting BAZEL_BITSTREAMS_CACHE and BITSTREAM:
BAZEL_BITSTREAMS_CACHE=${HOME}/bitstream-experiments/ BITSTREAM="--offline 03af3772c" bazel test ...
bazel will pick up that entry like any other cached bitstream and splice the test ROM or ROM as needed.
Note that this bitstream does not meet timing in one specific way: IO_CLK is too fast for SPI_HOST0 to meet the setup time at full blast. Everything else is good. So this bitstream is fine to use for I2C experiments with a 6 MHz peripheral clock, but don't use it for running SPI_HOST0 at 12 MHz. ;)
So it looks like two issues surfaced here:
1) Minimum supported t_low / t_high times. We can probably address this by updating the docs and updating the FPGA mapping so that the I2C runs at a higher peripheral frequency.
2) The NAK detection behavior, which seems problematic. It sounds like there is a SW workaround (just issue one read transaction at a time), but that could have a noticeable performance impact.
For 2) I think it would be good to analyze what the potential design impact would be, so that we can triage this as an ECO (depending on the impact, we may not be able to take it). @GregAC is that something you could look into?
Aye, for (2), the first workaround that came to mind was to treat the fmt fifo like it is only 1-deep. (though the read data phase could technically permit more entries, since the host supplies ACK / NAK)
> If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?
> Regarding the fact that the transaction continues after the NAK, I noticed that and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is an inconvenient behaviour, I think it should be fixed if possible.
Just to clarify here @engdoreis is this showing the NAK that's been detected and the erroneous continued transaction isn't on the trace?
> So it looks like two issues surfaced here:
> - minimum supported t_low / t_high times. we can probably address this by updating the docs and update the FPGA mapping so that the I2C runs at a higher peripheral frequency.
> - the NAK detection behavior, which seems problematic. it sounds like there is a SW workaround (just issue one read transaction at a time), but that could have a noticeable performance impact.
> For 2) I think it would be good to analyze what the potential design impact would be, so that we can triage this as an ECO (depending on the impact, we may not be able to take it). @GregAC is that something you could look into?
I have tried to reproduce the issue using i2c_host_perf_seq.sv. In all three supported modes, there is no NAK interrupt. In Host mode, the IP sends a NACK and then a STOP, similar to the waves in the issue, except there is no NAK interrupt.
In the RTL, the NAK interrupt is asserted in only one state (the one used to detect ACK/NACK):
https://github.com/lowRISC/opentitan/blob/cebbe1809bc409850f2bf563fab8c3dd96ccfc0c/hw/ip/i2c/rtl/i2c_fsm.sv#L531-L540
Regarding the NAK behaviour, this is the way the I2C FSM is implemented. The IP doesn't issue a STOP if there is a NAK from the device; the NAK interrupt is raised, but the IP will still send transactions out based on the FMT FIFO. It is the host's responsibility to clear the FIFO in case of a NAK and restart the transaction.
Waves:
note from Spec:
This looks to be more a specification/programming model bug than an actual RTL bug.
Perhaps the author and some reviewers felt it was implicit/obvious that the NAK interrupt should cause the FSM to stop processing further items from the FIFO, but that's not explicitly stated in the spec.
Producing an ECO to alter this behaviour could also be complex. Whilst doing something that causes the FSM to stop processing following the NAK may be easy there's a question of what starts it up again? Perhaps we'd raise the NAK interrupt and clear the FIFO? Might be feasible with a few gates depending on whether there's an identifiable FIFO clear signal in the netlist that can be tapped into.
> > If I understand it correctly, this NAK detection problem won't happen with higher peripheral clocks ( >= 5Mhz)?
> > Regarding the fact that the transaction continues after the NAK, I noticed that and here is a case where the device returns a NAK to the address byte, the host successfully detects it, but continues the transaction anyway. This is an inconvenient behaviour, I think it should be fixed if possible.
>
> Just to clarify here @engdoreis is this showing the NAK that's been detected and the erroneous continued transaction isn't on the trace?
No, two separate things.
First, the IP thought a NAK occurred when its FSM checked the response to the address / command byte. Due to latency in the path, it checked a delayed sample of its RNW bit, not the actual ACK that was on the bus, working its way through a prim_flop_2sync.
Then, the IP behaved improperly relative to its own internal notion of the state. It thought it received a NAK, so the transaction ought not to continue, but it just kept going. Thus, the trace doesn't show any indication something went wrong internally.
Thanks for the clarification @a-will though looking at the trace from @engdoreis's screen shot that doesn't show an instance of a falsely identified NAK? It's an actual NAK (which is the interpretation from the logic analyser as well as my reading of the trace) followed by more transactions (the I2C just clocking in expectation of bytes to read that don't appear).
What's your view on what the I2C documentation says vs the RTL behaviour, the details are here https://opentitan.org/book/hw/ip/i2c/doc/theory_of_operation.html#byte-formatted-programming-mode?
With regards to
> It thought it received a NAK, so the transaction ought not to continue, but it just kept going.

I agree, in that the NAK should stop the transaction and raise the interrupt so the software can decide what to do. But the documented behaviour doesn't specify anything around NAK behaviour, and a fix here would involve updating the documentation and programming model.
@moidx - could you speak with customer side to see if the case of NACKs being reported is going to be an issue. There are SW workarounds that will impact performance.
@GregAC, I posted two traces. The first (in the issue description) shows the i2c host receiving an ACK, but it erroneously flagged a NAK in the INTR_STATE.nak register. If you execute the test, you'll see software retrying until timeout.
The second trace here shows the i2c host receiving a NAK, in this case it correctly flagged in the INTR_STATE.nak register. But instead of stopping at the addressing stage, it kept going as pointed out by @a-will.
> Thanks for the clarification @a-will though looking at the trace from @engdoreis's screen shot that doesn't show an instance of a falsely identified NAK? It's an actual NAK (which is the interpretation from the logic analyser as well as my reading of the trace) followed by more transactions (the I2C just clocking in expectation of bytes to read that don't appear).
Oops, I didn't pay attention to which trace you quoted, haha. Sorry about that! I blurted out an explanation of the first trace, but you grabbed the second.
> What's your view on what the I2C documentation says vs the RTL behaviour, the details are here https://opentitan.org/book/hw/ip/i2c/doc/theory_of_operation.html#byte-formatted-programming-mode?
> With regards to
> > It thought it received a NAK, so the transaction ought not to continue, but it just kept going.
>
> I agree, in that the NAK should stop the transaction and raise the interrupt so the software can decide what to do but the documented behaviour doesn't specify anything around NAK behaviour and a fix here would involve updating the documentation and programming model.
The documentation shows we didn't totally think through how this FIFO would interact with responses outside the happy path. This has been a common refrain for the i2c IP, and when I last looked at the FSM while reviewing updates for target mode, I missed this myself. If the FSM proceeds past all negative responses, there is little point in having a FIFO in the first place, hehe.
So agreed, we'll need to fix the documentation as well. :)
Ok, so from my understanding, we need to update the documentation to say two extra things:
I think it makes sense to do these updates as part of the V2 process: https://github.com/lowRISC/opentitan/issues/18741
That the host must explicitly clear the FMT FIFO if a NAK interrupt is raised, otherwise the IP will continue the transaction as if the NAK is not received.
The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.
@crteja To reproduce the issue, you'll have to override the calculations of the timing variables. Once you get the right thigh and tlow, it'll appear:
> The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.

I tested this workaround and it does work. Here's what I've done:
1. Issue a write operation by writing the device address with read bit to FDATA.FBYTE and FDATA.START=1;
2. Check for a NAK; if there is a NAK, go back to 1;
3. Else, issue a read operation by writing FDATA.FBYTE=3, FDATA.STOP=1 and FDATA.READ=1.

This makes the i2c_host repeatedly write the device address on the bus. When the device finally responds with an ACK, we see a NAK after the first byte read (0x74), and then the next 2 bytes are garbage, as in the image below.
For reference, this is a successful read using the current approach:
I had a look through the FSM, and I think it should only NAK if it has read FBYTE amount of bytes: https://github.com/lowRISC/opentitan/blob/d6409a6bb535e817a1372e19612794cf5d943af2/hw/ip/i2c/rtl/i2c_fsm.sv#L578
I'm not sure why it is ignoring the FBYTE value here (which should be 3). It would be useful to know the internal state of the FSM, namely the values of byte_index, byte_num and state_q.
> > The issue (as I understand it) is the FSM may have begun processing the next item in the FSM before the NAK can be dealt with by software so if you need to consider NAKs you can't load up the FIFO with actions following the potential NAK.
>
> I tested this workaround and it does work. Here's what I've done:
> 1. Issue a write operation by writing the device address with read bit to FDATA.FBYTE and FDATA.START=1;
> 2. Check for a NAK; if there is a NAK, go back to 1;
> 3. Else, issue a read operation by writing FDATA.FBYTE=3, FDATA.STOP=1 and FDATA.READ=1.
>
> This makes the i2c_host repeatedly write the device address on the bus. When the device finally responds with an ACK, we see a NAK after the first byte read (0x74), and then the next 2 bytes are garbage, as in the image below.
Uh oh, there's a glitch on the clock line in there, just before beginning the read phase. Time to have another look at the FSM...
The first byte actually came out of the device as expected, but that extra clock caused the device to see a NAK, making it think the read is done.
Argh, HostHoldBitAck -> PopFmtFifo -> Idle is problematic.
HostHoldBitAck brings the clock low, with the assumption that after an ACK we'll always have something in the FIFO ready. But if there's nothing in the FIFO, we go to Idle, which releases SCL and causes a posedge. That's also a bug.
Probably, the decision on scl in Idle will depend on whether we have an ongoing transaction (from the trans_started flop).
@msfschaffner This means we don't have a software workaround.
I see - that is problematic. Is there a way we could implement a minimal change that at least enables the SW workaround above? Or do we run into the same problem again that we would have to change the next state logic, which could potentially affect a large logic cone?
So I'll test this out, but most likely, this would work:
diff --git a/hw/ip/i2c/rtl/i2c_fsm.sv b/hw/ip/i2c/rtl/i2c_fsm.sv
index 9dea267e5..b0d28f621 100644
--- a/hw/ip/i2c/rtl/i2c_fsm.sv
+++ b/hw/ip/i2c/rtl/i2c_fsm.sv
@@ -473,11 +473,18 @@ module i2c_fsm import i2c_pkg::*;
stretch_en = 1'b0;
expect_stop = 1'b0;
unique case (state_q)
- // Idle: initial state, SDA and SCL are released (high)
+ // Idle: initial state, SDA is released (high), SCL is released if the
+ // bus is idle. Otherwise, if no STOP condition has been sent yet,
+ // continue pulling SCL low in host mode.
Idle : begin
- host_idle_o = 1'b1;
sda_d = 1'b1;
- scl_d = 1'b1;
+ if (host_enable_i && trans_started) begin
+ host_idle_o = 1'b0;
+ scl_d = 1'b0;
+ end else begin
+ host_idle_o = 1'b1;
+ scl_d = 1'b1;
+ end
end
// SetupStart: SDA and SCL are released
SetupStart : begin
This would avoid modifying the next state logic. However, that may not be the ideal RTL change: From a code readability perspective, it would probably be better for these cases to proceed to Active instead of Idle (and for the Active state to wait for the next FMT FIFO entry).
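The effect of the proposed Idle change on SCL can be illustrated with a small behavioural model. This is only a sketch of the output logic shown in the diff, written in Python for quick experimentation; the function names are made up, and it models nothing beyond the Idle-state outputs.

```python
# Behavioural sketch of the Idle-state outputs before and after the proposed diff.
def idle_outputs_old() -> dict:
    """Original Idle: SDA and SCL are always released (high)."""
    return {"host_idle_o": 1, "sda_d": 1, "scl_d": 1}


def idle_outputs_new(host_enable: bool, trans_started: bool) -> dict:
    """Proposed Idle: keep SCL low while a host transaction is still open."""
    if host_enable and trans_started:
        return {"host_idle_o": 0, "sda_d": 1, "scl_d": 0}
    return {"host_idle_o": 1, "sda_d": 1, "scl_d": 1}


def causes_scl_posedge(prev_scl: int, outputs: dict) -> bool:
    """True if entering Idle releases SCL after the previous state drove it low."""
    return prev_scl == 0 and outputs["scl_d"] == 1


if __name__ == "__main__":
    # HostHoldBitAck has pulled SCL low; the FMT FIFO is empty, so the FSM lands in Idle.
    print(causes_scl_posedge(0, idle_outputs_old()))            # True: the glitch
    print(causes_scl_posedge(0, idle_outputs_new(True, True)))  # False: SCL held low
```

Once a STOP has been sent and trans_started is cleared, the new logic falls back to the original released-bus outputs.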
Thanks @a-will. It would be acceptable to make an intermediate fix for now, create an issue to track that and come back later to make a "nice" fix.
I tested the bitstream generated in the PR above using the same code used here, in which step 2 waits on STATUS.HOSTIDLE.
The fix works, but with a small remark.
After step 1, STATUS.HOSTIDLE doesn't go to 1 (I think this is an intentional change), so step 2 can't be done any more. I needed another way to check when it is safe to assume that INTR_STATE.NAK=0 means an ACK, so I used STATUS.FMT_FIFO_EMPTY instead, which worked.
So this behaviour should be documented.
@engdoreis I did intentionally change STATUS.HOSTIDLE to reflect that the i2c transaction was still in progress, and the IP is still holding SCL low with an empty FMT FIFO.
The change to wait for FMT FIFO empty reflects the better signal to observe for the workaround, since the doc explicitly shows an entry is removed only after it has been completely transmitted (i.e. in the FSM diagram). 😄
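The polling order described above (address byte, wait for FMT FIFO empty, check NAK, then queue the read) can be sketched against a mock. Everything in this block is hypothetical: `FakeI2cHost` is a stand-in invented for illustration, not the OpenTitan DIF or register interface, and the returned bytes are placeholders. It only demonstrates the control flow of the workaround.

```python
class FakeI2cHost:
    """Hypothetical mock: NAKs the address a few times, then ACKs."""

    def __init__(self, naks_before_ack: int):
        self.naks_left = naks_before_ack
        self.intr_nak = False        # stands in for INTR_STATE.NAK
        self.fmt_fifo = []           # queued FDATA entries
        self.rx = []                 # bytes received by read entries
        self.address_attempts = 0

    def write_fdata(self, fbyte, start=False, stop=False, read=False):
        self.fmt_fifo.append((fbyte, start, stop, read))

    def fmt_fifo_empty(self) -> bool:
        """Stands in for STATUS.FMT_FIFO_EMPTY; each poll lets the mock process one entry."""
        if self.fmt_fifo:
            fbyte, start, _stop, read = self.fmt_fifo.pop(0)
            if read:
                self.rx.extend(0xA0 + i for i in range(fbyte))  # placeholder data
            elif start:
                self.address_attempts += 1
                if self.naks_left > 0:
                    self.naks_left -= 1
                    self.intr_nak = True
        return not self.fmt_fifo


def read_with_nak_retry(i2c, addr7: int, nbytes: int, max_attempts: int = 10):
    """Queue one FMT FIFO entry at a time so a NAK can be handled before the read."""
    for _ in range(max_attempts):
        i2c.write_fdata((addr7 << 1) | 1, start=True)  # address byte with read bit
        while not i2c.fmt_fifo_empty():                # wait until the entry is consumed
            pass
        if i2c.intr_nak:
            i2c.intr_nak = False                       # clear the NAK and retry
            continue
        i2c.write_fdata(nbytes, stop=True, read=True)  # FBYTE = number of bytes to read
        while not i2c.fmt_fifo_empty():
            pass
        return list(i2c.rx)
    raise TimeoutError("device kept NAKing the address byte")


if __name__ == "__main__":
    host = FakeI2cHost(naks_before_ack=2)
    print(read_with_nak_retry(host, 0x40, 3), host.address_attempts)
```

The key design point is that the read entry is never queued until the address entry has left the FMT FIFO and the NAK flag has been checked.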
@engdoreis - could you prepare the documentation for this as a PR based on the comments above so that we can close this issue out? @luismarques + @mundaym for visibility.
Yes, I'll do it.
Description
Introduction
The I2C host is erroneously detecting a NAK when communicating with the HDC1080 temperature sensor in Fast-mode Plus (1 MHz). In the image below, the host sends the address of the device (0x40) with the read bit and receives an ACK. However, the NAK IRQ goes high (INTR_STATE.nak), indicating that a NAK was detected.
This could be related to the fact that the i2c is configured for 1 MHz but the SCL line is actually at 277 kHz, as shown in the image below. This is described in issue #18492.
How to reproduce:
Run the test i2c_host_hdc1080_humidity_temp_test with the command:
Output
TIMEOUT: //sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom (Summary)
/home/doreis/.cache/bazel/_bazel_doreis/71b02480bfcfe3a0741719bf45ac42d1/execroot/lowrisc_opentitan/bazel-out/k8-fastbuild/testlogs/sw/device/tests/i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom/test.log
Aspect @rules_rust//rust/private:clippy.bzl%rust_clippy_aspect of //sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom up-to-date (nothing to build)
INFO: Elapsed time: 60.488s, Critical Path: 60.18s
INFO: 2 processes: 2 local.
INFO: Build completed, 1 test FAILED, 2 total actions
//sw/device/tests:i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom TIMEOUT in 60.1s
/home/doreis/.cache/bazel/_bazel_doreis/71b02480bfcfe3a0741719bf45ac42d1/execroot/lowrisc_opentitan/bazel-out/k8-fastbuild/testlogs/sw/device/tests/i2c_host_hdc1080_humidity_temp_test_fpga_cw310_test_rom/test.log
INFO: Build completed, 1 test FAILED, 2 total actions