lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.56k stars 762 forks source link

[spi_device] SDC updates & Enhance pass-through rates for wider reads by pipelining the return path #11718

Closed a-will closed 8 months ago

a-will commented 2 years ago

Add an optional pipeline stage or two to the return path for passthrough reads, and allow read commands to choose whether to use that path or the combinatorial one. Default to the combinatorial path for the first cycle and prepare for the mux selection change. In other words, for the data path it's just...

// read_data must at least have the mailbox and passthrough contributing, depending on the address
// Alternatively, use separate paths for them and have the mailbox side count out more dummy cycles.
// (Not familiar with the current architecture and whether one is easier than the other)
assign data_to_host = is_pipelined_read ? pipelined_data : read_data;
always_ff (posedge sck) begin
  pipelined_data <= read_data;
end

..with is_pipelined_read coming from cmd_info tables.

For wider read opcodes (dual read, quad read), dummy cycles are part of the protocol, and during those dummy cycles, the host ignores the signal from the device. The SFDP advertises the number for each command, and firmware can take the value advertised by the SPI flash for dual read and quad read, add the pipeline latency to it, and advertise the new value in its SFDP. Then, the host behaves as though that were the true read latency all along and sends the needed additional SCK cycles.

This effectively cuts in half the round-trip path. The forward delay from the host to OT's passthrough I/Os is absorbed by the pipeline stage (since it is sampling read data), and timing for the command is reduced to an analysis between OT's passthrough/spi_host interface and the SPI flash (with the original host still contributing some potential skew, of course). The second hop of the return path is just the output of a pipeline stage back to the host, so that interface looks like the host is communicating directly with OT.

The expectation is that this will greatly increase the maximum read clock rate for dual read and quad read, while minimally slowing down normal read and possibly fast read (due to one extra logic level from the mux). The thinking here is that devices needing higher throughput will use faster interfaces, and focusing on speeding those commands up will yield the greatest benefit.

This may also be possible for fast read (1-1-1), but 8 dummy cycles may be the expectation there. Normal read must continue to use the combinatorial path. Also, the SPI flash device sees extra SCK cycles, but we simply drop the data the host didn't ask for.

CC @tjaychen @eunchan (Just logging the idea here, since we've randomly had discussions and can forget)

estimate 24 remaining 2023-03-23 24

eunchan commented 2 years ago

Nice. Thanks for the issue. Definitely worth it in Fast read mode.

eunchan commented 2 years ago

I thought about this. Before moving forward this idea, would you mind let me know it is possible to control the # of dummy cycles on the host system?

I checked a few flash devices' datasheets. The granularity of the dummy cycle is not one cycle. Some devices only support byte granularity.

If host system allows 9 dummy cycles, for the Fast Read, 5 cycles for the Fast Read Dual, 3 cycles for the Fast Read Quad, SPI_DEVICE would support this feature by allowing longer dummy cycles and mux for the pipelined read data as suggested.

Currently, SPI_DEVICE only supports up to 8 dummy cycles.

a-will commented 2 years ago

I thought about this. Before moving forward this idea, would you mind let me know it is possible to control the # of dummy cycles on the host system?

I checked a few flash devices' datasheets. The granularity of the dummy cycle is not one cycle. Some devices only support byte granularity.

If host system allows 9 dummy cycles, for the Fast Read, 5 cycles for the Fast Read Dual, 3 cycles for the Fast Read Quad, SPI_DEVICE would support this feature by allowing longer dummy cycles and mux for the pipelined read data as suggested.

Currently, SPI_DEVICE only supports up to 8 dummy cycles.

From a compatibility perspective, we might want to prioritize quad SPI and make it 2 pipeline registers. As you mentioned, a lot of generic SPI controllers can only handle a granularity of 8 bits, so the dummy cycles would need to add up to a multiple of 8 bits somehow.

Assuming it is 2 pipeline registers, for dual SPI and fast read, the feature's utility would be contingent on either a more flexible SPI host controller or a SPI flash device that allows tuning dummy cycles to compensate (e.g. various Micron or Macronix parts).

andreaskurth commented 1 year ago

Triaged for spi_device. Assigning to M2.5 with https://github.com/lowRISC/opentitan/labels/Priority%3AP2 because we should confirm with the product team that current pass-through rates suffice. If they don't, we need to prioritize this. If they do, we can defer this to https://github.com/lowRISC/opentitan/labels/Type%3AFutureRelease.

hcallahan-lowrisc commented 1 year ago

Even if the RTL changes may be simple here, I suspect the DV and coverage work will be not trivial.

estimate 24 remaining 2023-03-23 24

andreaskurth commented 1 year ago

Yes, and to update on the status: the downstream teams are discussing this and their feedback is pending

johngt commented 1 year ago

Please note the response in https://docs.google.com/spreadsheets/d/12gMTUi42Hu78eACrE0cShAsIPchuwOodhu1IkNfZJoE/edit#gid=998209441 QO-0010 Relevant to this issue; 1. Enable SDC constraints for passthrough (open source) @hcallahan-lowrisc / @andreaskurth - I'm assuming this means we should increase the priority of this issue for now?

moidx commented 1 year ago

Hold off from starting changes related to this issue until we have timing information on the current instantiation of the design, as well as DD and DV estimates.

moidx commented 1 year ago

Discussed during triaging meeting. The plan is to postpone any RTL optimizations post M2.5. Leaving the issue open to track the addition of SDC constraints so that we can complete the timing analysis.

Updating the estimate to 4 to track the SDC / synthesis iterations.

remaining 2023-04-13 4

johngt commented 1 year ago

@moidx - who will be working on the SDC constraints? Can you please assign the appropriate person? TIA

moidx commented 1 year ago

Assigning to @msfschaffner. I believe this should not be blocking for M2.5.1 as there won't be time to make any RTL changes.

@msfschaffner feel free to lower priority of switch to backlog. Thanks.

msfschaffner commented 1 year ago

@a-will and I will be doing that in the coming week or two. Let's track for M2.5.1 for now, but keep it as a P2 (we can defer later down the road if needed).

johngt commented 1 year ago

Per effort estimation exercise - the RTL part of this is deferred. Capturing that addition of SDC constraints was meant to be tracked here.

moidx commented 1 year ago

Targeting for M2.5.1-RC1. This is trending 2023-05-18.

johngt commented 1 year ago

@msfschaffner - probably need to check if this is on schedule?

a-will commented 1 year ago

FYI: #18589 addresses the SDC component. However, the SDC part probably ought to have gone into a separate tracking issue, as the original idea might still be useful for a future release. ;)

msfschaffner commented 1 year ago

We can move this issue to the M2.5 Backlog, since the SDC updates are tracked as part of #18274 now.