Thanks for the detailed report @arjanbink! I have seen this loop before but apparently I didn't file a bug for it, thanks for doing that!
We also don't see this loop in Verilator lint at the moment because we don't connect the bus to ibex, that's something we should look into as well. (But probably not as part of this issue.)
It is not a loop inside Ibex (but it could become a loop at system level)
Thanks for your detailed report @arjanbink , the timing diagram is really helpful!
I think we should change this. It basically means that the LSU needs to push out the second request before receiving the first RVALID_I. In the shown example, the first RVALID_I is received together with the second GNT_I. Handling this case seems not too difficult to me. But I think we should also support the case where the second GNT_I comes before the first RVALID_I. Properly handling errors in this case requires quite some changes in the LSU FSM though.
For the instruction interface, it is probably more difficult as the whole interface is still much more convoluted.
I also noted that the second transfer in a misaligned load/store is suppressed in case the first transfer receives an error. In itself that is fine of course, but I don't think it is mandated by any specification, so this part could be simplified by getting rid of the 'second transfer suppression in case of an error' (the performance impact would be a don't care, as this should be an exceptional scenario).
@vogelpi , if the core does not wait for the first valid before sending the second request, we can potentially get out-of-order memory transactions. Imagine a load to an APB space that may take 3/4 cycles, followed by a single-cycle load to the scratchpad, for instance. So the second grant should not be received before the first valid (i.e. the second request should wait for the valid).
I agree with @arjanbink that this may limit the performance, but my personal opinion is that this should be handled by a wrapper that simply breaks the paths (a penalty in cycles, but a gain in max frequency).
For instance (just to think of an easy example), one can hardwire the grant from outside to 1 and register the valid from the slave. This means that the core would see a VALID signal coming directly from a flip-flop and the grant as a constant, thus nothing critical in timing; a sketch of this idea follows below.
Of course this solution delays every VALID by 1 cycle, but it avoids changing the core/bus to support out-of-order memory transactions. Also, removing the dependency of the core request on the valid just moves the complexity from the core to the outside. Just my on-the-fly thoughts :)
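A minimal sketch of that wrapper idea (module and signal names are hypothetical, not taken from the Ibex RTL, and the slave is assumed to always be able to accept a request, e.g. a single-cycle scratchpad):

```systemverilog
// Hypothetical path-breaking wrapper: tie the core-side grant high and
// register the slave's response. The core then sees GNT as a constant and
// RVALID directly out of a flip-flop, at the cost of one extra cycle of
// response latency. Assumes the slave can accept every request it is given.
module obi_break_path (
  input  logic        clk_i,
  input  logic        rst_ni,
  // Core side
  input  logic        core_req_i,
  output logic        core_gnt_o,
  output logic        core_rvalid_o,
  output logic [31:0] core_rdata_o,
  // Slave side
  output logic        slave_req_o,
  input  logic        slave_rvalid_i,
  input  logic [31:0] slave_rdata_i
);
  // Grant is hardwired to 1: nothing timing-critical on this path.
  assign core_gnt_o  = 1'b1;
  assign slave_req_o = core_req_i;

  // Register the response: no combinational path from slave to core.
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      core_rvalid_o <= 1'b0;
      core_rdata_o  <= '0;
    end else begin
      core_rvalid_o <= slave_rvalid_i;
      core_rdata_o  <= slave_rdata_i;
    end
  end
endmodule
```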
Let me know what you think, as the same behaviour is in RISCY too.
@vogelpi :

> But I think we should also support the case if the second GNT_I comes before the first RVALID_I.
Agreed with that. The way I see it, there is an address/write-data channel with a REQ and GNT, where a transfer is accepted on each rising edge in which both REQ and GNT are high (the order of REQ and GNT should be a don't care, but REQs must not be retracted). And then there is a read-data/response channel in which an RVALID (and read data in case of a read) may happen at the earliest in the cycle after a granted request. Ideally there should be no further assumptions; the sketch below captures these rules as assertions.
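For reference, here is a sketch of those channel rules written as SVA properties (a hypothetical standalone checker; the signal names follow the discussion, not the actual RTL):

```systemverilog
// Hypothetical protocol checker for the rules sketched above.
module obi_protocol_checker (
  input logic clk_i,
  input logic rst_ni,
  input logic req_i,
  input logic gnt_i,
  input logic rvalid_i
);
  // Outstanding transfers: +1 on an accepted request (REQ && GNT),
  // -1 on a response (RVALID).
  int unsigned outstanding;
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) outstanding <= 0;
    else outstanding <= outstanding + ((req_i && gnt_i) ? 1 : 0)
                                    - (rvalid_i ? 1 : 0);
  end

  // A REQ must not be retracted before it has been granted.
  assert property (@(posedge clk_i) disable iff (!rst_ni)
    req_i && !gnt_i |=> req_i);

  // RVALID may come at the earliest one cycle after the granted request,
  // i.e. only while a transfer is outstanding.
  assert property (@(posedge clk_i) disable iff (!rst_ni)
    rvalid_i |-> outstanding > 0);
endmodule
```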
@davideschiavone I see the RISC-V interface as a point-to-point link (please correct me if I am wrong) and as such there should be no out-of-order transactions on such a link. Of course, with my proposal multiple transactions can now overlap in ways not previously possible (but that is exactly what we need for an efficient mapping to AHB). The 'out of order' topic is a concern for an external module (e.g. a bus matrix) which forwards transactions to multiple slaves with different latencies. I would consider it the responsibility of such an external module to ensure that, on its point-to-point interface to the RISC-V, the transactions are still in the correct order. One simple way to do that is to not forward a 2nd request to a slave before a 1st request has fully completed (i.e. RVALID = 1), which would re-introduce the unwanted path. Alternatively, the module can track the order of granted slaves so that received RVALIDs can be forwarded to the RISC-V in the same order; this would neither incur unwanted combinational paths nor add a cycle of latency on RVALID. A sketch of that order-tracking approach follows below.
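A rough sketch of the order-tracking idea for a two-slave demux (all names hypothetical, and simplified by assuming at most two outstanding requests in total, so one response buffer per slave suffices):

```systemverilog
// Hypothetical response-reorder stage: a depth-2 FIFO records which slave
// each granted request went to; responses are released to the core strictly
// in that grant order. The upstream logic must not grant a new request
// while two are already outstanding (cnt_q == 2).
module resp_reorder (
  input  logic        clk_i,
  input  logic        rst_ni,
  input  logic        gnt_i,         // core request accepted this cycle
  input  logic        sel_i,         // destination slave of that request
  input  logic        rvalid_i [2],  // per-slave response valid
  input  logic [31:0] rdata_i  [2],
  output logic        rvalid_o,      // in-order response towards the core
  output logic [31:0] rdata_o
);
  logic [1:0]  fifo_q;               // grant-order FIFO of slave selects
  logic [1:0]  cnt_q;                // number of outstanding requests
  logic        buf_v_q [2];          // buffered out-of-order responses
  logic [31:0] buf_d_q [2];

  wire head    = fifo_q[0];
  wire buf_hit = (cnt_q != 2'd0) && buf_v_q[head];
  wire byp_hit = (cnt_q != 2'd0) && !buf_v_q[head] && rvalid_i[head];

  // Head-of-order responses bypass the buffer (no added latency); only a
  // response that overtakes an older one is parked until its turn.
  assign rvalid_o = buf_hit || byp_hit;
  assign rdata_o  = buf_hit ? buf_d_q[head] : rdata_i[head];

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      cnt_q      <= 2'd0;
      fifo_q     <= 2'b00;
      buf_v_q[0] <= 1'b0;
      buf_v_q[1] <= 1'b0;
    end else begin
      // Buffer any response that is not forwarded to the core this cycle.
      for (int s = 0; s < 2; s++) begin
        if (rvalid_i[s] && !(rvalid_o && (int'(head) == s))) begin
          buf_v_q[s] <= 1'b1;
          buf_d_q[s] <= rdata_i[s];
        end
      end
      if (buf_hit) buf_v_q[head] <= 1'b0;
      // Order FIFO: push on grant, pop when a response is handed over.
      case ({gnt_i, rvalid_o})
        2'b10: begin fifo_q[cnt_q[0]] <= sel_i; cnt_q <= cnt_q + 2'd1; end
        2'b01: begin fifo_q[0] <= fifo_q[1];    cnt_q <= cnt_q - 2'd1; end
        2'b11: begin
          if (cnt_q == 2'd1) fifo_q[0] <= sel_i;
          else begin fifo_q[0] <= fifo_q[1]; fifo_q[1] <= sel_i; end
        end
        default: ;
      endcase
    end
  end
endmodule
```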
If you want I can more completely spec out a proposal for RISC-V bus interface modifications showing examples of what we talked about above.
Thanks @davideschiavone for sharing your thoughts and @arjanbink for getting back. I think we all agree that it is not the responsibility of RI5CY/Ibex (especially since they are relatively simple cores) to handle out-of-order responses. It is perfectly fine to mandate that responses are returned from downstream in the same order as the requests were granted. It is the responsibility of the module that connects the in-order core interface to a possible out-of-order interconnect to ensure correct ordering of the responses at the core interface.
I think this is pretty standard and can be solved for example as follows:
In an AXI4-based interconnect, in-order responses can be guaranteed by letting the protocol converter issue both AXI transactions with the same ID. The interconnect is then not allowed to reorder them.
For the particular example you mentioned @davideschiavone , i.e., if the first load goes to APB space and the second to a faster scratchpad (I assume you are thinking specifically about PULP here), the DEMUX would need to ensure correct ordering of the responses. The simplest solution I can think of would be to not accept the second request before the first one has finished (alternatively: to not accept scratchpad requests while the slower interface still has an outstanding response). This may sound like a bad decision for performance, but as far as I remember
What do you think? Am I missing something?
Best regards, Pirmin
Fully agreed
Dear @vogelpi and @arjanbink .
I think we all agree that the out-of-order handling should be moved to the bus, and all your examples and solutions make sense to me. I don't see any theoretical problem with moving the complexity of preventing further requests before the first one has finished (RVALID=1) to the interconnect, as @arjanbink was proposing.
In practical terms, however, this would require extensive verification in all those existing systems (PULP to start) that assume that this "handling" is done by the core. It requires changing buses/protocols and, above all, a lot of verification. Plus, it may also be necessary to change all the masters that behave like RISCY/Ibex in PULP.
That said, to me it makes sense and it is definitely possible. But what about the efforts mentioned above? Moreover, there are more systems outside PULP that use RISCY, so this may create a bit of a mess outside :)
That is why I was suggesting a wrapper to break the paths, even though it is not efficient, as pointed out by @arjanbink .
We may want to take this decision with extreme care involving the main designers of the PULP systems and the external public users like the Ibex team, @arjanbink 's team etc.
Do you agree that this is a big change in terms of system implications or am I missing something?
Just to state it again, I am sure the modification makes sense as the "efficiency" is left to the bus instead of to the core, which makes the IP more flexible.
Hi @davideschiavone ,
Agreed that the biggest effort would be verification.
For existing systems, however, you could still achieve the current behavior with the changed Ibex/RI5CY by using an adapter that reintroduces the combinational path between REQ_O and RVALID_I just outside the core (or inside the core, based on some parameter, to offer backward compatibility). The adapter would only let a REQ pass through when the preceding REQ has seen its RVALID; a sketch follows below. Existing other masters in your system would therefore not necessarily need to change. If you go this route then of course no timing issues are eased at your system level, but also no system-level changes are required (other than the added adapter to keep the original behavior). The downside is of course that the system would not use the modified Ibex/RI5CY as we would intend to use it (if a system was built from scratch).
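A sketch of such an adapter, using hypothetical module and signal names (the OR with bus_rvalid_i deliberately re-creates the original RVALID_I -> REQ_O combinational path, now outside the core):

```systemverilog
// Hypothetical backward-compatibility adapter: a REQ only passes through
// once the preceding granted request has seen its RVALID, restoring the
// original single-outstanding behavior outside the core.
module obi_single_outstanding (
  input  logic clk_i,
  input  logic rst_ni,
  // Core side
  input  logic core_req_i,
  output logic core_gnt_o,
  // Bus side
  output logic bus_req_o,
  input  logic bus_gnt_i,
  input  logic bus_rvalid_i
);
  logic pending_q;  // a granted request is still awaiting its response

  // The OR with bus_rvalid_i deliberately re-creates the combinational
  // RVALID -> REQ path, so behavior matches the unmodified core exactly.
  assign bus_req_o  = core_req_i && (!pending_q || bus_rvalid_i);
  assign core_gnt_o = bus_gnt_i  && (!pending_q || bus_rvalid_i);

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)                     pending_q <= 1'b0;
    else if (bus_req_o && bus_gnt_i) pending_q <= 1'b1;
    else if (bus_rvalid_i)           pending_q <= 1'b0;
  end
endmodule
```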
Hello @arjanbink , yes, the wrapper may solve this problem as you said.
However, isn't there a solution that could be implemented outside the core to solve your issue? Maybe by playing with the grant and valid. Can I see the scenario you would like to optimize by removing the dependency?
Probably a description of the protocol with the corresponding mapping to grant and valid would help.
Thanks a lot Davide
Hi @davideschiavone ,
The scenario I am trying to achieve is shown in figure 1 of the pdf file at the top of this issue report. We would like to enable an external adapter to AHB that neither causes pipelining of address-phase info or data-phase signals, nor introduces combinational paths from HREADYOUT to HTRANS (or any other address-phase signal). Also note that on AHB it is not legal to change address-phase info during waited NSEQ cycles (so during cycle 3 in the figure). (I could draw a similar figure for AXI if needed.) I don't see how we can get to my preferred scenario without removing the stated RVALID_I -> REQ_O path.
Also, please note that I am not necessarily looking for a quick fix. I realize that what I am looking for might not be easy or low effort, so I am totally fine if this becomes a longer-term change (as opposed to no change because no quick solution is foreseen).
Hi Arjan - we'd appreciate your review in #276 which works towards resolving this issue.
Hi, thanks for the proposed fix; I am studying the updated RTL now and will provide feedback asap
Hi Arjan - please take a look at #282 as well.
Hi,
I will look into it today.
Best regards, Arjan
Hi,
I downloaded the new code, synthesized it, and can confirm that the reported combinational paths are now gone. I looked at the RTL changes as well and didn't notice anything weird, but to be honest I am not familiar enough with the prefetch buffer to really comment on it.
Thanks for taking care of this issue so quickly!
Thanks for your feedback @arjanbink . Both PRs are merged, I am now closing the issue.
Description: Combinational paths exist from data_rvalid_i and data_err_i to data_req_o and from instr_rvalid_i to instr_req_o. Such paths can potentially limit the achievable performance in systems that do not pipeline these signals as they would introduce a path from a slave to the master (i.e. Ibex) and then back to the slave. For details see the attached pdf.
Git tag: 242147239551be9fb8523fbc6f26f22b898d8d93 (August 26, 2019)
IBEX0.pdf