According to the nightlies, the last time this test showed a <100% pass rate was https://reports.opentitan.org/hw/top_earlgrey/dv/2022.08.24_15.46.18/report.html, with a 40% success rate.
It appears that this is resolved in all nightlies after that. I'm not sure which PR resolved it. This is a candidate for closing once it is tagged with the PR that resolved it.
Edit: Looks like this was resolved through the update to riscv-dv per https://github.com/lowRISC/ibex/commit/4990aa268428e51a1c65d2d467f0b3bda2371925
Tagging @weicaiyang in case something else resolved this chip-level test. If this test was resolved by the riscv-dv upgrade, please just close out the issue; otherwise, please comment with the PR that resolved it and close it out. TIA
Thanks @johngt for looking at this issue. Since this test has passed in the past 4 nightlies, I think we can close it. It was probably fixed when we fixed some other issues. I'm not sure whether it was the riscv-dv update, but that's very unlikely, as we don't involve riscv-dv at the chip level.
This test is not passing consistently, so I think it's worth investigating.
Tests | latest | 2022.10.22 | 2022.10.21 | 2022.10.17/2 | 2022.10.17/1 | 2022.10.15 | 2022.10.14 | 2022.10.12/2 | 2022.10.12/1 | 2022.10.11 | Suite |
---|---|---|---|---|---|---|---|---|---|---|---|
chip_same_csr_outstanding | 95.00 | 95.00 | 100.00 | 95.00 | 100.00 | 100.00 | 100.00 | 95.00 | 95.00 | 100.00 | tl_d_outstanding_access, tl_d_partial_access |
UVM_ERROR @ 7290.395290 us: (tl_monitor.sv:179) uvm_test_top.env.m_tl_agent_chip_reg_block.monitor [uvm_test_top.env.m_tl_agent_chip_reg_block.monitor] Number of pending a_req exceeds limit 9
GH revision: 9c0b24ddb util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_same_csr_outstanding --build-seed 2483310614 --fixed-seed 850130150
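For context on the failure signature above: the monitor counts A-channel requests that have been accepted but not yet answered on the D channel, and flags an error when that count exceeds a configured limit. Below is a minimal sketch of that kind of bookkeeping; the class and field names are illustrative and not taken from the actual tl_monitor.sv.

```systemverilog
// Illustrative sketch of the outstanding-request check, not the actual tl_monitor.sv code.
// The real monitor raises a UVM_ERROR; plain $error is used here to keep the sketch
// dependency-free.
class outstanding_req_tracker;
  int unsigned max_outstanding_req = 8;  // limit enforced by the monitor (value assumed)
  bit          pending[int unsigned];    // pending requests, keyed by a_source

  // Called when a request is accepted on the A channel.
  function void add_a_req(int unsigned a_source);
    pending[a_source] = 1'b1;
    if (pending.num() > max_outstanding_req) begin
      $error("Number of pending a_req exceeds limit %0d", max_outstanding_req);
    end
  endfunction

  // Called when the matching response is seen on the D channel.
  function void complete_d_rsp(int unsigned d_source);
    if (pending.exists(d_source)) pending.delete(d_source);
  endfunction
endclass
```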
@engdoreis Thanks for opening this. I recently changed the max outstanding number to 8. It looks like the xbar may take 9 outstanding items; I need to dump a waveform and confirm it with the designer. If you haven't looked into this, I can take it.
@eunchan @tjaychen
Looks like the ibex TL data port (tb.dut.top_earlgrey.u_rv_core_ibex.tl_adapter_host_d_ibex.tl_o/tl_i) can actually take 9 outstanding items. It sounds like the async FIFOs in the xbar take 8 items and the peripheral takes 1 item. Can you help confirm whether this is expected? If so, I can make a PR to increase the max outstanding to 9.
I reproduced the failure here.
/mnt/disks/filestores/opentitan-shared/users/weicai/scratch/HEAD/chip_earlgrey_asic-sim-vcs/0.chip_same_csr_outstanding/latest
hmm.. i'm not sure this was a conscious choice. It's more like the async fifos are able to absorb a certain number of transactions to cover the clock crossing latency. I don't think it absorbs 8 though (i thought it was either 2 or 4), so we'd have to take a look to see what's happening there.
I checked the generated code. As of now, the TL async socket has depth-4 req and rsp FIFOs, so a total of 8 requests (4 + 4) can be outstanding inside the xbar.
At the time, the async FIFO depth needed to be greater than 2, so it was set to 4. I think the current async FIFO can support a depth of 2, so we could also revise the xbar and reduce the number of outstanding requests.
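Putting the numbers from the two comments above together (depths as quoted in this thread, not re-derived from the RTL), the 9 outstanding items observed at the ibex D port would break down roughly as follows:

```systemverilog
// Back-of-the-envelope capacity at the ibex D port, using the depths quoted above
// (assumed numbers, not read back from the generated RTL).
package xbar_outstanding_sketch_pkg;
  localparam int unsigned AsyncReqFifoDepth = 4;  // async socket request FIFO
  localparam int unsigned AsyncRspFifoDepth = 4;  // async socket response FIFO
  localparam int unsigned PeriphOutstanding = 1;  // request already accepted by the peripheral
  // 4 + 4 + 1 = 9, matching the count the monitor flagged.
  localparam int unsigned MaxObservableReqs =
      AsyncReqFifoDepth + AsyncRspFifoDepth + PeriphOutstanding;
endpackage
```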
hmm... i took a peek at the timestamp @weicaiyang pointed me to where the test was erroring, and at least the tlul async FIFO's response and request FIFOs all had write depths of 0.
I think my larger question is: what exactly are we testing? Because usually when I think of outstanding capacity, I think about a host's ability to continually issue before it sees a response. So that would usually be a fixed number on the host side. Or we could think of it as a device's ability to keep accepting transactions before it responds (in the case of a memory controller). But I'm not quite sure what we intend to do here with the fabric. The fabric should certainly be able to support at least the number of outstanding transactions a host can issue, but it's not clear what placing an upper limit on it means.
Is the test trying to say the fabric should be able to support "at least" a certain number? Because that would make sense to me, but making sure it's smaller than a certain number... I'm not sure what that means.
Let me clarify. This is for a covergroup that comes from the TL agent. Since we hijack the ibex data port with a TL agent, this coverage is added automatically. Now we see that our CSR tests can occasionally hit 9 outstanding items. At the timestamp of the error, there are 9 outstanding reqs with 9 different source IDs: these reqs have been accepted on the A channel, but their D-channel rsps haven't completed.
We could either set the right max outstanding number or exclude this coverage. Since ibex can only support 2 outstanding items, this limit doesn't really matter, but it may be worth understanding why the fabric supports so many more than the host can send. As @eunchan mentioned, this may be an area we could optimize.
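For readers unfamiliar with where this coverage comes from: the TL agent bins how many requests are outstanding at once, up to a configured maximum. A rough sketch of that kind of covergroup is shown below; the class, covergroup, and bin names are illustrative and not copied from the agent.

```systemverilog
// Illustrative only -- the real covergroup lives in the TL agent and is picked up
// automatically because the ibex data port is driven by a tl_agent.
class outstanding_cov_sketch;
  covergroup outstanding_cg (int unsigned max_outstanding)
      with function sample (int unsigned cnt);
    cp_outstanding: coverpoint cnt {
      // One bin per outstanding-request count, up to the configured limit.
      bins each_cnt[] = {[1:max_outstanding]};
    }
  endgroup

  function new(int unsigned max_outstanding);
    outstanding_cg = new(max_outstanding);
  endfunction
endclass
```

Sampling `outstanding_cg.sample(current_count)` on every accepted A-channel request would then record which outstanding depths the tests actually reach.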
Thanks @weicaiyang for working on this.
@weicaiyang i might have to check in with you later. I'm not sure if i'm looking at the wrong thing. When I looked at the failing timestamp, it was claiming that none of the FIFOs had any entries, so I might be looking at the wrong place.
thanks @weicaiyang and @eunchan for explaining. I finally see what's happening. It definitely makes sense to reduce the fifo depth...especially on the response side. I think our core in general does not issue new transactions until it sees a response to the previous (with the exception of a few cases that do not go to peripherals).
But we should probably double check with @GregAC to be sure before changing.
alright! i checked in with @GregAC. On the ibex data side, we only issue 1 transaction at a time unless it's a misaligned instruction. The instruction side can issue more, but that would never go through the async fifo path anyways.
So I think I will follow @weicaiyang / @eunchan's suggestion of shrinking the async fifo depth.
SG, thanks @tjaychen
This test is still failing. Let me take a look.
Two observations:
1. waves: /mnt/disks/filestores/opentitan-shared/users/weicai/scratch/top_km/chip_earlgrey_asic-sim-vcs/0.chip_same_csr_outstanding/latest
   The UVM_ERROR shows it hits 5 outstanding items.
2. Sorry, I didn't test my update thoroughly, and now this affects many CSR tests. Please skip looking at this error in other tests.
UVM_ERROR @ 5596.731437 us: (tl_monitor.sv:179) uvm_test_top.env.m_tl_agent_chip_reg_block.monitor [uvm_test_top.env.m_tl_agent_chip_reg_block.monitor] Number of pending a_req exceeds limit 4
These are actually sync FIFOs inside socket_m1 and socket_1n. @eunchan and I went through this some time ago and reduced a lot of the depths to 0 already.
Did we miss some of them..?
i thought all the n1s were instantiated with 0's? is that not the case? The m1 one does make sense to default it to 1.
I'm afraid not. Not all the parameters are set in xbar_main. Please look at the lines I point to in the links below: those still use the default parameters, which set the depth to 2, and that depth is passed down to a FIFO.
This one in socket_m1 is still using 2. So is this one in socket_1n.
hm? but those just link to the socket_1n module. Is there an instantiation of socket_1n where you do not see it being set to 0? For socket_m1 i think that was indeed the case, but we can probably change the default.
xbar_main.sv and xbar_peri.sv instantiate those modules with Depth 0 always, as far as I know. So, they are fine.
Let me correct that: I checked xbar_main.sv, and as Tim mentioned, socket_m1 has a D-side FIFO (depth of 2). My comments above are incorrect. Weicai will add a comment here with a summary of the offline discussion.
As discussed with @eunchan offline, perhaps we need to update max_outstanding to 6:
flash mem can take 2 outstanding reqs, and it goes through this m1 socket, which can take 2 items in each of its req and rsp FIFOs (see the annotated instantiation below).
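// Annotation added for clarity -- these comments are not part of the generated xbar_main.sv.
// HReqDepth/HRspDepth of 12'h0 set the host-side FIFO depths to 0 for all 3 hosts (M = 3,
// 4 bits per host). DReqDepth/DRspDepth are not overridden, so they keep the default depth
// of 2 noted above, meaning the D side of this socket can hold 2 requests plus 2 responses.
// DReqPass/DRspPass of 1'b0 presumably disable the pass-through (bypass) path of those FIFOs.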
tlul_socket_m1 #(
  .HReqDepth (12'h0),
  .HRspDepth (12'h0),
  .DReqPass  (1'b0),
  .DRspPass  (1'b0),
  .M         (3)
) u_sm1_31 (
  ....
The other m1 sockets don't have more than 2 items in each req/rsp FIFO, and the other IPs don't support more than 2 outstanding reqs.
The socket_1n that the ibex D port goes through doesn't have an extra FIFO. So we will have max_outstanding = 6 for the ibex D port.
@tjaychen does this make sense to you?
EDIT: flash mem isn't a regular RW mem (it doesn't simply return the value we program), so the CSR tests don't exercise it at all. The SRAM can also take 2 outstanding reqs, but its m1 socket doesn't have a FIFO. Hence, max_outstanding should be 5.
Thanks to Tim, this should be fixed by #15918. The xbar can hold a maximum of 2 items now, plus another one in the peripheral, so the total is 3, which is what we set on the DV side.
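For reference, here is a minimal sketch of what "set on the DV side" might look like, assuming the TL agent config exposes a max-outstanding knob; the class and field names below are placeholders, not checked against the current tl_agent_cfg.

```systemverilog
// Placeholder config class -- names are illustrative only, not the real tl_agent_cfg.
class tl_agent_cfg_sketch;
  int unsigned max_outstanding_req = 16;
endclass

function automatic void set_reg_block_outstanding_limit(tl_agent_cfg_sketch cfg);
  // 2 items held inside the xbar plus 1 accepted by the peripheral.
  cfg.max_outstanding_req = 3;
endfunction
```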
Closing this issue, as the tests have passed the last 4 regressions already:
# | Tests | 2022-11-06 07:11:02 | 2022-11-05 07:11:23 | 2022-11-04 07:12:05 | 2022-11-03 07:11:23 | 2022-11-02 07:03:26 | 2022-11-01 07:10:18 | 2022-10-27 07:09:48 | 2022-10-26 07:06:46 | 2022-10-25 07:06:33 | 2022-10-24 07:05:24 | Suite |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | chip_csr_aliasing | 100 | 100 | 100 | 60 | 80 | 60 | 100 | 100 | 100 | 100 | csr_aliasing, regwen_csr_and_corresponding_lockable_csr, tl_d_outstanding_access, tl_d_partial_access |
5 | chip_csr_rw | 100 | 100 | 100 | 100 | 95 | 65 | 100 | 100 | 100 | 100 | csr_rw, regwen_csr_and_corresponding_lockable_csr, tl_d_outstanding_access, tl_d_partial_access |
20 | chip_same_csr_outstanding | 100 | 100 | 100 | 100 | 70 | 70 | 100 | 95 | 100 | 100 | tl_d_outstanding_access, tl_d_partial_access |
Hierarchy of regression failure
Chip Level
Failure Description
UVM_ERROR @ us: (cip_base_vseq.sv:250) [chip_common_vseq] Check failed masked_data == exp_data ( [] vs []) addr read out mismatch has 1 failures.
New error message: UVM_ERROR @ 5596.731437 us: (tl_monitor.sv:179) uvm_test_top.env.m_tl_agent_chip_reg_block.monitor [uvm_test_top.env.m_tl_agent_chip_reg_block.monitor] Number of pending a_req exceeds limit 4
Steps to Reproduce
GH revision: 9c0b24ddb util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_same_csr_outstanding --build-seed 2483310614 --fixed-seed 850130150
Tests with similar or related failures
chip_csr_aliasing chip_csr_rw