Possible deadlock in multicore config?

sxu55 commented 7 years ago

I spotted a potential deadlock scenario in the release channel design when there are multiple cores contending for the release channel for a single bank. Setup: Assuming that there are two cores and four L2 banks. In each L2 bank, we have two acq tracker and one rel tracker. State: acqTracker1 is occupying wb unit for back invalidation. The wb is probing cores and wait for their release response (e.g. waiting for release from core 1) At the same time, acqTracker 0 is also busy with a transaction, and wish to send a request to wb when it returns to idle. Deadlock: Now that core 1 is sending a release valid; simultaneously core 0 is sending a voluntary release to the same bank. In addition, core 0's release is in the same set as the addr in acqTracker 0. If the LockingRRArbiter chooses to wire core 0's release to the destination bank, there is a circular dependency that causes deadlock: Core 0's release is blocked (ready low) because it is in the same cache set as acqTracker0; acqTracker0 is blocked because it is waiting wb to become idle; wb is blocked because it is waiting for core 1's release; core 1's release is blocked because the arbiter is choosing core 0's release ...

I'd appreciate if anyone can help verify if this can indeed lead to a deadlock. A temp fix I made was to force the TL crossbar lockingArbiter to continuously switch the between valid clients when it is not in locked state. It helped in the workload that originally hangs, but I am not sure if this fix can cause any starvation.

wsong83 commented 7 years ago

Well, thanks for the question. It does take me at least an hour to reason it through.

So here is my general understanding: there are extreme corner cases in this old version of L2 which have not been handled properly; however, as we are not the original author, we do not know exactly what are these corner cases.

For your proposed scenario, right now, I think it is OK.

A key here is, the Acquire trackers cannot block a Release transaction.

If two concurrent transactions, one release one acquire, the release is served. If a release comes when an acquire in the same set but different way is being served by a acqTracker, the release is still accepted by the relTracker (key here is this release must hit in the cache due to inclusive) If an acquire comes when a release in the same set but different way is being served by a relTracker, the acquire is blocked.

So in your block scenario, Core 0's release should not be blocked by acqTracker 0.

sxu55 commented 7 years ago

Thanks a lot for your analysis.

If a release comes when an acquire in the same set but different way is being served by a acqTracker, the release is still accepted by the relTracker (key here is this release must hit in the cache due to inclusive)

I am not quite sure about his statement when looking at the cache code. My understanding is that when a voluntary release comes in, and if one acqTracker has matches signal high (io.matches.irel), then the routing function will wire acqTracker's ready (low) to the voluntary release request, thus resulting in a blocking.

io.matches.irel := (state =/= s_idle) && 
                       Mux(before_wb_alloc, irel_in_same_set, io.irel().conflicts(xact_addr_block))

They have such logic for generating the matches signal. In the case I mentioned, it is taking the result of irel_in_same_set, and the related function checks the index bits. I am not sure how it can identify if the ways are different or not.

wsong83 commented 7 years ago

You indeed have looked into the details and, yes, monsters are hidden in the details.

The line you pointed out is important. So what I said is even on a slightly higher level. To this control logic, it is when before_wb_alloc is false, acqTracker will no longer block the release that is on a different way (this is also important as only now, the acqTracker can be sure that it is not going to replace what the relTracker is operating on).

When two transactions are in the same set but different ways, in_same_set will be true but conflicts returns false. Conflicts return true only when the block addresses are equal. Here block address specifies a target cache block, not a set.

When the shared writeback unit is allocated to this acqTracker, the before_wb_alloc resets to false.

sxu55 commented 7 years ago

I will continue investigating this issue and report back if I find the root cause. Thx

wsong83 commented 7 years ago

feel free to open it again.

sxu55 commented 7 years ago

A quick question: In the L1 cache design, is there a reason to prioritize the writebackUnit's release over probeUnit's release (for the releaseArb)?

wsong83 commented 7 years ago

The writeback unit is actually shared by missing handlers and the probe unit. So actually here the releaseArb prioritizes a cache writeback over a probe ack (no data), which I think is the right decision. Then if you see wbArb, actually probe is prioritized over missing handlers. I feeling is this is also right to reduce the probe latency as a replacement writeback from L1 can normally wait with less control concern to deal with.

sxu55 commented 7 years ago

Thanks.

Regarding your answer last time:

To this control logic, it is when before_wb_alloc is false, acqTracker will no longer block the release that is on a different way (this is also important as only now, the acqTracker can be sure that it is not going to replace what the relTracker is operating on).

I think this means if the acqTracker has not grabbed the wb unit, it will detect set conflict instead of address conflict. I still feel this might lead to a deadlock: in the scenario I mentioned, there are two acqTrackers, acqTracker 1 has grabbed the wb and wait wb to finish, while acqTracker 0 is waiting to grab the wb unit so before_wb_alloc is high. If a voluntary release is set conflicting with acqTracker0, it will be blocked. Simultaneously, the wb is waiting for a non-voluntary release, but it is blocked by the voluntary release (further blocked by acqTracker0). Please correct me if I missed something.

wsong83 commented 7 years ago

Thanks! I think you might be on to something here. I am now kind of agree that there might be a deadlock if the voluntary release, only if it is an ack without data, is blocked exactly by the voluntary release.

However, since the logic here is pretty complicated and this code is developed by UC Berkeley, I think I can only be certain if there is a test case that can especially trigger this deadlock situation. A unit-test which directly generates cache transactions to L1 will do the job. Can you help me with this?

I think the easiest fix probably is to let the release arbiter prioritize the none-data probe ack. However, because this would not put our FPGA demos into immediate risk, I am currently considering fixing this with a low priority.

sxu55 commented 7 years ago

Sure. I am trying to make a minimal reproducible case and will let you know once I've got it.

I think fixing this issue will require at least three changes:

1) change the arbiter policy in the coherent_net crossbar. Their original code uses LockingRRArbiter in the crossbar. In such case, if the voluntary release gets the arbitration permission first, the system will deadlock because the arbiter will stay on the valid but non-ready voluntary release and never let non-voluntary release go through. In their design, the arbiter's ready is depending on the pre-chosen client signal, this is very similar to a combination loop... A non-intrusive fix is to let the locking arbiter automatically choose the next candidate when it is deadlocking for some cycles so as to resolve the lock.

2) in the nbdcache.scala, we need to prioritize the non-voluntary release over the voluntary release. I think switching the two inputs of releaseArb will help.

3) disable Tilelink release channel prebuffering. Even with the fix in 2), there might still be deadlock if the prebuffering is enabled. Suppose the release channel buffer is empty, the ready to a core will always be high until the first release request is buffered. In such case, if the voluntary release comes first, it will be enqueued into the buffer, locking the releaseArb, always get ready low, and blocking the later coming non-voluntary release.

sxu55 commented 7 years ago

I spotted another corner case today, indicating the three fixes above are not sufficient to fully prevent the release channel deadlock. The corner case happens when the L2 cache is in the pseudo-deadlock state mentioned in this thread. The deadlock is now in L1 nbdcache. Even though we can prioritize the probe releaseAck, when both MSHR and ProbeUnit try to grab the L1 wb unit (and when MSHR win the competition first), the Probe non-voluntary release will be blocked forever.

Since the fix of this case is not trivial, I am considering a potentially simpler fix. That is, make the number of WB unit in L2 match the number of acqTracker. If this is the case, each acqTracker will eventually grab its own WB unit, and detect address conflict instead of set conflict. I believe this can eliminate the WB unit resource dependency cycle, at the cost of some area and timing overhead.

wsong83 commented 7 years ago

I think no matter what the potential fix is, I need to have a detailed look of the cause of the deadlock once the test case is ready. The potential fix would be made at the minimal cost in maintaining the code and the performance as well. Once lowRISC is moved to the latest Rocket-chip, this will become legacy code.

I think now the priority is to produce a test case which can be used to confirm the deadlock. How to fix it will come afterwards.

sxu55 commented 7 years ago

I was a bit busy last week, sorry for the long wait. I've quickly reproduced the deadlock using a simple assembly program. Generally I bootstrapped two cores and use nop sled to pad the instructions to create such case (so it looks nasty...)

Configuration details: L2 cache: 4 banks, PLRU replacer, 128 sets, 2 ways, 2 acqTrackers L1 cache: 5 MSHRs, 8 sets and 1 way. L1toL2: MESICoherence L2toDRAM: MICoherence Addr Map: default config Core 0 reset vector: 0 Core 1 reset vector: 0x400002BC

Program flow: core 0 read two memory blocks in bank0, I denote the set of two blocks as set0 and set1 core 1 read two memory blocks with different tag in set0 and set1 (also in bank0) core 0 read two new memory blocks in set0 and set1 back to back in bank0. This will force two acqTrackers in bank0 to compete for a single writeback unit to back invalidate core0 data Simultaneously, let core 1 read a new memory block with different tag in set1, so core 1 is initiating a voluntary release at the same time

In this case, the core1's voluntary release goes to bank0 first, but is blocked by the acquireTracker due to set conflict. We expect to see core0's non-voluntary release is further blocked because the arbiter will not switch to another valid until the prev is serviced.

The tricky thing here is: with the setup above, one still cannot reproduce the deadlock because the crossbar locking arbiter will sometimes "intelligently" switch to core0's non-voluntary release even when core1's release come a cycle earlier. But this switch behavior is not guaranteed. It is dependent on the arbiter's internal last_grant signal.

Suppose a 9-input arbiter's last_grant is 0, and valid 6 is high first with ready low, and valid 5 is high later also with ready low. The internal chosen will switch to 5. However, if last_grant is 5, in such case the chosen signal will not switch. So the deadlock is indeed context dependent (depending on the previous granted client) Since constructing such last_grant sequence is so tedious, I choose to do a small hack: I replace the LockingRRArbiter in network.scala with my own TLockingRRArbiter. The only difference is that the initial value of last_grant is changed to 5 instead of 0. In such case, the arbiter will not switch and deadlock will happen in the case I created. You can see my mods in uncore/network.scala

Other changes I made is: rocket/tile.scala, frontend.scala ==> just a dumb way to set core1's reset vector to a different value src/configs.scala ==> all configs mentioned above

I've attached the vcd dump, the assembly program, and the patched files. You can find that the acqTracker0 and 1 in bank0 are stucked in state 3 and 4 forever.

testcase.zip

wsong83 commented 7 years ago

Thank you very much!

I am currently on leave and will be back next week. I will have a look once I am back in office.

wsong83 commented 7 years ago

I think I can confirm that the current L1 D$ / L2 does have this deadlock cases. The root cause, to my understanding, is that: When a L2 AcqTracker is blocked at s_wb_req and it also unfortunately blocks the Release response needed by the current writeback probe (by chain reaction in the Release buffer queue), both the L2 AcqTracker and the wriback unit are blocked due to waiting each other.

I agree that increasing the number of writeback units to the number of AcqTrackers should fix it. However, this method would introduce extra parallel allocation logic (more parallel trackers to control). Also the area overhead can be significant. The concurrent probes may cause more corner cases.

My current thought is to allow the ReleaseTracker to accept Release messages whenever possible. For the above deadlock scenario, I would like the ReleaseTracker to consume the originally blocked Release. (When a AcqTracker is in s_wb_req, even the incoming release is exactly the same cache block, serving the writeback from L1 before writing back the L2 cache block does not change metadata, as long as the L2 writing back starts after the L1 writeback is finished).

I have made the following changes to uncore/src/main/scala/cache.scala

diff --git a/src/main/scala/cache.scala b/src/main/scala/cache.scala
index e4c5c82..348330f 100644
--- a/src/main/scala/cache.scala
+++ b/src/main/scala/cache.scala
@@ -208,10 +208,12 @@ abstract class L2HellaCacheModule(implicit val p: Parameters) extends Module
     with HasL2HellaCacheParameters {
   def doInternalOutputArbitration[T <: Data : ClassTag](
       out: DecoupledIO[T],
-      ins: Seq[DecoupledIO[T]]) {
+      ins: Seq[DecoupledIO[T]],
+      req_enable: Bool = Bool(true)) {
     val arb = Module(new RRArbiter(out.bits, ins.size))
     out <> arb.io.out
-    arb.io.in <> ins 
+    arb.io.in <> ins
+    out.valid := req_enable && arb.io.out.valid
   }

   def doInternalInputRouting[T <: Bundle with HasL2Id](in: ValidIO[T], outs: Seq[ValidIO[T]]) {
@@ -418,8 +420,9 @@ class TSHRFile(implicit p: Parameters) extends L2HellaCacheModule()(p)

   // WritebackUnit evicts data from L2, including invalidating L1s
   val wb = Module(new L2WritebackUnit(nTransactors))
+  val wb_req_enable = (0 until nReleaseTransactors).map(i => !trackerList(i).io.inner.release.valid && !trackerList(i).io.busy).reduce(_&&_)
   val trackerAndWbIOs = trackerList.map(_.io) :+ wb.io
-  doInternalOutputArbitration(wb.io.wb.req, trackerList.map(_.io.wb.req))
+  doInternalOutputArbitration(wb.io.wb.req, trackerList.map(_.io.wb.req), wb_req_enable)
   doInternalInputRouting(wb.io.wb.resp, trackerList.map(_.io.wb.resp))

   // Propagate incoherence flags
@@ -476,6 +479,7 @@ class L2XactTrackerIO(implicit p: Parameters) extends HierarchicalXactTrackerIO(
   val data = new L2DataRWIO
   val meta = new L2MetaRWIO
   val wb = new L2WritebackIO
+  val busy = Bool(OUTPUT)
 }

 abstract class L2XactTracker(implicit p: Parameters) extends XactTracker()(p)
@@ -533,6 +537,7 @@ class L2VoluntaryReleaseTracker(trackerId: Int)(implicit p: Parameters) extends

   val s_idle :: s_meta_read :: s_meta_resp :: s_busy :: s_meta_write :: Nil = Enum(UInt(), 5)
   val state = Reg(init=s_idle)
+  io.busy := state =/= s_idle

   val xact = Reg(new BufferedReleaseFromSrc()(p.alterPartial({case TLId => p(InnerTLId)})))
   val xact_way_en = Reg{ Bits(width = nWays) }
@@ -630,6 +635,7 @@ class L2AcquireTracker(trackerId: Int)(implicit p: Parameters) extends L2XactTra

   val s_idle :: s_meta_read :: s_meta_resp :: s_wb_req :: s_wb_resp :: s_inner_probe :: s_outer_acquire :: s_busy :: s_meta_write :: Nil = Enum(UInt(), 9)
   val state = Reg(init=s_idle)
+  io.busy := state =/= s_idle

   // State holding transaction metadata
   val data_buffer = Reg(init=Vec.fill(innerDataBeats)(UInt(0, width = innerDataBits)))
@@ -813,10 +819,10 @@ class L2AcquireTracker(trackerId: Int)(implicit p: Parameters) extends L2XactTra
   // These IOs are used for routing in the parent
   val iacq_in_same_set = inSameSet(xact_addr_block, io.iacq().addr_block)
   val irel_in_same_set = inSameSet(xact_addr_block, io.irel().addr_block)
-  val before_wb_alloc = Vec(s_meta_read, s_meta_resp, s_wb_req).contains(state)
+  val before_wb_req = Vec(s_meta_read, s_meta_resp).contains(state)
   io.matches.iacq := (state =/= s_idle) && iacq_in_same_set
   io.matches.irel := (state =/= s_idle) && 
-                       Mux(before_wb_alloc, irel_in_same_set, io.irel().conflicts(xact_addr_block))
+                       Mux(before_wb_req, irel_in_same_set, state =/= s_wb_req && io.irel().conflicts(xact_addr_block))
   io.matches.oprb := Bool(false) //TODO

I will do more test tomorrow. Any thought for this fix? May be you can help me test for more multi-core test cases as we do not have any right now.

sxu55 commented 7 years ago

I agree that enforcing one-to-one correspondence between acq and wb is a no-brainer fix.. I think I got your idea, but I will need to do a careful reasoning tmr to remove some doubts. Also, I'd be happy to help test the patched design using the regression suite. Thanks!

wsong83 commented 7 years ago

A further change has been made to include alloc.irel in the wb_req_enable signal. You can see it in the PR to uncore. Now the PR has passed all single core isa regression and tagcache tests.

Wait for your feedback @sxu55 before applying the PR to lowrisc-chip master.

sxu55 commented 7 years ago

After some thoughts, I think your proposed fix should work. Below is my reasoning:

The original design prioritize the acq->wb allocation, and hope that the voluntary release, 1) if conflict with wb address, will eventually be consumed by the wb tracker once the wb allocation is done 2) if conflict with acq set but not cache line, will be unblocked once the wb allocation is done 3) if conflict with acq cache line, will be blocked until the acq tracker hit is serviced. Generally, the goal of the blocking is to ensure at each time, no two trackers are operating on the same cache line. The deadlock is because the release of wb resource is depending on release channel, which might be occupied by other non-related releases forever.

The fix you proposed is to break this allocation priority, let the voluntary release slip through and allocate release tracker first, simultaneously blocking the wb allocation if any release tracker is busy/is to be allocated. It will work because relTracker does not rely on TileLink signals to release the resource, so it will not block the allocation of wb forever

By looking at the code, I have two suggestions

1) in line 216, instead of masking the valid output using req_enable, I suggest to also mask the ready of the arb output. This is to avoid fake allocation success signal to acq trackers: if output ready in the arbiter is high, masking output valid only might still pass a ready high to the requesting acq tracker, letting it mistakenly think it has successfully allocated the wb unit

2) to improve the parallelism, we can relax the exclusive relationship between wb allocation and relTracker a bit. I think acq trackers can still allocate wb as long as the wb request addr does not match any of the addrs in the in-flight release trackers. Of course there will be a bit more CAM logic...

Please correct me if my two suggestions are incorrect. I am applying the patch and hopefully to get some results out in this week.

wsong83 commented 7 years ago

Thanks a lot! You just provided a better summary of what I have done to fix it.

For the first suggestion, nice catch! I will have it rectified as my first job tomorrow, or you would like to PR on it.

For the second suggestion, I have thought about it. Right now I am thinking simply blocking wb allocation whenever there is a release outstanding is good enough. This should be a rare event. Please let me know if you think otherwise. Also I would be happy to accept a PR on this block relaxation.

sxu55 commented 7 years ago

I think your patch is good, so I will not submit a PR on this. I don't have enough data to prove that blocking wb allocation will incur noticeable performance degradation, so I would choose to apply the current patch and ensure the functional correctness first. Will report back if I observe too many waiting cycles due to the conflict between these two components.

wsong83 commented 7 years ago

The patch has been applied to the latest lowRISC release (minion-v0.4) and backported to debug-v0.3. Thanks a lot for raising the issue and reviewing the bugfix.

lowRISC / lowrisc-chip

Possible deadlock in multicore config? #67