openpower-cores / a2i

Other
243 stars 40 forks source link

Error happens when two threads co-execute together #47

Open Grubby-CPU opened 2 years ago

Grubby-CPU commented 2 years ago

Hi Guys,

I have two threads co-execute together. One thread named T0 executes mfspr to read some values from slowspr to GPR and the other thread named T1 executes normal ALU instructions. I found this can cause errors since slowspr can return value later meanwhile the datapath of mfspr and ALU to the GPR is the same. In other words, T0 and T1 can write GPR at the same time with the same data path and control path. Any comments about this?

Many thanks

mikey commented 2 years ago

I'm not sure I follow your question sorry. SMT threads have independent GPRs so they won't affect each other.

Grubby-CPU commented 2 years ago

Different SMT threads use the same write port of the GPR. Normal ALU instructions from different threads do not have contention since only one instruction can be issued per cycle and all WB happens after EX6. However, reading slowspr returns data later which causes contention when mfspr and ALU instr. want to use the write port at the same time.

mikey commented 2 years ago

Sorry I'm still a bit lost. Are you concerned about a performance or functional issue? Certainly SMT threads can affect each others performance. They should not affect each other functionally.

openpowerwtf commented 2 years ago

Are you saying T0 mfspr issues before T1 issues ALU op, but the T0/T1 GPR writes are concurrent because of extra cycles in the mfspr pipe? I think that would either be dealt with at issue time or more likely as a resource collision T1 flush when it is detected.

openpowerwtf commented 2 years ago

Why is SMT required? Isn't a single thread susceptible to the same condition?

openpowerwtf commented 2 years ago

I didn't see the case in Table D-5, but D.4.6 (p.843) appears to cover it as a stall:

Slow SPRs Blocking instructions that stall any instructions following them until results are available.

Grubby-CPU commented 2 years ago

Hi mikey and openpowerwtf,

Thanks for the reply! As mentioned in D.4.6 (p.843), slow SPR only blocks instructions following it within the same thread. That's why the single thread does not have the problem. However, other threads can still issue ALU instructions which causes contention between different threads.

openpowerwtf commented 2 years ago

Where does it say that? I assumed 'any' did not imply 'any on this thread'. There are clearly interthread resource collisions for other ops like mul/div and they are described in Appendix D also.

Grubby-CPU commented 2 years ago

Hi openpowerwtf,

I verified this through simulation. In fact, A2 uses "hole" in the issue unit for the div instruction to avoid the interthread resource collisions.

Note A2 also tries to use "hole" here for slowspr read. Please see 14.5.130 XUCR0 - Execution Unit Configuration Register 0, 43:47 SSDLY.

Unfortunatelly, using default 7 cyclces does not aovid the collision. My question is where does 7 cycles come from? I guess you guys change the pipeline a bit when release the A2I code so 7 cycles do not work anymore.

openpowerwtf commented 2 years ago

No @Grubby-CPU , I did not change any functional source code except to get it to compile with Vivado (not intentionally :fearful: ).

But it IS possible there were configuration bits changed by firmware/boot code/etc. in real systems. Usually the default is 'good', but not guaranteed (late bug discovery, different core/system modes, etc.). It is interesting there are config bits for this exact case. The RO/config ring note means the setting could only be changed 'externally' - not by processor code directly.

In xuq_dec_sspr.vhdl, it looks like each thread does the slowspr check and then it becomes a global slowspr_need_hole, which then is a part of xu_iu_need_hole.

  1. Is this a single thread or SMT problem? I still don't see why ST vs SMT matters.
  2. Is this a certain SPR or all slow SPRs? From above, it seems like it's any slow SPR.
  3. Is the non-SPR op a specific one or just any GPR write op?
  4. Does initializing SSDLY to greater than 7 fix the problem? Since you can't use mtspr, you have to update the RTL to allow the field to be written, or change the init value.
Grubby-CPU commented 2 years ago

Hi Openpowerwtf,

  1. It is the SMT problem. For a single thread, is1_is_slowspr signal in iuq_fxu_dep.vhdl will set a barrier, i.e., is2_instr_is_barrier. This will block the later instructions within the same thread. That's why single thread does not have the collision problem.
  2. The collision can happen when reading any slow SPRs.
  3. Any GPR write except the load miss since the load miss uses other write port named w_l_act, w_l_addr_func, w_l_data_func of GPR.
  4. Yes, initializing SSDLY to greater than 7 can fix the problem. However, I am not sure this solution can handle all the cases. You konw many unexpectable corner cases can happen when the CPU is executing in the real world. That's why I created this question in order to know the reasons of setting 7 cycles.
openpowerwtf commented 2 years ago

Good info!

So I get from your analysis:

  1. For single-thread, a 'bubble' is inserted in the pipe. This eliminates the write-write collision.
  2. For multithread, the 'hole' should be created to prevent collision, but is not aligned correctly with the mfspr write.

The solution(?): I believe single-thread and multithread cases are equivalent. The slow mfspr has a constant writeback latency, and any following ALU write has minimum issue and writeback latencies. So there is a worst-case defined window (1 cycle?) where the collision occurs. Since the window is bounded, the stall will cover all cases if it's implemented properly.

I don't know why the stall and hole are implemented slightly differently, but would guess it's possible the single-thread case was designed in from the start and the multihread case 'evolved' during sim verification. Also, there also could have been timing issues. And there are no performance implications for a normal stream.

Grubby-CPU commented 2 years ago

Cool!

  1. For single-thread, as discussed before, D.4.6 (p.843) mentions that "Slow SPRs Blocking instructions that stall any instructions following them until results are available." Besides avoiding collision within the thread, are there any other reasons to keep stalling the later instructions? If the purpose is only to avoid collision, I agree it can also be achieved by using 'hole' made by slowspr_need_hole signal. That's why I wonder are there any other reasons.

  2. For multi-thread, the 'hole' actually creates one cycle bubble in the fxu_issue unit to make sure when reading slow spr writes the GPR, there are no other active instructions which also want to write the GPR at the same cycle. Since it is only a one-cycle window, it shoule be very precise and can cover all the cases. That's why I want to dig out all the details instead of simply setting a value larger than 7 which can pass the test case on the hand.

Grubby-CPU commented 2 years ago

Following Q1, I guess this is because A2 is an in-order core, when reading slowspr does not return data, the laters instructions are stalled to guarantee the in-order feature? However, even in this case, I still do not think it is necessary to stall all later instructions until the slowspr read is done since the reading slowspr is not stalled. It just returns the reading slowspr data in the fixed delayed cycles.

openpowerwtf commented 2 years ago

The 'slow spr' distinction is because reading those regs isn't important for performance, and the regs are likely placed throughout the core since they are often specific to certain macros/functions. Adding latency simplifies the wiring/timing requirements to funnel the data to the GPR; but this now adds some extra control complexity to handle the special case.

The stall is to prevent the hardware resource (GPR write port) collision, like you are seeing. You could handle some cases by finishing out-of-order and adding control complexity. But this case could also happen in out-of-order pipes. If ops have different latencies to a shared resource like the writeback bus, they have to be scheduled/stalled to avoid collisions.

Grubby-CPU commented 2 years ago

Thanks for the detailed explanation of slow spr! For single-thread reading Slow SPR collision, now I think it can also be removed by using 'hole' made by slowspr_need_hole signal. In this case, we do not necessarily need the original barrier logic, i.e., D.4.6 (p.843) , which blocks the following instructions within the thread. Is it ok to do this? Any comments:) ??