jbush001 / NyuziProcessor

GPGPU microprocessor architecture
Apache License 2.0
1.96k stars 348 forks source link

May be a icache miss thread still can be scheduled again? #198

Closed zhangliang1997 closed 3 years ago

zhangliang1997 commented 3 years ago

Hi jbush001: I think i find a bug, but i'm not sure. In ifetch_tag_stage.v, there are some codes about thread scheduled.

    assign can_fetch_thread_bitmap = ts_fetch_en & ~icache_wait_threads;
    assign thread_sleep_mask_oh = cache_miss_thread_oh & {`THREADS_PER_CORE{ifd_cache_miss}};
    assign icache_wait_threads_nxt = (icache_wait_threads | thread_sleep_mask_oh)
        & ~l2i_icache_wake_bitmap;

I think a icache miss thread still can be scheduled, for example:

In cycle n, if we select 0 thread to fetch instruction; In cycle n+1, 0 thread icache miss in ifetch_data_stage, in the same time(cycle n+1), we will select a thread to fetch instruction. Because 0 thread icache miss, so in n+1 cycle, 0 thread can't have opportunity to be scheduled or arbitrated.

So i think can_fetch_thread_bitmap should be:

    assign can_fetch_thread_bitmap = ts_fetch_en & ~icache_wait_threads & ~thread_sleep_mask_oh;

So is there any problem about these code ? or just i don't have a thorough understand about code?

jbush001 commented 3 years ago

Good catch! Your observation is correct that it may schedule the thread unnecessarily if there is a miss, and that adding ~thread_sleep_mask_oh would prevent that. It actually used to be implemented the way you suggested. The reason I changed it is that it ended up being a long combinational path that significantly reduced the maximum clock speed. My workaround was to invalidate the instruction (which injects a bubble, but is otherwise correct). This wastes a clock cycle, but should be relatively infrequent, and my premise is that the improvement in clock speed would outweigh that.

Here is the change where I removed that logic:

https://github.com/jbush001/NyuziProcessor/commit/33484aac4369ae9e1920c1aac3994639fdb79fd2

Later I removed even more:

https://github.com/jbush001/NyuziProcessor/commit/7064fe44643959f37ff0f4a05f702a28724c7931

I was not able to think of a way to reduce the combinational delay in that logic and avoid this.

zhangliang1997 commented 3 years ago

Thanks very much ! I have understanded it. By the way, i learned a lot from nyuzi, such as the verilog implement of the gpgpu. Before i find nyuzi, i just learned gpgpu architecture from gpgpusim. After l read nyuzi code, l learn a lot "real" things. This is a very cool project! Thanks again!

jbush001 commented 3 years ago

Thanks, I'm glad you found it useful! :)