ucb-bar / gemmini

Berkeley's Spatial Array Generator
Other
779 stars 160 forks source link

Verilator running CONV elf with IM2COl set true failed #120

Open AaronJing opened 3 years ago

AaronJing commented 3 years ago

Hello!

Gemmini is on branch dev and gemmini-rocc-test is at 000d2fd. After passing the CONV (I defined the fast in conv.c) test with default configs, I set hasIm2col = true in Configs.scala and reran the test. I received the following assertion failed.

[UART] UART0 is here (stdin/stdout).
Output dimension: 5

Randomize inputs...
Randomize weights...
Randomize bias...
CPU conv...
CPU conv took 1 cycles
Flatten weights...
Gemmini conv...
[138] %Error: chipyard.TestHarness.GemminiRocketConfig.top.v:257523: Assertion failed in TOP.TestHarness.chiptop.system.tile_prci_domain.tile_reset_domain.tile.gemmini.rob
%Error: /home/jing/chipyard/sims/verilator/generated-src/chipyard.TestHarness.GemminiRocketConfig/chipyard.TestHarness.GemminiRocketConfig.top.v:257523: Verilog $stop
Aborting...

The failed assertion comes from

Assertion failed: pipeline stall
     at ROB.scala:447 assert(cycles_since_issue < PlusArg("gemmini_timeout", 10000), "pipeline stall")
hngenc commented 3 years ago

hasIm2Col = true unfortunately doesn't work on dev right now. I believe this was caused by our shift to single-ported, instead of dual-ported, SRAMs.

We'll fix that when we have more bandwidth. Right now, the latest version of the conv instruction doesn't use the im2col module, so there wouldn't really be a performance gain in including it.

AaronJing commented 3 years ago

Thanks. Is there any working branch that supports Im2Col? I am interested in how im2col on the fly would work in the whole system. Meanwhile, how the hardware abstracts the im2col operation in convolution without im2col module?

hngenc commented 3 years ago

Sorry for the late response. I think that the im2col module may work if you set the sp_singleported option to false. im2col should also be working on master.

I am interested in how im2col on the fly would work in the whole system.

By default, Gemmini performs 16x16x16 matmuls in its spatial array.

A matmul is just a 3-nested for loop. Let's say the 3 iterators are called I, J, and K, as in the example below:

for (i in I)
    for (j in J)
        for (k in K)
            C[i][j] += A[i][k] * B[k][j]

Now, if I, J, or K are less than 16, then the spatial array may not be fully utilized. Some PEs may be left idle, which reduces total utilization.

The im2col module basically makes sure that I >= 16 for as many inner matmuls as possible.

Meanwhile, how the hardware abstracts the im2col operation in convolution without im2col module?

By default, Gemmini's convolutions parallelize convolutions over the output-column, output-channel, and input-channel dimensions. These correspond to I, J, and K in the example above.

If I does not divide cleanly onto 16, then the im2col module might improve performance. However, it does not affect correctness, it only improves total performance. The im2col module itself is only an address generator FSM that reads data from the scratchpad and feeds it into the spatial array.

In our older code, we used to convert convolutions to matmuls using im2col running on the CPU. With new updates to the code, we can run the convolutions directly on Gemmini, without running im2col on the CPU first. However, even in that case, the im2col module might still be helpful to improving performance. (But we would need to fix the im2col module on dev first).

That's a general overview, but let me know if you need me to explain in more depth.

AaronJing commented 3 years ago

Very well explained. Thank you. When I learned im2col from an algorithm perspective, I thought it would rearrange data layout somehow in hardware. It turns out it’s just an address generator.

I will rerun the test and update it in a few days.

AaronJing commented 3 years ago

Update: Gemmini is on dev branch and gemmini-rocc-tests is updated. Chipyard is on the given hash. With hasIm2Col = true and sp_singleported = false in default config, verilator failed and output the following.

Assertion failed: pipeline stall
     at ROB.scala:447 assert(cycles_since_issue < PlusArg("gemmini_timeout", 10000), "pipeline stall")

Which is the same as before.

On master branch, Im2col works. I run the conv.c with FAST defined and output correct results. However, this version took 10292 cycles while dev took only 1717 cycles. I guess convolution hardware mapping logic in 'dev' branch is better? Or is there overhead that comes from im2col when the input is too small?

hngenc commented 3 years ago

Thanks for the update. I guess we need to just fix the im2col bug in that case on dev.

I guess convolution hardware mapping logic in 'dev' branch is better? Or is there overhead that comes from im2col when the input is too small?

The Im2col module doesn't add any overhead. On dev, the convolution hardware mapping achieves much better overlapping between loads, stores, and executes than the older code on master does. The im2col unit gives a boost to performance all on it's own, but the improved overlapping on dev is much more significant than that.

If we fix the im2col module, then we might get better performance on the new conv-hardware as well, at least when the output-columns dimension is very small.