Open AaronJing opened 3 years ago
hasIm2Col = true
unfortunately doesn't work on dev
right now. I believe this was caused by our shift to single-ported, instead of dual-ported, SRAMs.
We'll fix that when we have more bandwidth. Right now, the latest version of the conv instruction doesn't use the im2col module, so there wouldn't really be a performance gain in including it.
Thanks. Is there any working branch that supports Im2Col
? I am interested in how im2col on the fly
would work in the whole system. Meanwhile, how the hardware abstracts the im2col
operation in convolution without im2col
module?
Sorry for the late response. I think that the im2col module may work if you set the sp_singleported
option to false
. im2col
should also be working on master
.
I am interested in how
im2col on the fly
would work in the whole system.
By default, Gemmini performs 16x16x16 matmuls in its spatial array.
A matmul is just a 3-nested for loop. Let's say the 3 iterators are called I, J, and K, as in the example below:
for (i in I)
for (j in J)
for (k in K)
C[i][j] += A[i][k] * B[k][j]
Now, if I, J, or K are less than 16, then the spatial array may not be fully utilized. Some PEs may be left idle, which reduces total utilization.
The im2col module basically makes sure that I >= 16 for as many inner matmuls as possible.
Meanwhile, how the hardware abstracts the im2col operation in convolution without im2col module?
By default, Gemmini's convolutions parallelize convolutions over the output-column, output-channel, and input-channel dimensions. These correspond to I, J, and K in the example above.
If I does not divide cleanly onto 16, then the im2col module might improve performance. However, it does not affect correctness, it only improves total performance. The im2col module itself is only an address generator FSM that reads data from the scratchpad and feeds it into the spatial array.
In our older code, we used to convert convolutions to matmuls using im2col running on the CPU. With new updates to the code, we can run the convolutions directly on Gemmini, without running im2col on the CPU first. However, even in that case, the im2col module might still be helpful to improving performance. (But we would need to fix the im2col module on dev first).
That's a general overview, but let me know if you need me to explain in more depth.
Very well explained. Thank you. When I learned im2col from an algorithm perspective, I thought it would rearrange data layout somehow in hardware. It turns out it’s just an address generator.
I will rerun the test and update it in a few days.
Update:
Gemmini is on dev
branch and gemmini-rocc-tests is updated. Chipyard is on the given hash.
With hasIm2Col = true
and sp_singleported = false
in default config, verilator failed and output the following.
Assertion failed: pipeline stall
at ROB.scala:447 assert(cycles_since_issue < PlusArg("gemmini_timeout", 10000), "pipeline stall")
Which is the same as before.
On master
branch, Im2col works. I run the conv.c with FAST defined and output correct results. However, this version took 10292 cycles while dev
took only 1717 cycles. I guess convolution hardware mapping logic in 'dev' branch is better? Or is there overhead that comes from im2col when the input is too small?
Thanks for the update. I guess we need to just fix the im2col bug in that case on dev
.
I guess convolution hardware mapping logic in 'dev' branch is better? Or is there overhead that comes from im2col when the input is too small?
The Im2col module doesn't add any overhead. On dev
, the convolution hardware mapping achieves much better overlapping between loads, stores, and executes than the older code on master
does. The im2col unit gives a boost to performance all on it's own, but the improved overlapping on dev
is much more significant than that.
If we fix the im2col module, then we might get better performance on the new conv-hardware as well, at least when the output-columns dimension is very small.
Hello!
Gemmini is on branch
dev
andgemmini-rocc-test
is at 000d2fd. After passing the CONV (I defined the fast inconv.c
) test with default configs, I sethasIm2col = true
inConfigs.scala
and reran the test. I received the following assertion failed.The failed assertion comes from