scalesim-project / scale-sim-v2

Repository to host and maintain scale-sim-v2 code
MIT License

GEMM OS cycle count, prefetch and demand matrices sizes #60

Closed lkartik closed 1 month ago

lkartik commented 1 year ago

Dear all,

Thanks for writing this simulator. I am trying to understand GEMM on systolic arrays with the Output Stationary (OS) dataflow, and I am using scalesim for that. I have a few questions, and it would be great if you could answer them. I don't understand the shapes of the prefetch and demand matrices for GEMM Output Stationary operations. For example:

For a GEMM with MNK = 2,2,2 with Output Stationary on a 2x2 systolic array:

1) ifmap_prefetch_mat: (1, 4), filter_prefetch_mat: (1, 4)
2) ifmap_demand_mat: (4, 2), filter_demand_mat: (1, 4), ofmap_demand_mat: (4, 2)

Re: 2) If the filter demand matrix shape is (1, 4), how do we use the second column in the systolic array? 🤔

For simulation time: with pencil and paper, I get 4 cycles. Scalesim gives 3 cycles.

For a GEMM with MNK = 4,4,4 with Output Stationary on a 2x2 systolic array:

1) ifmap_prefetch_mat: (1, 16), filter_prefetch_mat: (1, 16)
2) ifmap_demand_mat: (10, 4), filter_demand_mat: (1, 16), ofmap_demand_mat: (10, 4)

Re: simulation time, I get 10 cycles with pencil and paper, and 9 cycles from the simulation.

Questions:

1) Are prefetch matrices calculated based on the number of elements to be prefetched from DRAM?
2) How are the demand matrix dimensions calculated for GEMM?
3) Compared to scalesim, I always get one extra cycle for GEMM OS in my pencil-and-paper calculations. Please find slides attached.

Kind regards,
Kartik, PhD student, Ghent University

Attachment: systolic_matmul_4x4.pptx

ritikraj7 commented 1 month ago

Hi @lkartik,

Are prefetch matrices calculated based on the number of elements to be prefetched from DRAM?

Yes. Your first example is a GEMM between a 2x2 input matrix and a 2x2 filter matrix, so 4 elements each must be prefetched from DRAM for the input and filter matrices.
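The element counts above can be sketched in a few lines of Python. This is only an illustration, not ScaleSim's own code: the function name `prefetch_shapes` is hypothetical, and it assumes the usual GEMM convention in which the input (ifmap) operand is M x K, the filter operand is K x N, and each element is prefetched once into a flat 1 x count row.

```python
def prefetch_shapes(m, n, k):
    """Return (ifmap_shape, filter_shape) as flat 1 x count row vectors.

    Assumption (not taken from ScaleSim's source): the ifmap operand is
    M x K, the filter operand is K x N, and every element is prefetched
    exactly once.
    """
    ifmap_elems = m * k    # elements in the M x K input matrix
    filter_elems = k * n   # elements in the K x N filter matrix
    return (1, ifmap_elems), (1, filter_elems)


print(prefetch_shapes(2, 2, 2))  # ((1, 4), (1, 4))
print(prefetch_shapes(4, 4, 4))  # ((1, 16), (1, 16))
```

Both results agree with the prefetch shapes reported in the question for the MNK = 2,2,2 and MNK = 4,4,4 cases.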

How are the demand matrix dimensions calculated for GEMM?

All the demand matrices have the same shape if you keep the systolic array square: (number of cycles) x (array width or height). In your first example that is 4x2, and in your second example it is 10x4. Your observation about the filter demand matrix is incorrect. You can look at IFMAP_SRAM_TRACE.csv, FILTER_SRAM_TRACE.csv and OFMAP_SRAM_TRACE.csv for the ifmap, filter and ofmap demand matrices respectively. There you can see, for each cycle, which addresses are sent from SRAM to the systolic array (ifmap and filter) and which addresses are computed (ofmap).
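As a rough sketch of the "cycles x array dimension" rule, the snippet below uses a textbook skewed-input model of a single output-stationary fold. It is an assumption for illustration and is not taken from ScaleSim's source: the function name `os_demand_shape` is hypothetical, the array is assumed square, and folding over multiple output tiles is not modelled.

```python
def os_demand_shape(k, array_dim):
    """Demand-matrix shape (cycles, array_dim) for one output-stationary fold.

    Assumption: operands enter skewed by one cycle per row and per column,
    so PE (i, j) sees its first operand pair in cycle i + j and its last
    in cycle i + j + k - 1 (cycles counted from 0).
    """
    # Zero-based index of the cycle in which the far-corner PE finishes.
    last_cycle = (array_dim - 1) + (array_dim - 1) + (k - 1)
    cycles = last_cycle + 1  # number of rows in the demand matrix
    return cycles, array_dim


print(os_demand_shape(k=2, array_dim=2))  # (4, 2)
print(os_demand_shape(k=4, array_dim=4))  # (10, 4)
```

The first call matches the 4x2 shape of the first example. The 10x4 shape of the second example comes out of this simplified model only with `array_dim=4`, so any comparison with a 2x2 configuration should be checked against the trace files rather than this sketch.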

Compared to scalesim, I always get one extra cycle for GEMM OS in my pencil-and-paper calculations. Please find slides attached.

If you look at OFMAP_SRAM_TRACE.csv, you will find that the cycle count starts from 0. If you also start counting from 0 in your pen-and-paper calculations, you will get the same answer.
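The off-by-one can be seen by comparing the last cycle index with the number of cycles in a trace. The sketch below assumes the trace layout is one row per cycle with the cycle number in the first column; the rows in `SAMPLE_TRACE` are made up for illustration and are not real ScaleSim output, and `cycle_summary` is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration only (not real ScaleSim output).
# Assumed layout: first column is the cycle, the rest are addresses.
SAMPLE_TRACE = """0, 100, 101
1, 102, 103
2, 104, 105
3, 106, 107
"""


def cycle_summary(trace_text):
    """Return (first_cycle, last_cycle, cycle_count) for a trace."""
    rows = [row for row in csv.reader(io.StringIO(trace_text)) if row]
    cycles = [int(float(row[0])) for row in rows]
    first, last = min(cycles), max(cycles)
    return first, last, last - first + 1


first, last, count = cycle_summary(SAMPLE_TRACE)
print(first, last, count)  # 0 3 4
```

With cycles numbered 0 to 3, the last index is 3 while the number of cycles is 4, which is the same one-cycle gap as between the simulator's figure and the pen-and-paper count. To try this on a real file, pass the contents of OFMAP_SRAM_TRACE.csv instead of `SAMPLE_TRACE`.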

Feel free to reopen this issue if you still have any questions.