joaomiguelvieira / gem5-accel

BSD 3-Clause "New" or "Revised" License
4 stars · 1 fork

The implementation of Gemmini #1

Open hanm2019 opened 7 months ago

hanm2019 commented 7 months ago

Thanks for your work on modeling the DNN accelerator. I have some questions about the Gemmini implementation in gem5-accel. The critical parameters of Gemmini are the size of the PE array, the capacity of the scratchpad memory (SPM), and the sizes of the TLB and PTW. However, I cannot find these configurations.

The specific methods of Gemmini (src/gemmini_dev_a/gemmini_dev_a.cc) simply compute the kernel on the CPU. How do they model the accelerator's behavior?

joaomiguelvieira commented 7 months ago

Hi, thanks for your interest in this work.

I understand that you are interested above all in the implemented model of Gemmini. However, gem5-accel is much broader than that, and the provided Gemmini Dev A is just a model of an accelerator inspired by Gemmini. Therefore, many of the features offered by the Gemmini framework, such as the configurable PE array size and the TLB, are either hardcoded in this model or not available at all.

For example, gem5-accel avoids implementing explicit memory-translation mechanisms and operates directly over physical addresses; hence, it does not need a TLB. The array size, on the other hand (16-by-16 in the example), is determined by the volume of data over which the accelerator operates at a given moment.

More generically, the actual timing behavior of the accelerator (how long it takes to generate the output from the input) is encoded by the scheduling of events, while the functional behavior (the recipe for generating the output) is just the same as the CPU's.

hanm2019 commented 7 months ago

Thanks for your response. I successfully ran the script (configs/gemmini/se-run.py); the output is:

status bench        | sw_time | hw_time | mem_time | speedup
------------------------------------------------------------
[PASS] conv2d       |  123352 |      34 |        0 |    3628.00
[PASS] conv2d_gemm  |   12615 |      29 |     8670 |    2.45
[PASS] conv3d       |  347398 |      34 |        0 |    10217.59
[PASS] conv3d_gemm  |   11375 |      49 |     7372 |    2.53
[PASS] maxpool      |    8346 |      32 |        0 |    260.81
[PASS] maxpool_gemm |    6466 |      30 |     7896 |    1.81
[PASS] relu         |    7984 |      34 |        0 |    234.82
[PASS] mm           |  395581 |      15 |        0 |    26372.07
[PASS] mm_gemm      |  408956 |      15 |     1966 |    207.43

Take conv2d as an example. If the kernel runs on the CPU, it takes 123352 cycles; when it runs on Gemmini, it takes 34 cycles. I cannot find how those 34 cycles are accounted for.

https://github.com/joaomiguelvieira/gem5-accel/blob/3ecdc183e6f15fc15f65167f9022ca32f375ef2b/src/gemmini_dev_a/gemmini_dev_a.cc#L109

process_fsm() is the key function for modeling the Gemmini behavior. It accesses the data twice if necessary and then performs baseConv_2D when the kernel is conv2d.

https://github.com/joaomiguelvieira/gem5-accel/blob/3ecdc183e6f15fc15f65167f9022ca32f375ef2b/src/gemmini_dev_a/gemmini_dev_a.cc#L204-L229

Does that mean the baseConv_2D function takes Gemmini 34 cycles? I would be extremely grateful if you could explain how gem5-accel computes the delay of the baseConv_2D function.

joaomiguelvieira commented 7 months ago

Hi,

That is correct: for that volume of data, the accelerator takes 34 cycles to process the input. In this specific case, the workload turned out to be memory-bound (given the massive amount of hardware resources available in the accelerator). Hence, those 34 cycles are the time it takes to load the batches of operands from the cache, process them, and store the results back in memory. As the accelerator is pipelined, the stages of loading the operands, computing the results, and storing the results in memory may all be active at once.

Hence, those 34 cycles are determined by how fast the accelerator can fetch the operands from the cache.

hanm2019 commented 7 months ago

Hi, I have another question. Do you plan to upload files that simulate full convolutional neural networks, such as the microbenchmarks in Fig. 4 of the gem5-accel paper?

fotuodunwu commented 1 month ago

So, since we cannot define hardware specific structures and model custom accelerator behavior, what use is this source code to architects?

joaomiguelvieira commented 1 month ago

> So, since we cannot define hardware specific structures and model custom accelerator behavior, what use is this source code to architects?

Hi,

You can model hardware in gem5-accel just as you would in gem5, and use gem5-accel's custom hardware structures to install accelerators close to memory resources and couple them with the CPU.

Before going on the internet being unpleasant and pointing fingers at someone else's work, you should read the existing documentation. There are three published scientific articles (gem5-ndp, gem5-accel, and NDPmulator) that explain how to use gem5-accel. On the other hand, if you do not know how to use gem5 for architectural modeling, you should definitely join the gem5 users community and ask your questions there.

fotuodunwu commented 1 month ago

Thank you very much for your response. I apologize if my previous statement seemed impolite; that was not my intention. Since English is not my first language, I greatly appreciate you pointing out my issues. I will make every effort to improve. Thank you once again for your understanding and assistance.