google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). . . . . . Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0
471 stars 120 forks source link

Running with renode works, but how to show the number of ticks of each operation #718

Closed Siris-Li closed 2 years ago

Siris-Li commented 2 years ago

I follow the docs in my VM ubuntu 20.04. It seems that everything works well. After I type make renode in CFU-Playground/proj/example_cfu, I can get this:

Hello, World!
CFU Playground
==============
 1: TfLM Models menu
 2: Functional CFU Tests
 3: Project menu
 4: Performance Counter Tests
 5: TFLite Unit Tests
 6: Benchmarks
 7: Util Tests
 8: Embench IoT
main> 

However, when I try to run the golden test, I will get the following result.

Copied 9216 bytes at 0x400f19b0
Running pdti8
...............................
"Event","Tag","Ticks"
0,DEPTHWISE_CONV_2D,lu
1,DEPTHWISE_CONV_2D,lu
2,CONV_2D,lu
3,DEPTHWISE_CONV_2D,lu
4,CONV_2D,lu
5,DEPTHWISE_CONV_2D,lu
6,CONV_2D,lu
7,DEPTHWISE_CONV_2D,lu
8,CONV_2D,lu
9,DEPTHWISE_CONV_2D,lu
10,CONV_2D,lu
11,DEPTHWISE_CONV_2D,lu
12,CONV_2D,lu
13,DEPTHWISE_CONV_2D,lu
14,CONV_2D,lu
15,DEPTHWISE_CONV_2D,lu
16,CONV_2D,lu
17,DEPTHWISE_CONV_2D,lu
18,CONV_2D,lu
19,DEPTHWISE_CONV_2D,lu
20,CONV_2D,lu
21,DEPTHWISE_CONV_2D,lu
22,CONV_2D,lu
23,DEPTHWISE_CONV_2D,lu
24,CONV_2D,lu
25,DEPTHWISE_CONV_2D,lu
26,CONV_2D,lu
27,AVERAGE_POOL_2D,lu
28,CONV_2D,lu
29,RESHAPE,lu
30,SOFTMAX,lu
Perf counters not enabled.
   123M (    122687797 )  cycles total
OK   Golden tests passed
---

Though it says Golden tests passed, I cannot get the number of “ticks” each operation took to complete. For example, 0,DEPTHWISE_CONV_2D,lu. It shows lu here instead of a number like 0,DEPTHWISE_CONV_2D,7892. I wonder whether it is beacause the renode itself, or due to my neglection of anything? By the way, as Perf counters not enabled. says, I'd like to know how to enable perf counters?

tcal-x commented 2 years ago

Hi @limingxuan-pku !

Thank you for filing the issue! There are a few topics here.

More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.

To run in Verilator simulation with perf counters, here is a recipe:

make EXTRA_LITEX_ARGS="--cpu-variant=perf+cfu" PLATFORM=sim load

Yes, it's pretty slow!

NOTE: I find that my terminal is messed up after exiting the Verilator simulation. I need to type reset<enter> to restore proper behavior.

mithro commented 2 years ago

FYI - @PiotrZierhoffer

tcal-x commented 2 years ago

More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.

In fairness to Renode, I should say that this is specific to our current use in CFU Playground, not to Renode in general. I believe that recently Renode added funcionality allowing you to swap in a Verilog cycle-accurate model for the CPU as well, in which case you'd get full accuracy and also be able to use the performance counter CSRs. However, this is not implemented in CFU Playground.

Siris-Li commented 2 years ago

More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.

In fairness to Renode, I should say that this is specific to our current use in CFU Playground, not to Renode in general. I believe that recently Renode added funcionality allowing you to swap in a Verilog cycle-accurate model for the CPU as well, in which case you'd get full accuracy and also be able to use the performance counter CSRs. However, this is not implemented in CFU Playground.

Thanks for your patient and detailed interpretations! I have suceessfully solved the problem about lu.

When it comes to perf CSRs, I made some changes and got the following result.

PiotrZierhoffer commented 2 years ago

@limingxuan-pku if you're interested in more detailed analysis of the execution in Renode, please take a look at the Guest application profiling chapter in one of our latest note: https://antmicro.com/blog/2022/09/execution-tracing-in-renode/

Long story short, you can create execution traces similar to what you can see here: https://zephyr-dashboard.renode.io/renodepedia/boards/hifive1/?view=software&demo=Hello_World#trace

But as @tcal-x said, Renode is not a cycle-accurate simulator. You will find we're accurate on an instruction level, unless you're using some kind of RTL co-simulation,

Siris-Li commented 2 years ago

@limingxuan-pku if you're interested in more detailed analysis of the execution in Renode, please take a look at the Guest application profiling chapter in one of our latest note: https://antmicro.com/blog/2022/09/execution-tracing-in-renode/

Long story short, you can create execution traces similar to what you can see here: https://zephyr-dashboard.renode.io/renodepedia/boards/hifive1/?view=software&demo=Hello_World#trace

But as @tcal-x said, Renode is not a cycle-accurate simulator. You will find we're accurate on an instruction level, unless you're using some kind of RTL co-simulation,

Thanks for your advice. I will read these docs carefully to learn more about Renode. ;-)

Siris-Li commented 2 years ago

@tcal-x Well, the next difficulty comes so soon. ╯︿╰ I try to follow The Step-by-Step Guide to Building an ML Accelerator. When I come to utilize the performance counters built into the soft-CPU to count how many cycles one loop takes, the result makes me confused.

for (int in_channel = 0; in_channel < filter_input_depth;++in_channel) { int32_t input_val = input_data[Offset(input_shape, batch, in_y, in_x, in_channel + group filter_input_depth)]; int32_t filter_val = filter_data[Offset(filter_shape, out_channel, filter_y, filter_x, in_channel)]; / ... / acc += filter_val (input_val + input_offset); }

perf_disable_counter(0);

They are also the same as [tutorial](https://cfu-playground.readthedocs.io/en/latest/step-by-step.html).

- Results I get
Run `make clean renode` and then run the golden test. The outputs are as follows:

Counter | Total | Starts | Average | Raw ---------+--------+--------+---------+-------------- 0 | 4295M | 124418 | 35k | 4294594042 1 | 0 | 0 | n/a | 0 2 | 0 | 0 | n/a | 0 3 | 0 | 0 | n/a | 0 4 | 0 | 0 | n/a | 0 5 | 0 | 0 | n/a | 0 6 | 0 | 0 | n/a | 0 7 | 0 | 0 | n/a | 0 106M ( 105724225 ) cycles total


The perf CSR's output is so large! I cannot figure out what's going wrong here. 
(;′⌒`)

Thanks in advance for your patience with my childish questions. 
alanvgreen commented 2 years ago

Hi @limingxuan-pku,

That is large. It looks as if perhaps per counter zero had been running for a while before starting the test.

You could try printing the value of perf_get_counter(0) just before the perf_enable_counter(0). That would let you know whether the counts are zero before the first one, and how the counts are incrementing over time.

Siris-Li commented 2 years ago

You could try printing the value of perf_get_counter(0) just before the perf_enable_counter(0). That would let you know >whether the counts are zero before the first one, and how the counts are incrementing over time.

@alanvgreen Hello! I have tried your method. The output is strange.

perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
..perf_get_counter0 = 0
perf_get_counter0 = 0
..
"Event","Tag","Ticks"
0,DEPTHWISE_CONV_2D,5425
...
 Counter |  Total | Starts | Average |     Raw
---------+--------+--------+---------+--------------
    0    |  4295M | 124418  |    35k  |   4294594042
    1    |     0  |     0  |   n/a   |            0
    2    |     0  |     0  |   n/a   |            0
    3    |     0  |     0  |   n/a   |            0
    4    |     0  |     0  |   n/a   |            0
    5    |     0  |     0  |   n/a   |            0
    6    |     0  |     0  |   n/a   |            0
    7    |     0  |     0  |   n/a   |            0
   277M (    277308337 )  cycles total
  result is 226
---

As you can see, before the final table is printed, it seems that perf counter0 is always 0. Is that because I am using Renode or anything else?

tcal-x commented 2 years ago

Hi @limingxuan-pku , I should have said it more clearly. There is no chance of the performance counters working in Renode -- they are not implemented at all. The value that you got for perf counter 0 is probably some uninitialized memory.

You have two choices:

Siris-Li commented 2 years ago

@tcal-x Thanks for your reply! I've realized the fact that Renode could not implement perf counters. Forgive my stubbornness with Renode and now I've turn to an actual board. ;-)

tcal-x commented 2 years ago

@limingxuan-pku well, the perf counters are mostly a convenience. You could maintain as many software cycle counts as you need, using just $mcycle and code like I showed above. Then you can get (approximate) cycle counts using Renode.