Closed Siris-Li closed 2 years ago
Hi @limingxuan-pku !
Thank you for filing the issue! There are a few topics here.
lu
" -- see #715. I have a patch in review right now to fix this until it's fixed upstream.EXTRA_LITEX_ARGS="--cpu-variant=perf+cfu"
. However, this doesn't work in Renode, since Renode simulates the CPU with its own instruction-level simulator (not Verilog level), and its simulation doesn't have the performance CSRs.More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.
To run in Verilator simulation with perf counters, here is a recipe:
make EXTRA_LITEX_ARGS="--cpu-variant=perf+cfu" PLATFORM=sim load
Yes, it's pretty slow!
NOTE: I find that my terminal is messed up after exiting the Verilator simulation. I need to type reset<enter>
to restore proper behavior.
FYI - @PiotrZierhoffer
More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.
In fairness to Renode, I should say that this is specific to our current use in CFU Playground, not to Renode in general. I believe that recently Renode added funcionality allowing you to swap in a Verilog cycle-accurate model for the CPU as well, in which case you'd get full accuracy and also be able to use the performance counter CSRs. However, this is not implemented in CFU Playground.
More generally, Renode is great for debugging, but not the best for performance analysis, mainly because of the above reason -- it does not use a cycle-accurate simulation of the CPU that you're actually using.
In fairness to Renode, I should say that this is specific to our current use in CFU Playground, not to Renode in general. I believe that recently Renode added funcionality allowing you to swap in a Verilog cycle-accurate model for the CPU as well, in which case you'd get full accuracy and also be able to use the performance counter CSRs. However, this is not implemented in CFU Playground.
Thanks for your patient and detailed interpretations!
I have suceessfully solved the problem about lu
.
When it comes to perf CSRs, I made some changes and got the following result.
export EXTRA_LITEX_ARGS="--cpu-variant=perf+cfu"
to CFU-Playground/proj/example_cfu/Makefile
.CFU-Playground/proj/example_cfu
, run make clean
, then run make renode
.
(If I just make renode
without make clean
, it won't rebuild the whole project, leading to the same result Perf counters not enabled
as before.)
After those steps, I try to run the golden test and got the result as follows.
...............................
"Event","Tag","Ticks"
0,DEPTHWISE_CONV_2D,5424
1,DEPTHWISE_CONV_2D,5533
2,CONV_2D,8207
.......
26,CONV_2D,7213
27,AVERAGE_POOL_2D,28
28,CONV_2D,7
29,RESHAPE,0
30,SOFTMAX,3
Counter | Total | Starts | Average | Raw
---------+--------+--------+---------+--------------
0 | 0 | 0 | n/a | 0
1 | 0 | 0 | n/a | 0
2 | 0 | 0 | n/a | 0
3 | 0 | 0 | n/a | 0
4 | 0 | 0 | n/a | 0
5 | 0 | 0 | n/a | 0
6 | 0 | 0 | n/a | 0
7 | 0 | 0 | n/a | 0
123M ( 122687797 ) cycles total
OK Golden tests passed
It seems that perf CSRs also work well in Renode. After all, I will go on my journey in CFU-Playground using Renode. If there exists some other questions, I will come back and raise questions. Thanks again! (●'◡'●)
@limingxuan-pku if you're interested in more detailed analysis of the execution in Renode, please take a look at the Guest application profiling
chapter in one of our latest note: https://antmicro.com/blog/2022/09/execution-tracing-in-renode/
Long story short, you can create execution traces similar to what you can see here: https://zephyr-dashboard.renode.io/renodepedia/boards/hifive1/?view=software&demo=Hello_World#trace
But as @tcal-x said, Renode is not a cycle-accurate simulator. You will find we're accurate on an instruction level, unless you're using some kind of RTL co-simulation,
@limingxuan-pku if you're interested in more detailed analysis of the execution in Renode, please take a look at the
Guest application profiling
chapter in one of our latest note: https://antmicro.com/blog/2022/09/execution-tracing-in-renode/Long story short, you can create execution traces similar to what you can see here: https://zephyr-dashboard.renode.io/renodepedia/boards/hifive1/?view=software&demo=Hello_World#trace
But as @tcal-x said, Renode is not a cycle-accurate simulator. You will find we're accurate on an instruction level, unless you're using some kind of RTL co-simulation,
Thanks for your advice. I will read these docs carefully to learn more about Renode. ;-)
@tcal-x Well, the next difficulty comes so soon. ╯︿╰ I try to follow The Step-by-Step Guide to Building an ML Accelerator. When I come to utilize the performance counters built into the soft-CPU to count how many cycles one loop takes, the result makes me confused.
CONV_2D
operation, thus run the following in my project folder CFU-Playground/proj/my_first_cfu
:
$ mkdir -p src/tensorflow/lite/kernels/internal/reference/integer_ops/
$ cp \ ../../third_party/tflite-micro/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h \ src/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h
They are the same as the tutorial says.#include "perf.h"
and then surround the inner loop with perf functions like below:
perf_enable_counter(0);
for (int in_channel = 0; in_channel < filter_input_depth;++in_channel) { int32_t input_val = input_data[Offset(input_shape, batch, in_y, in_x, in_channel + group filter_input_depth)]; int32_t filter_val = filter_data[Offset(filter_shape, out_channel, filter_y, filter_x, in_channel)]; / ... / acc += filter_val (input_val + input_offset); }
perf_disable_counter(0);
They are also the same as [tutorial](https://cfu-playground.readthedocs.io/en/latest/step-by-step.html).
- Results I get
Run `make clean renode` and then run the golden test. The outputs are as follows:
Counter | Total | Starts | Average | Raw ---------+--------+--------+---------+-------------- 0 | 4295M | 124418 | 35k | 4294594042 1 | 0 | 0 | n/a | 0 2 | 0 | 0 | n/a | 0 3 | 0 | 0 | n/a | 0 4 | 0 | 0 | n/a | 0 5 | 0 | 0 | n/a | 0 6 | 0 | 0 | n/a | 0 7 | 0 | 0 | n/a | 0 106M ( 105724225 ) cycles total
The perf CSR's output is so large! I cannot figure out what's going wrong here.
(;′⌒`)
Thanks in advance for your patience with my childish questions.
Hi @limingxuan-pku,
That is large. It looks as if perhaps per counter zero had been running for a while before starting the test.
You could try printing the value of perf_get_counter(0) just before the perf_enable_counter(0). That would let you know whether the counts are zero before the first one, and how the counts are incrementing over time.
You could try printing the value of perf_get_counter(0) just before the perf_enable_counter(0). That would let you know >whether the counts are zero before the first one, and how the counts are incrementing over time.
@alanvgreen Hello! I have tried your method. The output is strange.
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
perf_get_counter0 = 0
..perf_get_counter0 = 0
perf_get_counter0 = 0
..
"Event","Tag","Ticks"
0,DEPTHWISE_CONV_2D,5425
...
Counter | Total | Starts | Average | Raw
---------+--------+--------+---------+--------------
0 | 4295M | 124418 | 35k | 4294594042
1 | 0 | 0 | n/a | 0
2 | 0 | 0 | n/a | 0
3 | 0 | 0 | n/a | 0
4 | 0 | 0 | n/a | 0
5 | 0 | 0 | n/a | 0
6 | 0 | 0 | n/a | 0
7 | 0 | 0 | n/a | 0
277M ( 277308337 ) cycles total
result is 226
---
As you can see, before the final table is printed, it seems that perf counter0 is always 0. Is that because I am using Renode or anything else?
Hi @limingxuan-pku , I should have said it more clearly. There is no chance of the performance counters working in Renode -- they are not implemented at all. The value that you got for perf counter 0 is probably some uninitialized memory.
You have two choices:
In renode, just use the mcycle
register, and maintain your own global variable for accumulating the cycles:
extern long long unsigned my_cycles; // initialize to 0 somewhere
unsigned my_start = perf_get_mcycle();
...
...code to time...
...
unsigned my_finish = perf_get_mcycle();
my_cycles += (my_finish - my_start);
OR, use the perf counters either with Verilator simulation, or on an actual board.
@tcal-x Thanks for your reply! I've realized the fact that Renode could not implement perf counters. Forgive my stubbornness with Renode and now I've turn to an actual board. ;-)
@limingxuan-pku well, the perf counters are mostly a convenience. You could maintain as many software cycle counts as you need, using just $mcycle
and code like I showed above. Then you can get (approximate) cycle counts using Renode.
I follow the docs in my VM ubuntu 20.04. It seems that everything works well. After I type
make renode
in CFU-Playground/proj/example_cfu, I can get this:However, when I try to run the golden test, I will get the following result.
Though it says
Golden tests passed
, I cannot get the number of “ticks” each operation took to complete. For example,0,DEPTHWISE_CONV_2D,lu
. It showslu
here instead of a number like0,DEPTHWISE_CONV_2D,7892
. I wonder whether it is beacause the renode itself, or due to my neglection of anything? By the way, asPerf counters not enabled.
says, I'd like to know how to enable perf counters?