To have comparable targets, the measurement would ideally only capture the time spent executing the model and the memory usage caused by the model. However, this is probably quite challenging. We could use a "baseline" application to compare against, but it should not include any drivers or print code that would not be present in a realistic deployment scenario of the application. For example, including `printf` in both the baseline and the measured application could pull in other functions that would also have been pulled in by the deployed model inference, thereby underestimating the model size.
The total deployed sizes are of course still a useful metric, because they determine what is actually deployable on a real target.
I would say `main()`: the time spent in the bootloader/startup code should not count towards "application" (aka model) execution time. For most application scenarios it is only encountered once when the system boots and is thus negligible. We should mention that somewhere and then stay consistent when measuring execution times, in that we start and stop the measurement right before the start and right after the end of the execution of the application.
Another thing we might want to consider is speedups resulting from repeated execution of the same application. Keras and TF, I believe, do certain runtime optimizations that result in significant speedups for the 2nd and following runs. But I believe no such optimizations happen at runtime on microcontrollers. @rafzi, you probably know a lot more about that.
Compared to measuring application size, the actual process of measuring application execution time is relatively straightforward. We have ways for all the above-mentioned platforms to measure the number of elapsed clock cycles (finally also on esp32/esp32-c3 @PhilippvK ;) ), which in turn can be converted to time. The hard part here is accessing these numbers. I personally see no consistent and cross-platform way of accessing them other than `printf`. And since the actual call to `printf` happens outside the "measurement window", I don't see any issues concerning timing. Of course, the impact of `printf` on application size is another story.
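To illustrate, here is a minimal sketch of such a measurement window (`read_cycles()` and `model_invoke()` are placeholders for the target-specific cycle source and the actual inference entry point, not existing MLonMCU symbols):

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholders: wire these up to the target-specific cycle source
 * (RISC-V CSRs, ESP timer, ...) and the actual inference entry point. */
extern uint64_t read_cycles(void);
extern void model_invoke(void);

int main(void) {
    uint64_t start = read_cycles();   /* measurement window opens */
    model_invoke();
    uint64_t stop = read_cycles();    /* measurement window closes */

    /* printf happens strictly outside the window, so it adds to the
     * program size but not to the measured execution time. */
    printf("cycles: %llu\n", (unsigned long long)(stop - start));
    return 0;
}
```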
As you pointed out @rafzi, measuring only the size of the application itself is hard. Do you have any idea how one could do that? Is there maybe a way to extract that information at link time somehow? Again, that would probably be extremely complicated and tedious. So getting the size of the application directly is not feasible; at least I would have no idea how to do that.
Thus, the only reasonable solution to that problem is the approach suggested by @PhilippvK: introducing a consistent way of creating executables (consistent usage of `printf`, etc.) across applications, executables, and platforms. This would ideally allow us to 1) measure/compare sizes between applications on the same platform, and 2) compare sizes of applications between platforms. This could be achieved by, as you suggested @rafzi, using a baseline application (image) and subtracting its size from the "target" application (image). I see your point with respect to other functions being pulled in by the `printf`s (and other driver functions) in the baseline and thus reducing the apparent size of the application. Simply removing the `printf`s from the baseline would mean, however, that the `printf` function size is counted towards the application size, which is generally undesirable (correct me if that's not the case). One solution could be to create two versions of the application: one with all necessary `printf`s and measurement harnesses, used to measure execution time, and a second "bare minimum" version, used to measure application size by subtracting the baseline size from its size (the baseline would, of course, also have to be "bare minimum").
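A rough sketch of how both versions could share one source file (`PROFILING` is a hypothetical compile-time switch, not an existing MLonMCU option); the application size would then be obtained as size(bare image) minus size(bare baseline image):

```c
#include <stdint.h>

extern void model_invoke(void);   /* placeholder for the inference entry point */

#ifdef PROFILING                  /* hypothetical switch, not an MLonMCU flag */
#include <stdio.h>
extern uint64_t read_cycles(void);
#endif

int main(void) {
#ifdef PROFILING
    uint64_t start = read_cycles();
#endif
    model_invoke();
#ifdef PROFILING
    /* Only the profiling build pulls in printf and the cycle source; the
     * bare build (compiled without -DPROFILING) is measured for size. */
    printf("cycles: %llu\n", (unsigned long long)(read_cycles() - start));
#endif
    return 0;
}
```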
I agree with you @rafzi that total image size is also very relevant. I would suggest that we should, wherever possible, generally consider both "application" and total "image" size. Maybe we can introduce a common term for both. I personally like the German terms "netto" and "brutto" size. But I am not that fond of their English counterpart's "Gross" and "Net" size. I am sure there are better names for those.
yes, great point @fabianpedd. i agree that such a solution with three builds seems like the best way for our measurements.
i don't think there is anything special to consider for multiple runs if we make sure to exclude "init/prepare" for tflm. while it would be possible to do some tricks in "invoke", it is supposed to be a pure function, also for compatibility with real-time systems.
The idea of building several binaries makes a lot of sense to me. Once we tackle this, we just need to find a good way to implement it in MLonMCU in a consistent way.
@rafzi Do you think we would need to execute both binaries in the end, to incorporate stack/heap usage?
> while it would be possible to do some tricks in "invoke", it is supposed to be a pure function, also for compatibility with real-time systems.
I am not sure if we can assume that, i.e. in a BYOC-generated TVM kernel someone (me ;-) ) might have done the following, as there is (or was?) no other way to achieve this:
```c
static int initialized = 0;
if (!initialized) {
    // ...
    initialized = 1;
}
```

In that case, the first `invoke` includes the one-time initialization and is therefore slower than the following ones.
We currently have a way to get the cycles for the actual execution of a model without having to measure `init` and `invoke` separately. It can be used with `--num 1 --num 2 --postprocess detailed_cycles`.
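If I understand the postprocess correctly, the idea is that two end-to-end measurements with different invocation counts suffice to separate the one-time setup cost from the per-inference cost. A sketch of the arithmetic (my understanding, not the actual postprocess implementation):

```c
#include <stdint.h>
#include <stdio.h>

/* Assuming total(n) = c_setup + n * c_invoke for n invocations, the two
 * end-to-end counts reported for --num 1 and --num 2 are enough to
 * recover both components. */
void split_cycles(uint64_t total_1, uint64_t total_2) {
    uint64_t c_invoke = total_2 - total_1;   /* cycles per inference  */
    uint64_t c_setup  = total_1 - c_invoke;  /* one-time setup cycles */
    printf("invoke: %llu cycles, setup: %llu cycles\n",
           (unsigned long long)c_invoke, (unsigned long long)c_setup);
}
```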
We only have a way to measure time/cycles for specific sections of code if there is a way to access this information at runtime (e.g. using RISC-V CSRs or a timer). As long as we do not have such a way for ETISS, we have no other choice to achieve this.
yes, for a meaningful stack usage report we'd need to execute the "realistic app without profiling code", because especially printf has a huge stack footprint.
that could be prevented by having a feature to turn etiss tracing on/off dynamically during execution. that would probably be blocked until there is a semi-hosting interface.
Some news: ETISS now finally supports CSR performance counters. I am going to implement them soon to have comparable measurements between all the RISC-V targets.
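For reference, reading such a counter on RV32 could look roughly like this (a sketch assuming the unprivileged `cycle`/`cycleh` CSRs are readable; M-mode software would use `mcycle`/`mcycleh` instead):

```c
#include <stdint.h>

/* Read the 64-bit cycle counter on RV32 via the cycle/cycleh CSRs.
 * Re-reading the high word guards against a low-word wraparound
 * between the two reads. */
static inline uint64_t read_cycles(void) {
    uint32_t lo, hi, hi2;
    do {
        asm volatile("csrr %0, cycleh" : "=r"(hi));
        asm volatile("csrr %0, cycle"  : "=r"(lo));
        asm volatile("csrr %0, cycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```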
However, I will also add a config which can still enable the usual (end-to-end) total cycles measurement.
The discussed feature (compiling the model several times for different measurements) is still a TODO.
Might reopen at some point in time
This issue should document the following problem:

Currently we use several different ways to access the number of cycles/instructions for executing a model:

- `spike`: Use RISC-V performance counters at runtime to measure the elapsed cycles during `main` - introducing a rather large ROM overhead, as we have to link `printf` etc. even in Release mode. (RAM/cycles overhead should be negligible)
- `ovpsim`: Parse stdout for metrics printed AFTER the simulation. However, as the target_sw is the same as used by `spike`, the same overheads are expected. (This could be changed easily)
- `etiss_pulpino`: Parse stdout for metrics printed AFTER the simulation. (Alternative: use the JSON file which can be generated by the VP) - This approach does not rely on `printf` and thus leads to much smaller program sizes. Once performance counters are implemented here as well, we could use them instead to be consistent.
- `corstone300`: Similar to `spike`/`ovpsim`
- `esp32`/`esp32c3`: Using the ESP timer and printing the elapsed cycles via UART - similar overheads for `printf`/string handling + additional drivers for UART, ...

There is probably no good solution to this: even if every simulator we use provided a way to access those metrics without using `printf`, that would still not be applicable to real hardware targets. Maybe we should agree on a consistent way to get the cycles/instructions, e.g. using `printf` etc. in every program, to make the targets more comparable. What do you think @rafzi @fabianpedd?

Another discussion: Should we only measure the time spent in `main()` or also the time spent in the bootloader/startup code? Currently the approach is different from target to target.