tum-ei-eda / mlonmcu

Tool for the deployment and analysis of TinyML applications on TFLM and MicroTVM backends
Apache License 2.0

Printing the number of Instructions/Cycles increases the program size #34

Closed PhilippvK closed 1 year ago

PhilippvK commented 2 years ago

This issue documents the following problem:

Currently, we use several different ways to access the number of cycles/instructions for executing a model:

There is probably no good solution to this: even if every simulator we use provided a way to access those metrics without printf, that would still not be applicable to real hardware targets.

Maybe we should agree on a consistent way to get the cycles/instructions, e.g. using printf, in every program to make the targets more comparable. What do you think @rafzi @fabianpedd?

Another discussion:

rafzi commented 2 years ago

To have comparable targets, the measurement would ideally only cover the time spent executing the model and the memory usage caused by the model. However, this is probably quite challenging. We could use a "baseline" application to compare against, but it should not include any drivers or print code that would not be present in a realistic deployment scenario. For example, including printf in both could pull in other functions that would also have been pulled in by the deployed model inference, thereby underestimating the model size.

The total deployed sizes are of course still a useful metric, because they determine what is actually deployable on a real target.

fpedd commented 2 years ago

Concerning execution time:

I would say main() or the time spent in the bootloader/startup code should not count towards "application" (i.e. model) execution time. For most application scenarios, they are only encountered once when the system boots and are thus negligible. We should mention that somewhere and then stay consistent when measuring execution times, starting and stopping the measurement immediately before and immediately after the execution of the application.

Another thing we might want to consider is speedups resulting from repeated execution of the same application. Keras and TF, I believe, perform certain runtime optimizations that result in a significant speedup for the 2nd and following runs. But I believe no such optimizations happen at runtime on microcontrollers. @rafzi, you probably know a lot more about that.

Compared to measuring application size, the actual process of measuring application execution time is relatively straightforward. We have ways for all the above-mentioned platforms to measure the number of elapsed clock cycles (finally also on esp32/esp32-c3 @PhilippvK ;) ), which in turn can be converted to time. The hard part is accessing these numbers. I personally see no consistent and cross-platform way of accessing them other than printf. And since the actual call to printf happens outside the "measurement window", I don't see any issues concerning timing. Of course, the impact of printf on application size is another story.

Concerning size:

As you pointed out, @rafzi, measuring only the size of the application is hard. Do you have any idea how one could do that? Is there maybe a way to extract that information at link time? Again, that would probably be extremely complicated and tedious. So getting the size of the application directly is not feasible; at least I would have no idea how to do that.

Thus, the only reasonable solution to that problem is the approach suggested by @PhilippvK: introducing a consistent way of creating executables (consistent usage of printf, etc.) across applications, executables, and platforms. This would ideally allow us to 1) measure/compare sizes between applications on the same platform, and 2) compare sizes of applications between platforms. This could be achieved by, as you suggested @rafzi, using a baseline application (image) and subtracting its size from the "target" application (image).

I see your point with respect to other functions being pulled in by printfs (and other driver functions) in the baseline, thus reducing the apparent size of the application. Simply removing the printfs from the baseline would mean, however, that the printf function size is counted towards the application size, which is generally undesirable (correct me if that's not the case). One solution could be to create two versions of the application: one with all necessary printfs and measurement harnesses, used to measure execution time; and a second "bare minimum" version, used to measure application size by subtracting its size from the baseline size (the baseline would, of course, also have to be "bare minimum").

I agree with you @rafzi that the total image size is also very relevant. I would suggest that we should, wherever possible, consider both the "application" size and the total "image" size. Maybe we can introduce a common term for both. I personally like the German terms "netto" and "brutto". But I am not as fond of their English counterparts, "net" and "gross" size. I am sure there are better names for those.

rafzi commented 2 years ago

yes, great point @fabianpedd. i agree that such a solution with three builds seems like the best way for our measurements.

i don't think there is anything special to consider for multiple runs if we make sure to exclude "init/prepare" for tflm. while it would be possible to do some tricks in "invoke", it is supposed to be a pure function, also for compatibility with real-time systems.

PhilippvK commented 2 years ago

The idea of building several binaries makes a lot of sense to me. Once we tackle this, we just need to find a good way to implement it in MLonMCU in a consistent way.

@rafzi Do you think we would need to execute both binaries in the end, to incorporate stack/heap usage?

while it would be possible to do some tricks in "invoke", it is supposed to be a pure function, also for compatibility with real-time systems.

I am not sure if we can assume that, i.e. in a BYOC-generated TVM kernel someone (me ;-) ) might have done the following, as there is (or was?) no other way to achieve this:

static int initialized = 0;
if (!initialized) {
    // ...
    initialized = 1;
}

We currently have a way to get the cycles for the actual execution of a model without having to measure init and invoke separately. It can be used with --num 1 --num 2 --postprocess detailed_cycles.

We only have a way to measure time/cycles for specific sections of code if there is a way to access this information at runtime (e.g. using RISC-V CSRs or a timer). As long as we do not have such a way for ETISS, we have no other choice to achieve this.

rafzi commented 2 years ago

yes, for a meaningful stack usage report we'd need to execute the "realistic app without profiling code", because especially printf has a huge stack footprint.

that could be prevented by having a feature to turn etiss tracing on/off dynamically during execution. that would probably be blocked until there is a semi-hosting interface.

PhilippvK commented 2 years ago

Some news: ETISS now finally supports CSR performance counters. I am going to implement them soon to have comparable measurements across all the RISC-V targets.

However, I will also add a config option which can still enable the usual (end-to-end) total cycles measurement.

The discussed feature (compiling the model several times for different measurements) is still a TODO.

PhilippvK commented 1 year ago

Might reopen at some point in time