nicocvn / cppreg

A C++11 header-only library for MMIO registers
https://nicocvn.github.io/cppreg/
Other
60 stars 6 forks source link

Document cppreg (zero) impact on performance and code size. #1

Closed sendyne-nicocvn closed 6 years ago

sendyne-nicocvn commented 6 years ago

The main goal is to provide a document (e.g., Performance.md) where we show assembly outputs for various level of optimizations (and possibly compilers) to illustrate that cppreg does not affect runtime performance or code size.

This requires:

We could put the various materials in a benchmark directory to not pollute the main one.

hak8or commented 6 years ago

As I was cleaning up an old project of mine, I started thinking it might be a decent example showing a full use case comparing CMSIS and cppreg. On an unrelated note, the minimal example to do a clock tree init and get a LED toggling at ~1hz is only ~280 bytes!

For the performance.md, my idea is to include a godbolt based example for ::write, ::chained_write.with.with, and ::is_set (so four operations total) in terms of the C++ code and assembly output using -Os. This would be used to show that the resulting instructions cannot be optimized further.

Then, a full example, possibly implementing clock tree initialization and then some UART transmissions and LED blinking, with various optimization flags (-O0,-Os,-O1,-O2, -O3, -Og) for cppreg vs CMSIS. The data would be presented in the form of two tables, each with two columns (cppreg vs CMSIS) and 6 rows (optimization levels) table. The first table would be total size of the resulting binary in bytes, and the second table somehow showing how long it takes to do the clock init and a few million loops of the LED toggling and UART sending, in milliseconds or cycles.

sendyne-nicocvn commented 6 years ago

Sounds good to me.

The clock tree + UART + LED would be a great example. I assume 280 bytes is without startup code (i.e., no interrupts table and SystemInit) but that would actually be clearer that way. For the data this seems like a good start. Once we collect them we can decide on how to organize them.

I created the performance branch so that we can start putting code in the repository.

hak8or commented 6 years ago

That is actually with the startup code and everything. Check main.cpp, I wrote my own attempt at startup code. It doesn't work with constructors or destructor yet (only meant for a minimal c example), but it does get a led blinking.

On Feb 23, 2018 5:56 AM, "Sendyne Principal Scientist" < notifications@github.com> wrote:

Sounds good to me.

The clock tree + UART + LED would be a great example. I assume 280 bytes is without startup code (i.e., no interrupts table and SystemInit) but that would actually be clearer that way. For the data this seems like a good start. Once we collect them we can decide on how to organize them.

I created the performance http:///sendyne/cppreg/tree/performance branch so that we can start putting code in the repository.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sendyne/cppreg/issues/1#issuecomment-367977537, or mute the thread https://github.com/notifications/unsubscribe-auth/AA5gCm9pWRk1c2Mx3ZGCafu-1frc_LRWks5tXplRgaJpZM4SPiMk .

hak8or commented 6 years ago

Scenario

I decided to heavily leverage godbolt in the comparison because I feel it will be much more accessible to anyone who will want to go through the edit-compile-lookatassembly loop themselves. They won't have to download a compiler, will be easier to show comparisons, etc.

Anyways, for a decent yet simple comparison it would be good to turn a led on, and then in a loop turn the led off, wait a bit turn the led on, and loop back. This is small enough that in terms of code it fits in a screen in terms of lines of code, is easy to understand by everyone, and doesn't bog the example down in terms of setting up a lot of registers (which would be needed to do a full demo including clocking and power gating).

Progress

Here is what I am working with right now (the URL is so long that even neither google nor bit.ly can shorten it!). It looks very similar but there are two interesting differences.

I want to focus a bit on #5 before continuing on writing up a performance comparison. @sendyne-nclauvelin Do let me know if you think the comparison scenario is appropriate though so the first task can be marked as complete.

hak8or commented 6 years ago

On a somewhat related note, I wonder if it would be worthwhile to include this in a larger (many registers) comparison somehow as another metric when looking at the assembly alone is not feasible.

sendyne-nicocvn commented 6 years ago

Subsequent to #6, #8 and #9 I think we can resume the performance assessment. I forwarded the performance branch to the current state.

hak8or commented 6 years ago

Agreed, will aim to have it in a good state by tonight.

hak8or commented 6 years ago

I was working and spotted two "issues" that cause differences.

  1. For registers which have a read only field and you are writing to another field, the bits relevant to the read only field are being cleared. I do not see why there should be more work done (clearing it) compared to just writing the old relevant bits back. I do not recall ever seeing a register that contains a read only field which required writing back 0's to the associated field (which would make it not read only), instead just that all writes were ignored. This does introduce overhead I feel because with cppreg as-is you cannot disable it (write back old values instead of clearing). What was your line of reasoning for clearing the bits?

  2. The lack of expressing related registers (CMSIS style structures mapped to memory) sadly does introduce "overhead" in the form of the compiler sticking two close addresses in .text and then overwriting an old CPU register with the new address (introducing more register pressure), instead to just doing memory interfacing using offsets.

This is the example I am planning to use for the comparison.

Ignoring those two issues, the assembly is identical.

sendyne-nicocvn commented 6 years ago

For the second point this relates to #7 and I have first to create an example which isolate the issue to understand a bit better what is exactly happening. For the first point can you provide a minimal example because I am not clear on what you are describing?

sendyne-nicocvn commented 6 years ago

Also in your example you have:

UART::STATUS::merge_write<UART::STATUS::Enable>(1).with<UART::STATUS::Sending>(1).done();

Using:

UART::STATUS::merge_write<UART::STATUS::Enable,1>().with<UART::STATUS::Sending,1>().done();

simplifies the generated assembly (look at the L3 branch). This an important detail ... cppreg uses a faster implementation if the data are known at compile time and to indicate that you need to use the template version. This is also true for regular write calls.

At this point the only difference is how offsets between registers is managed in cppreg. Will work on that for now.

hak8or commented 6 years ago

Ah, I didn't realize that could be done via template arguments for merge writes (is that part of the new API changes you did recently?).

The minimal example is here. Looking at it again, I realized I misread the assembly originally. It looked like the BICS instruction was clearing bits that are in the readonly field, but turns out it was actually clearing the bits that were being actually written to via the merge write. In that case, I agree that your fix (via template arg) does fix the issue.

When/if the register offsets concept gets put in, then the assembly should finally match for pretty much all use cases, hopefully.

sendyne-nicocvn commented 6 years ago

Ah, I didn't realize that could be done via template arguments for merge writes (is that part of the new API changes you did recently?).

No this was already there. As part of #10 I added more details to the API documentation regarding this particular point.

hak8or commented 6 years ago

What do you think as of 08bc98cc2ffd117bd74b960d84cfd68ac2800d9b?

sendyne-nicocvn commented 6 years ago

Looks good. I actually prefer that we only present a small example rather than a lengthy and complex one. I will fix some typos and tie it with the README.

hak8or commented 6 years ago

Awesome!

sendyne-nicocvn commented 6 years ago

The performance comparison is now available in the master branch so I consider this issue closed.