seL4 / microkit

Microkit - A simple operating system framework for the seL4 microkernel

loader.py: region alignment to support -O3 #16

Closed malus-brandywine closed 1 year ago

malus-brandywine commented 1 year ago

Compilation of loader.c with the optimization flag -ftree-loop-vectorize (included in -O3) generates code that uses 128-bit q registers while copying regions from the image to lower addresses.

To make the source address "base + offset" a multiple of 16, the function 'write_image' now rounds up the offset and, if needed, fills in the gaps between regions during image building.

malus-brandywine commented 1 year ago

@Ivan-Velickovic, you might want to add this to your 'dev' branch.

Ivan-Velickovic commented 1 year ago

@malus-brandywine Did you experience this when trying to run a particular platform or is this something you noticed by looking at the code? To me it seems odd that the compiler would make an optimisation based on assumptions about the alignment of the addresses. However, I do not know much about C compiler optimisations.

malus-brandywine commented 1 year ago

@Ivan-Velickovic I experienced it when I ran an example application on Ultra96V2 (I added the board locally). I got a "data abort" exception with an "alignment" fault reason and investigated the disassembled code. Long story short: the memcpy function

https://github.com/BreakawayConsulting/sel4cp/blob/0441237b6f5aea78c739f9850e95d4bd194dde6b/loader/src/loader.c#L106

which copies regions byte by byte is optimized to copy 8 bytes per operation using a q register; the remainder (mod 8) is copied byte by byte. Specifically, the -ftree-loop-vectorize option causes this behaviour.

Ivan-Velickovic commented 1 year ago

It looks like this also affects the QEMU ARM virt platform; I hit it when trying to run a hello world with optimisations turned on. I can confirm that adding -fno-tree-loop-vectorize to the flags passed when compiling the loader fixes the issue I was having. I will try your patch as well. Thanks for properly looking into the issue and reporting it.

bennoleslie commented 1 year ago

I expect that changing the data layout to be 16-byte aligned is a good change that we should implement.

However, it should be just a performance improvement, not a correctness fix.

I think we need to work out why the compiler thinks it can optimize this memcpy in the way it is. Certainly, in general, memcpy is not specified in a way that should require alignment in this manner, so we have a bug somewhere that makes the compiler think the pointers being passed are aligned. Before changing the system generation to force them to be aligned, I'd like to understand where the bug is. It's possible this could bite us somewhere else at some point in time.

malus-brandywine commented 1 year ago

@bennoleslie Thanks for the heads-up, I'll look into it.

malus-brandywine commented 1 year ago

The loop inside memcpy() looks ideal for vectorization.

The compiler performs analysis of the loop and the data properties and eventually chooses the maximum vectorization factor: 16 units (chars).

Further alignment analysis assumes that the architecture allows unaligned accesses, so the loop is vectorized without taking alignment into consideration. The data is copied first in 16-byte chunks, then one 8-byte piece, and the remainder byte by byte.

If we set the flag "-mstrict-align", the compiler peels a few loop iterations so that subsequent accesses are aligned to 16 bytes, and then proceeds as described above.

So, if we can feed such ideal loops aligned data (as in this specific case), we will get more efficient vectorization. My suggestion would be to avoid using "-mstrict-align" at the project level; when we encounter a case that cannot be aligned, we can use GCC option pragmas:

#pragma GCC push_options
#pragma GCC target ("strict-align")
…
#pragma GCC pop_options

around the function with the loop.


Thoughts? Comments welcome.

gernotheiser commented 1 year ago

Will vectorising lead to the use of FPU registers? If so, it should be avoided in systems code, to avoid having to context-switch the FPU on server calls.

malus-brandywine commented 1 year ago

That's correct: vectorization leads to the use of FP/SIMD registers. To avoid that, we add the compilation flag -fno-tree-vectorize when optimization -O3 is on (for gcc 10.2).

Ivan-Velickovic commented 1 year ago

This optimisation was observed in the loader, which runs before seL4, so I don't see any reason to turn it off there. Any user code that runs after seL4 has started has to decide whether or not it wants to use the FPU. Since the libsel4cp library is linked into all PDs and is therefore user code, we need to make sure that it does not use any FPU/SIMD registers. However, I highly doubt that it would, given the contents of the library.

Ivan-Velickovic commented 1 year ago

... we need to make sure that it does not use any FPU/SIMD registers. However, I highly doubt that it would given the contents of the library.

We could also give -mgeneral-regs-only to the compiler when compiling the library to make sure it does not use any floating-point or SIMD registers (according to https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html).
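As a sketch, a build-system change along these lines would do it (the variable names here are hypothetical, not taken from the actual Makefiles):

```make
# Hypothetical fragment: compile only the libsel4cp objects with
# -mgeneral-regs-only so GCC refuses to allocate FP/SIMD registers
# and errors out on any code that would need them.
LIBSEL4CP_CFLAGS := $(CFLAGS) -mgeneral-regs-only
```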

malus-brandywine commented 1 year ago

I made a commit adding -mgeneral-regs-only, mainly as a reminder to keep the feature in the repository that will eventually become "main"; the current version of Ivan's sel4cp repository is already far ahead.

I hope this closes the issue. I would be happy to drop my local fork when the changes are merged. Thank you, @Ivan-Velickovic @bennoleslie @gernotheiser

malus-brandywine commented 1 year ago

Closing the PR; the changes will be suggested for @Ivan-Velickovic's sel4cp repository, dev branch.