tum-ei-eda / etiss

Extendable Translating Instruction Set Simulator
https://tum-ei-eda.github.io/etiss/

check if forced misalignment is performance issue #82

Closed rafzi closed 2 years ago

rafzi commented 3 years ago

Since we use struct packing for the ETISS_CPU struct, its members should be placed carefully to avoid misaligned memory accesses.

for example:

https://github.com/tum-ei-eda/etiss/blob/master/include_c/etiss/jit/CPU.h#L101

-> consider making ETISS_MAX_RESOURCES a multiple of 8.

https://github.com/tum-ei-eda/etiss/blob/master/ArchImpl/RISCV/RISCV.h#L77

-> this should be considered in the m2isar code generator. Simple solution: sort member definitions by size, descending.
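To illustrate the concern, here is a minimal sketch of how `#pragma pack(1)` places members back to back and how sorting by size changes the offsets. The struct and member names are hypothetical and do not reflect the real ETISS_CPU layout; the offsets are relative to the struct base, which is assumed to be suitably aligned itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
/* Hypothetical layout (not the real ETISS_CPU): small member first. */
struct unsorted_regs {
    uint8_t  flag;     /* offset 0 */
    uint64_t counter;  /* offset 1: not a multiple of 8 */
    uint32_t mode;     /* offset 9: not a multiple of 4 */
};

/* Same members, sorted by size descending. */
struct sorted_regs {
    uint64_t counter;  /* offset 0 */
    uint32_t mode;     /* offset 8 */
    uint8_t  flag;     /* offset 12 */
};
#pragma pack(pop)

/* Returns 1 if a member of `size` bytes at `offset` is naturally aligned
 * relative to the struct base, 0 otherwise. */
static inline int is_naturally_aligned(size_t offset, size_t size)
{
    return offset % size == 0;
}
```

With packing enabled, the compiler inserts no padding, so each offset is simply the sum of the preceding member sizes. Sorting by size descending keeps every member naturally aligned as long as all sizes are powers of two.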

fpedd commented 3 years ago

Just out of curiosity: What member inside this struct could cause misaligned memory accesses? With the exception of the last member (which is 32-bit), the struct only contains 64-bit datatypes and pointers (which should be native on both 32-bit and 64-bit machines). So there should be no issue with misaligned memory accesses, even when packing with #pragma pack(1)? What am I missing? To what extent would making ETISS_MAX_RESOURCES a multiple of 8 help?

One thing I would like to point out when reordering struct members is the temporal and spatial locality of the memory accesses to those members. Sorting the struct members by size could lead to increased cache conflicts/misses and potentially diminish gains from aligned accesses.

rafzi commented 3 years ago

My example was not quite correct. I thought the member was just a string and didn't see the pointer.

Still, the last 32-bit member causes misalignment for any 64-bit members of the architecture-specific CPU definitions. Explicit padding could be added.
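A small sketch of that effect, assuming the architecture-specific struct embeds the base struct as its first member. All names here are hypothetical stand-ins, not the actual ETISS definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
/* Hypothetical base struct: 64-bit members followed by one 32-bit member. */
struct base_cpu {
    uint64_t instruction_pointer;
    uint64_t cpu_time;
    uint32_t mode;              /* total size is 20, not a multiple of 8 */
};

/* Hypothetical architecture-specific struct that extends the base. */
struct arch_cpu {
    struct base_cpu cpu;
    uint64_t regs[4];           /* starts at offset 20 */
};

/* Same base struct with explicit padding after the 32-bit member. */
struct base_cpu_padded {
    uint64_t instruction_pointer;
    uint64_t cpu_time;
    uint32_t mode;
    uint32_t reserved_padding;  /* brings the size to 24 */
};

struct arch_cpu_padded {
    struct base_cpu_padded cpu;
    uint64_t regs[4];           /* starts at offset 24 */
};
#pragma pack(pop)

/* Compile-time checks of the offsets described in the comments above. */
static_assert(offsetof(struct arch_cpu, regs) % 8 == 4,
              "without padding, regs sits 4 bytes past an 8-byte boundary");
static_assert(offsetof(struct arch_cpu_padded, regs) % 8 == 0,
              "with padding, regs is 8-byte aligned");
```

The padding member costs four bytes per CPU instance and keeps the layout explicit, which matters when generated JIT code relies on fixed offsets.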

Yes, there is lots of complexity to consider on x86, so it is often easier to just measure than theorize. The benefit of being in the first cache line should be there for the instructionPointer, but the architecture specific registers are far away from the base pointer.
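As a rough illustration of the cache-line point, the sketch below computes which cache line a member falls into. It assumes a 64-byte cache line and a struct base aligned to a cache line boundary; the layout and names are hypothetical and much smaller than the real structs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: typical x86 cache line size */

#pragma pack(push, 1)
/* Hypothetical base struct with a bulky member in the middle. */
struct base_cpu {
    uint64_t instruction_pointer;  /* offset 0 */
    uint64_t bulk[100];            /* offsets 8..807 */
    uint32_t mode;                 /* offset 808 */
};

/* Hypothetical architecture-specific extension. */
struct arch_cpu {
    struct base_cpu cpu;
    uint64_t regs[32];             /* offset 812 */
};
#pragma pack(pop)

/* Index of the cache line containing `offset`, counted from the struct base. */
static inline size_t cache_line_index(size_t offset)
{
    return offset / CACHE_LINE_BYTES;
}
```

In this layout the instruction pointer lands in line 0, while the first architecture-specific register lands in line 12, so accessing both touches two different cache lines.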

Samanti-Das commented 2 years ago

I ran some tests using the Dhrystone benchmark with 1000000 runs and compared the MIPS value across several permutations of the members within the structs:

Original MIPS (without any change): GCC: 6.72723; TCC: 6.25384;

Permutation 1: (32 bit first, rest all 64 bit for struct in CPU.h : https://github.com/tum-ei-eda/etiss/blob/master/include_c/etiss/jit/CPU.h#L101 ): MIPS: GCC: 6.8502; TCC: 6.38286;

Permutation 2: (32 in between 64 bits in CPU.h): MIPS: GCC: 6.63403; TCC: 6.21875;

Permutation 3: (32 in between 64 bits in CPU.h): MIPS: GCC: 6.71754; TCC: 6.26565;

Permutation 4: (64 bit first and then 32 bit in RISCV.h : https://github.com/tum-ei-eda/etiss/blob/master/ArchImpl/RISCV/RISCV.h#L77 ): MIPS: GCC: 6.63754; TCC: 6.22565;

Permutation 5: (64 bit in between 32 bits in RISCV.h): MIPS: GCC: 6.7102; TCC: 6.234;

At least in the above tests, the member order does not make any significant difference to performance: every result is within roughly 2% of the original MIPS value.
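For reference, the relative deviations can be computed from the numbers above with a few lines of C. This is only a sketch: the helper names are made up for illustration, and the thread does not report run-to-run variance, so it says nothing about statistical significance.

```c
#include <assert.h>
#include <stddef.h>

/* MIPS values copied from the measurements above (permutations 1 to 5). */
static const double gcc_baseline = 6.72723;
static const double gcc_mips[] = {6.8502, 6.63403, 6.71754, 6.63754, 6.7102};
static const double tcc_baseline = 6.25384;
static const double tcc_mips[] = {6.38286, 6.21875, 6.26565, 6.22565, 6.234};

/* Relative change of `value` against `baseline`, in percent. */
static inline double percent_change(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}

/* Largest absolute percent change among `n` values. */
static inline double max_abs_percent_change(double baseline,
                                            const double *values, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double change = percent_change(baseline, values[i]);
        if (change < 0.0)
            change = -change;
        if (change > worst)
            worst = change;
    }
    return worst;
}
```

Calling `max_abs_percent_change(gcc_baseline, gcc_mips, 5)` gives a little over 1.8%, and the TCC equivalent a little over 2%, both from permutation 1.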

rafzi commented 2 years ago

Thanks for testing! This does indeed seem insignificant currently.