openpower-cores / a2i


Questions regarding testing A2I with Coremark on FPGA without OS #31

Closed xinyu8888 closed 3 years ago

xinyu8888 commented 3 years ago

Hi @openpowerwtf, we are currently testing A2I with Coremark (https://github.com/eembc/coremark) on an FPGA without an OS or L2 cache (in our case, the core jumps to the Coremark program's address space and starts running it as soon as boot.s finishes). The highest score we have achieved is 1.33 CoreMark/MHz, compiled with -Ofast. We have not been able to get our hardware platform working with an OS, so I'm wondering whether running Coremark under an OS would improve the final score. Also, how much does the lack of an L2 cache hurt the Coremark result? We couldn't find a Linux kernel that is compatible with the a2i core, and doing a complete kernel port and trim-down from scratch would be very time-consuming. Is there a good reference kernel you can recommend? Thanks!

xinyu8888 commented 3 years ago

@openpowerwtf the reason we care so much about the Coremark result is that we need good experimental results to show upper management that this A2 core project is worth investigating and putting a lot of manpower into. That's why we are trying so hard to improve the Coremark score: to prove that this core is a good one and get management's approval. Otherwise, we might have to drop this project and look in other directions. We really hope we can get some support from your team on this.

openpowerwtf commented 3 years ago

Linux: To experiment with the current logic, you have to roll your own (possibly ~v2.6 or so). Recent Power kernels use vector ops, radix, etc., which obviously aren't implemented. It would be more productive to start working on the updates required for compliance (submit pull requests 👍 ), which would also benefit A2O.

Coremark: I don't think there would be significant change with L2; I doubt much is happening on the bus after the first iteration except the stores (write-through).

What number are you trying to reach? Your results are ~5300/s at 4 GHz. I think you might get about 2.25x for SMT4, which would be ~12K. A2I was designed for overall throughput, not single-thread performance. It doesn't issue 2/cycle unless there are multiple threads and one is FP. It's also deeply pipelined for 27 FO4, so it's expected to run at a fairly high frequency. You could experiment with the branch-prediction bits in IUCR0, but that probably wouldn't change results much.
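For anyone who wants to redo the arithmetic behind those numbers, here is a minimal sketch. The 1.33 CoreMark/MHz score, the 4 GHz clock, and the ~2.25x SMT4 factor come from this thread; the function names are made up for illustration, and the SMT4 factor is an estimate, not a measurement.

```python
# Scale a per-MHz CoreMark score to iterations/second at a target clock,
# then apply an estimated multithreading speedup.
# All inputs are figures quoted in this thread; the 2.25x SMT4 factor is a guess.

def coremark_per_second(score_per_mhz: float, freq_mhz: float) -> float:
    """CoreMark iterations per second at the given clock frequency."""
    return score_per_mhz * freq_mhz

def with_smt(single_thread_score: float, speedup: float) -> float:
    """Aggregate score when several threads share the core."""
    return single_thread_score * speedup

single = coremark_per_second(1.33, 4000.0)  # 4 GHz = 4000 MHz
smt4 = with_smt(single, 2.25)

print(f"single thread: {single:.0f}/s, SMT4 estimate: {smt4:.0f}/s")
```

That gives roughly 5300/s single-threaded and about 12K/s for four threads, matching the rounded figures above.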

The customizable AXU interface promotes the creation of tightly-coupled application-specific hardware to boost per-core performance (see BlueGene Q). It also makes it easy to have heterogeneous cores to balance system power/perf. Plus you can mix-and-match A2I and A2O since they use the same A2L2 interface.

zhaoxiahust commented 3 years ago

For in-order processors, the CoreMark score is usually around 2 CoreMark/MHz. Since the A2I is an in-order core, I would not expect it to go beyond 2 CoreMark/MHz. To identify what is causing the current score of 1.33 CoreMark/MHz, I think it is necessary to report some performance metrics such as the TLB miss rate, memory accesses per kilo-instruction, L1 cache miss rate, and so on. That would make it possible to find the performance bottleneck.

openpowerwtf commented 3 years ago

@zhaoxiahust Keep digging!

There would be few if any TLB misses since the memory footprint is tiny - and no translation misses if you are running ERAT-only mode. You can count the reads and ifetches with the ILA on the AXI bus; there will probably be some minor stack activity after the caches warm up but not much else. Otherwise, everything is happening within the core.

See chapter 11 of UM - there are lots of inner core debug/performance events to capture. You can set that up and trace with ILA. An NIA trace would be the best - you could then calculate a histogram of latencies, plus find the NIAs where the most total cycles are being used.
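A rough sketch of that post-processing, assuming the ILA capture has already been exported as a list of (cycle, NIA) samples, one per completed instruction. That trace format, the function name, and the sample addresses are hypothetical; the real capture layout depends on how the debug events are set up.

```python
from collections import Counter

def analyze_nia_trace(trace):
    """Build a latency histogram and per-NIA cycle totals.

    `trace` is a list of (cycle, nia) tuples in time order (assumed format).
    The gap between consecutive samples is charged to the later NIA.
    """
    latency_hist = Counter()   # latency in cycles -> number of occurrences
    cycles_by_nia = Counter()  # NIA -> total cycles charged to it
    for (prev_cycle, _), (cycle, nia) in zip(trace, trace[1:]):
        delta = cycle - prev_cycle
        latency_hist[delta] += 1
        cycles_by_nia[nia] += delta
    return latency_hist, cycles_by_nia

# Made-up samples: a short loop where 0x1008 stalls for 4 cycles each time.
trace = [(0, 0x1000), (1, 0x1004), (5, 0x1008),
         (6, 0x100C), (7, 0x1004), (11, 0x1008)]

hist, hot = analyze_nia_trace(trace)
for nia, cycles in hot.most_common(3):
    print(f"NIA 0x{nia:X}: {cycles} cycles")
```

Sorting `cycles_by_nia` by total cycles points at the instructions worth looking at first; the histogram shows whether the time goes to a few long stalls or many short ones.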

Are you comparing results to similar-length pipes, or short pipes that can't reach high frequency? What about multithreaded capabilities? Implementation matters. High-speed processors pay area/clock/branch mispredict penalties to get greater overall performance. Multithreaded processors pay an area penalty but use hardware that would otherwise be idle.

Just a guess, but two likely suspects - load-use penalty, and branch mispredicts. Both would be partially covered by running multithreaded.

zhaoxiahust commented 3 years ago

Yes, a cow or tens of chickens. A2I seems to be a chicken so multithreading matters a lot.