I tested the end-to-end output on an ARM machine, but the performance does not match the results reported in the paper. The measured numbers for llama.cpp and T-MAC are nearly the same. I've posted the measured data below.
The machine runs at 2.5 GHz, and its bandwidth is 2.6 G/s per core.