Closed rafaeldelucena closed 7 years ago
@lu-zero
I've successfully tested the implementation on both little- and big-endian architectures. I'm also using vec_vsx_ld and vec_vsx_st to load and store data, so I'll mark the "Test for endianness and unaligned addresses issues" item as checked for now. I think it's ready for review. :)
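For readers unfamiliar with these intrinsics, here is a minimal sketch of an unaligned load/store pair. It is not code from this PR: the helper name copy8_s16_unaligned is hypothetical, and the memcpy fallback exists only so the snippet also builds on targets without VSX.

```c
#include <stdint.h>
#include <string.h>

#if defined(__VSX__)
#include <altivec.h>
#endif

/* Hypothetical helper (not part of the PR): copies eight int16_t values
 * between addresses that need not be 16-byte aligned. */
void copy8_s16_unaligned(const int16_t *src, int16_t *dst) {
#if defined(__VSX__)
  /* vec_vsx_ld / vec_vsx_st accept unaligned addresses, unlike the
   * classic AltiVec vec_ld / vec_st, which ignore the low address bits. */
  const __vector signed short v = vec_vsx_ld(0, src);
  vec_vsx_st(v, 0, dst);
#else
  /* Portable fallback so the sketch compiles on non-VSX targets. */
  memcpy(dst, src, 8 * sizeof(*src));
#endif
}
```

Because the load and the store use the same element order, a round trip like this behaves identically on little- and big-endian builds.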
This is a first version and I haven't applied any further optimizations. If you find it necessary, I can make vpx_transpose_8x8_s16 and vpx_hadamard_8x8_s16_one_pass inline functions. I can also use an aligned buffer and switch to the simple load and store functions for char arrays, to reduce the permutation overhead of the instructions currently used.
Using the test harness I'm not seeing any speedup; I guess some more work might be needed.
It seems the test harness does not easily double as a benchmark harness. Do you have the alternative benchmark code ready to go into the tree?
I have some test code here; I'll push it by the end of the day.
The benchmark code is very simple: it runs the Hadamard implementation with multiple strides, from 10^1 to 10^6 times, on the same input matrix.
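A rough sketch of what such a benchmark could look like follows. It is not the code that was pushed: the stride values, the function names, and the scalar hadamard_8x8_ref stand-in (a plain, unscaled, natural-order Walsh-Hadamard transform whose output ordering may differ from libvpx's) are all assumptions made for illustration. The function-pointer signature is assumed to be (src, stride, coeff).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Assumed signature for the function under test. */
typedef void (*hadamard_fn)(const int16_t *src, ptrdiff_t stride,
                            int16_t *coeff);

/* In-place 8-point Walsh-Hadamard butterflies (natural order, unscaled). */
static void wht8(int32_t v[8]) {
  for (int half = 1; half < 8; half <<= 1) {
    for (int i = 0; i < 8; i += 2 * half) {
      for (int j = i; j < i + half; ++j) {
        const int32_t a = v[j], b = v[j + half];
        v[j] = a + b;
        v[j + half] = a - b;
      }
    }
  }
}

/* Hypothetical scalar stand-in: transforms columns, then rows. */
void hadamard_8x8_ref(const int16_t *src, ptrdiff_t stride, int16_t *coeff) {
  int32_t tmp[8][8];
  for (int c = 0; c < 8; ++c) {
    int32_t v[8];
    for (int r = 0; r < 8; ++r) v[r] = src[r * stride + c];
    wht8(v);
    for (int r = 0; r < 8; ++r) tmp[r][c] = v[r];
  }
  for (int r = 0; r < 8; ++r) {
    wht8(tmp[r]);
    for (int c = 0; c < 8; ++c) coeff[r * 8 + c] = (int16_t)tmp[r][c];
  }
}

/* Runs fn `iterations` times on the same input; returns CPU seconds. */
double bench_hadamard(hadamard_fn fn, const int16_t *src, ptrdiff_t stride,
                      long iterations) {
  int16_t coeff[64];
  volatile int sink = 0; /* keeps the calls from being optimized away */
  assert(iterations > 0);
  const clock_t start = clock();
  for (long i = 0; i < iterations; ++i) {
    fn(src, stride, coeff);
    sink ^= coeff[i & 63];
  }
  (void)sink;
  return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Sweeps a few assumed strides and 10^1..10^6 iterations. */
void run_hadamard_benchmark(hadamard_fn fn) {
  static int16_t src[8 * 64]; /* 8 rows at the largest stride used below */
  static const ptrdiff_t strides[] = { 8, 16, 32, 64 };
  for (int i = 0; i < 8 * 64; ++i) src[i] = (int16_t)(i % 255 - 127);
  for (size_t s = 0; s < sizeof(strides) / sizeof(strides[0]); ++s) {
    for (long n = 10; n <= 1000000; n *= 10) {
      printf("stride %3ld, %7ld runs: %.6f s\n", (long)strides[s], n,
             bench_hadamard(fn, src, strides[s], n));
    }
  }
}
```

Calling run_hadamard_benchmark once with the scalar function and once with the VSX function gives directly comparable timings, since both see the same input matrix and strides.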
Already merged to upstream
Description
This PR adds the Hadamard transform using the VSX instruction set, aiming to optimize this operation for Power8 and resolve issue #6.
Checklist: