lu-zero / libvpx

Local libvpx changes (POWER8 Altivec/VSX support)

BSD 3-Clause "New" or "Revised" License

[CR] Add VSX Hadamard #16

Closed: rafaeldelucena closed this 7 years ago

rafaeldelucena commented 7 years ago

Description

This PR adds the Hadamard transform using the VSX instruction set, aiming to optimize this operation for POWER8 and resolve issue #6.

Checklist:

[x] Add code to Build System
[x] Implement the Transpose 8x8 for signed shorts
[x] Add Unit tests
[x] Implement Hadamard 8x8
[x] Implement Hadamard 16x16
[x] Test for endianness and unaligned addresses issues
[x] Make all tests pass
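
For readers unfamiliar with the transform, a minimal scalar sketch of an 8x8 Hadamard follows. It is an illustration only, not the code from this PR and not libvpx's C reference: the names `wht8` and `hadamard_8x8_ref` are hypothetical, the output uses `int32_t` for simplicity, and the coefficient ordering may differ from what libvpx produces.

```c
#include <stddef.h>
#include <stdint.h>

/* In-place 8-point Walsh-Hadamard butterfly (natural Hadamard ordering). */
static void wht8(int32_t v[8]) {
  for (int len = 1; len < 8; len <<= 1) {
    for (int i = 0; i < 8; i += len << 1) {
      for (int j = i; j < i + len; ++j) {
        const int32_t a = v[j];
        const int32_t b = v[j + len];
        v[j] = a + b;
        v[j + len] = a - b;
      }
    }
  }
}

/* Hypothetical reference: 2-D 8x8 Hadamard of a strided int16 block.
 * Applies the 1-D butterfly to every row, then to every column. */
static void hadamard_8x8_ref(const int16_t *src, ptrdiff_t stride,
                             int32_t out[64]) {
  int32_t tmp[8];

  /* Row pass: read from the strided source, write to the dense output. */
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c) tmp[c] = src[r * stride + c];
    wht8(tmp);
    for (int c = 0; c < 8; ++c) out[r * 8 + c] = tmp[c];
  }

  /* Column pass: transform each column of the intermediate result. */
  for (int c = 0; c < 8; ++c) {
    for (int r = 0; r < 8; ++r) tmp[r] = out[r * 8 + c];
    wht8(tmp);
    for (int r = 0; r < 8; ++r) out[r * 8 + c] = tmp[r];
  }
}
```

A scalar model like this can serve as the expected-value side of a unit test: a block of all ones should yield a single nonzero coefficient of 64 at index 0.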

rafaeldelucena commented 7 years ago

@lu-zero I've successfully tested the implementation on both little-endian and big-endian architectures. I'm also using vec_vsx_ld and vec_vsx_st to load and store data, so I'll mark "Test for endianness and unaligned addresses issues" as checked for now. I think it's ready for review. :)
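
As a rough illustration of the load/store choice mentioned above, the sketch below copies eight 16-bit values between addresses that need not be 16-byte aligned. The helper name `copy8_s16_unaligned` is hypothetical and not part of this PR. The VSX path is only compiled when the compiler defines `__VSX__`; otherwise a `memcpy` fallback is used so the sketch builds on other hosts.

```c
#include <stdint.h>
#include <string.h>

#if defined(__VSX__)
#include <altivec.h>
#endif

/* Hypothetical helper: copy eight int16 values; neither pointer needs to be
 * 16-byte aligned. */
static void copy8_s16_unaligned(const int16_t *src, int16_t *dst) {
#if defined(__VSX__)
  /* vec_vsx_ld / vec_vsx_st tolerate addresses that are not 16-byte aligned. */
  const __vector signed short v = vec_vsx_ld(0, src);
  vec_vsx_st(v, 0, dst);
#else
  /* Portable fallback for non-POWER builds of this sketch. */
  int16_t tmp[8];
  memcpy(tmp, src, sizeof(tmp));
  memcpy(dst, tmp, sizeof(tmp));
#endif
}
```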

rafaeldelucena commented 7 years ago

This is a first version; I haven't applied any further optimizations. If you find it necessary, I can make vpx_transpose_8x8_s16 and vpx_hadamard_8x8_s16_one_pass inline functions. I can also use an aligned buffer and replace the current loads and stores with the simple load and store functions for char arrays, to reduce the permutation overhead of the instructions used.
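
To make the inlining suggestion concrete, here is a scalar stand-in for an 8x8 transpose of signed shorts, declared `static inline` so the compiler is free to fold it into its caller. The name `transpose_8x8_s16_ref` is hypothetical; the PR's vpx_transpose_8x8_s16 is a VSX routine whose body is not shown in this thread.

```c
#include <stdint.h>

/* Hypothetical scalar model of an 8x8 int16 transpose.
 * `in` and `out` are dense row-major 8x8 blocks and must not overlap. */
static inline void transpose_8x8_s16_ref(const int16_t in[64],
                                         int16_t out[64]) {
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c) {
      out[c * 8 + r] = in[r * 8 + c];
    }
  }
}
```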

lu-zero commented 7 years ago

Using the test harness, I'm not seeing any speedup. I guess some more work might be needed.

lu-zero commented 7 years ago

It seems the test harness does not easily double as a benchmark harness. Do you have the alternative benchmark code ready to go into the tree?

rafaeldelucena commented 7 years ago

I have some test code here; I'll push it by the end of the day.

rafaeldelucena commented 7 years ago

The benchmark code is very simple: it runs the Hadamard implementation with multiple strides, from 10^1 to 10^6 times, on the same input matrix.
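
The actual benchmark code is not included in this thread. A minimal sketch of the kind of loop described above might look like the following, assuming a kernel that takes a source pointer, a stride, and an output pointer. The `hadamard_fn` typedef, `time_hadamard`, `sweep_hadamard`, and the `stub_hadamard` placeholder are all hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumed kernel shape; the real libvpx prototype may differ. */
typedef void (*hadamard_fn)(const int16_t *src, ptrdiff_t stride,
                            int16_t *coeff);

/* Placeholder kernel so the sketch compiles on its own; it only counts calls.
 * Swap in the implementation under test when benchmarking. */
static long g_calls = 0;
static void stub_hadamard(const int16_t *src, ptrdiff_t stride,
                          int16_t *coeff) {
  (void)stride;
  coeff[0] = src[0];
  ++g_calls;
}

/* Calls fn `iters` times on the same input; returns elapsed CPU seconds. */
static double time_hadamard(hadamard_fn fn, const int16_t *src,
                            ptrdiff_t stride, int16_t *coeff, long iters) {
  const clock_t start = clock();
  for (long i = 0; i < iters; ++i) fn(src, stride, coeff);
  return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Sweeps 10^1 .. 10^6 iterations for one stride; seconds[k] holds the time
 * for 10^(k+1) iterations. */
static void sweep_hadamard(hadamard_fn fn, const int16_t *src,
                           ptrdiff_t stride, int16_t *coeff,
                           double seconds[6]) {
  long iters = 10;
  for (int k = 0; k < 6; ++k, iters *= 10) {
    seconds[k] = time_hadamard(fn, src, stride, coeff, iters);
  }
}
```

Running the sweep once per stride of interest, for both the C and the VSX kernel, gives comparable timings. `clock()` measures CPU time at coarse resolution, so the small iteration counts are mostly useful as a sanity check.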

rafaeldelucena commented 7 years ago

Already merged upstream.