Open msc4 opened 5 years ago
I'm sure @sasshka appreciates your passive-aggressive comment.
@msc4 some comments from me:
@malvanos: I was ready to prepare a patch when I saw your rejected pull requests. My patch would be just another marginal improvement, I can't target the POWER9, and I guess I won't be allowed to join the team.
Sorry about the passive-aggressiveness. Thought I had a good solution, what a disappointment.
My quant_2x2 patch on x264 list was not a solution. I also rewrote the quant_2x2 function, but I was not able to get any performance gain using POWER9. So, I didn't bother sending another patch.
If you are talking about the zigzag_scan_4x4 discussion (it was related to the bountysource) you have to understand that other people before me (us) worked on this project and made a lot of contributions. So, they have priority.
Moreover, I have other patches pending, however I need to sign the Contributor License Agreement before they merge my code with the master of x264 (the official repository not this one).
If you have a good solution, you are welcome to open a request. If you don't have access to Power9, I can give you instructions how to get it.
Anyway no need to be aggressive :)
This is what I had in mind. If it's useful, feel free to call it your own.
int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias ) { vec_s32_t mfv = vec_splats( mf ); vec_s32_t biasv = vec_splats( bias ); vec_s32_t sh = vec_splats( 8 (int32_t)(long) dct ); vec_s32_t zero = vec_splats( 0 ); vec_s16_t dctv = vec_ld ( 0, dct ); vec_s32_t coef = vec_unpackh( vec_slo( dctv, (vec_u8_t) sh ) ); vec_s32_t sign = (vec_s32_t) vec_cmplt( coef, zero ); vec_s32_t abs_coef = vec_max( vec_sub( zero, coef ), coef ); abs_coef = vec_add( abs_coef, biasv ); abs_coef = vec_mulo( (vec_s16_t) abs_coef, (vec_s16_t) mfv ); abs_coef = vec_sr( abs_coef, (vec_u32_t) vec_splats( 16 ) ); coef = vec_sub( vec_xor( abs_coef, sign ), sign ); vec_s16_t outv = vec_packs( coef, coef ); vec_ste( (vec_s32_t) outv, 0, (int32_t ) dct ); vec_ste( (vec_s32_t) outv, 4, (int32_t *) dct ); return vec_any_ne( coef, zero ); }
@msc4
Checkasm fails with gcc-8 on Power9. ( quant : [FAILED]). I will take a close look next week of your code.
Is that a ppc64le? It worked on a 32 bit big endian box.
Yes ppc64le!
Here is my proposed solution:
int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias )
{
LOAD_ZERO;
vec_s16_t temp1v = vec_vsx_ld(0, dct);
vec_s32_t mf_v = vec_splats( (int32_t)mf );
static vec_s32_t vec16 = vec_splats( (int32_t) 16 );
vec_s32_t bias_v = vec_splats( (int32_t)bias );
vec_s32_t tmp3v = vec_unpackh(temp1v);
vec_s32_t vec_pos = (( bias_v + (tmp3v)) * (mf_v) >> vec16);
vec_s32_t vec_neg = - (( bias_v - (tmp3v)) * (mf_v) >> vec16);
vector bool int isgtzero = vec_cmplt( tmp3v, zero_s32v );
vec_s32_t result_v = vec_sel( vec_pos, vec_neg, isgtzero);
vec_s16_t results_v = (vec_s16_t) vec_u32_to_u16( (vec_u32_t) result_v );
VEC_STORE8( results_v, dct);
return vec_any_ne(result_v, zero_s32v);
}
Can you try it in your machine?
In my machines I have:
After some thinking I believe that this line:
vec_s32_t tmp3v = vec_unpackh(temp1v);
will create issues in big endian machines. @msc4 Can you replace it with vec_unpackl(temp1v) and verify If it works in your machine?
How about this?
int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias ) { vec_u16_t mfv = vec_splats( (uint16_t) mf ); vec_s16_t biasv = vec_splats( (int16_t) bias ); vec_s16_t zero = vec_splats( (int16_t) 0 );
vec_s16_t dctv = vec_xl( 0, dct );
vec_s16_t db = vec_mergeh( biasv, dctv );
vec_s16_t dctv = vec_ld( 0, dct );
vec_u32_t sh = vec_splats( 8 * (uint32_t)(long) dct );
dctv = vec_slo( dctv, (vec_u8_t) sh );
vec_s16_t db = vec_mergeh( dctv, biasv );
dctv = vec_sro( dctv, (vec_u8_t) sh );
vec_s16_t db = vec_mergeh( biasv, dctv );
vec_u32_t sign = (vec_u32_t) vec_cmplt( db, zero );
vec_u16_t abs_db = (vec_u16_t) vec_abs( db );
vec_u32_t pr = vec_msum( abs_db, mfv, sign );
vec_s32_t coef = (vec_s32_t) vec_xor( pr, sign );
coef = vec_sra( coef, (vec_u32_t) vec_splats( 16 ) );
vec_s16_t outv = vec_packs( coef, coef );
vec_ste( (vec_s32_t) outv, 0, (int32_t *) dct );
vec_ste( (vec_s32_t) outv, 4, (int32_t *) dct );
return vec_any_ne( coef, (vec_s32_t) zero );
}
@malvanos Your solution has remarkable ILP. The vec_unpackh() works fine on BE if dct is 16b aligned or vec_slo'ed, unfortunately no VSX here.
OT I examined the other x264de/quant*_altivec functions and they all show some optimization potential:
The QUANT_16_U/_DC macros for quant_4x4_dc, quant_4x4, and quant_8x8 do not exploit the fact that i_qbits == 16, and the msk & one is redundant. This could eliminate 8 instructions per iteration. The macro is executed once in the 4x4 functions, four times in 8x8.
The DEQUANT_SHL macro for dequant_4x4 and dequant_8x8 computes s16 = s16 * s16 eight times using mule/mulo/merge/merge/pack. One vec_mladd would accomplish this. The macro is executed twice in 4x4, eight times in 8x8.
The DEQUANT_SHR macro computes s16 s32 eight times and discards the 16 msb of the result using mule/mulo/sh/add twice. On POWER8/9 two vmuluwm (s32 = s32 s32) would accomplish this. The macro is executed twice in 4x4, eight times in 8x8.
x264_quant_4x4x4_altivec and x264_dequant_4x4_dc_altivec seem to be missing?
@msc4 some points from me:
So I don't think it makes sense to use this code. Have you tried my code after you replace the vec_unpackh with vec_unpackl ?
The vec_unpackh() works fine on BE if dct is 16b aligned or vec_slo'ed, unfortunately no VSX here.
ppccommon.h has the following definition of vec_vsx_ld:
#if !HAVE_VSX
#undef vec_vsx_ld
#define vec_vsx_ld(off, src)\
vec_perm(vec_ld(off, src), vec_ld(off + 15, src), vec_lvsl(off, src))
/* more code */
#endif
So, I believe that by using vec_vsx_ld will work even in non VSX machines.
When submitting code you should keep the same style as the other code. With PPC you should use the ppccommon.h file when possible.
For the rest of the quant functions I believe you are right. Can you create a task or issue, and then a pull request? This issue is focused on quant_2x2_dc_c.
I configured with --disable-vsx, gcc is 8.3.0. The lu-zero/x264/master branch didn't compile:
common/ppc/ppccommon.h:153:17: warning: implicit declaration of function ‘vec_xxpermdi’; did you mean ‘vec_perm’? [-Wimplicit-function-declaration] vec_vsx_st( vec_xxpermdi( v, vec_vsx_ld( 0, p ), 1 ), 0, p ) common/ppc/ppccommon.h:292:34: note: in definition of macro ‘vec_vsx_st’ vec_u8_t _v = (vec_u8_t)(v); \ common/ppc/mc.c:69:9: note: in expansion of macro ‘VEC_STORE8’ VEC_STORE8(src1v, dst); common/ppc/mc.c:69:9: error: AltiVec argument passed to unprototyped function
So I went with an older branch which has a different VEC_STORE8 for vec_u8_t data using vec_perm. I had to modify your code to work with this VEC_STORE8, didn't change anything else. The vec_vsx_ld/st macros work indeed and the vec_unpackh still gets OK from checkasm. When I change it to vec_unpackl checkasm fails. This is a modified checkasm testing quant_2x2_dc twice with dct a multiple of 0 or 8 bytes.
Yes vec_xxpermdi
is only available with VSX. We have to update the header file.
Hi! I couldn't resist a quick look at this function and since you don't seem to appreciate code contributions let me just state it's broken because it assumes dct[] is 16b aligned, which is not the case when it's called from encoder/rdo.c line 660 or encoder/macroblock.c line 1070. checkasm.c does not challenge that assumption and the ISA hides the problem by truncating addresses. Not well optimized either, that's why the C version is faster.