msc4 commented 5 years ago

Hi! I couldn't resist a quick look at this function and since you don't seem to appreciate code contributions let me just state it's broken because it assumes dct[] is 16b aligned, which is not the case when it's called from encoder/rdo.c line 660 or encoder/macroblock.c line 1070. checkasm.c does not challenge that assumption and the ISA hides the problem by truncating addresses. Not well optimized either, that's why the C version is faster.

lu-zero commented 5 years ago

I'm sure @sasshka appreciates your passive-aggressive comment.

malvanos commented 5 years ago

@msc4 some comments from me:

"You don't seem to appreciate code contributions": I didn't see any pull request.
You are saying that the function is not well optimized. You are welcome to submit a patch.
Your profile is new, you joined 26th of July and 28th you open this Issue with aggressive tone. This looks suspicious.

msc4 commented 5 years ago

@malvanos: I was ready to prepare a patch when I saw your rejected pull requests. My patch would be just another marginal improvement, I can't target the POWER9, and I guess I won't be allowed to join the team.

Sorry about the passive-aggressiveness. Thought I had a good solution, what a disappointment.

malvanos commented 5 years ago

My quant_2x2 patch on x264 list was not a solution. I also rewrote the quant_2x2 function, but I was not able to get any performance gain using POWER9. So, I didn't bother sending another patch.

If you are talking about the zigzag_scan_4x4 discussion (it was related to the bountysource) you have to understand that other people before me (us) worked on this project and made a lot of contributions. So, they have priority.

Moreover, I have other patches pending, however I need to sign the Contributor License Agreement before they merge my code with the master of x264 (the official repository not this one).

If you have a good solution, you are welcome to open a request. If you don't have access to Power9, I can give you instructions how to get it.

Anyway no need to be aggressive :)

msc4 commented 5 years ago

This is what I had in mind. If it's useful, feel free to call it your own.

int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias ) { vec_s32_t mfv = vec_splats( mf ); vec_s32_t biasv = vec_splats( bias ); vec_s32_t sh = vec_splats( 8 (int32_t)(long) dct ); vec_s32_t zero = vec_splats( 0 ); vec_s16_t dctv = vec_ld ( 0, dct ); vec_s32_t coef = vec_unpackh( vec_slo( dctv, (vec_u8_t) sh ) ); vec_s32_t sign = (vec_s32_t) vec_cmplt( coef, zero ); vec_s32_t abs_coef = vec_max( vec_sub( zero, coef ), coef ); abs_coef = vec_add( abs_coef, biasv ); abs_coef = vec_mulo( (vec_s16_t) abs_coef, (vec_s16_t) mfv ); abs_coef = vec_sr( abs_coef, (vec_u32_t) vec_splats( 16 ) ); coef = vec_sub( vec_xor( abs_coef, sign ), sign ); vec_s16_t outv = vec_packs( coef, coef ); vec_ste( (vec_s32_t) outv, 0, (int32_t ) dct ); vec_ste( (vec_s32_t) outv, 4, (int32_t *) dct ); return vec_any_ne( coef, zero ); }

malvanos commented 5 years ago

@msc4

Checkasm fails with gcc-8 on Power9. ( quant : [FAILED]). I will take a close look next week of your code.

msc4 commented 5 years ago

Is that a ppc64le? It worked on a 32 bit big endian box.

malvanos commented 5 years ago

Yes ppc64le!

Here is my proposed solution:

int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias ) { LOAD_ZERO;

vec_s16_t temp1v = vec_vsx_ld(0, dct);      
vec_s32_t mf_v = vec_splats( (int32_t)mf );
static vec_s32_t vec16 = vec_splats( (int32_t) 16 );
vec_s32_t bias_v = vec_splats( (int32_t)bias );
vec_s32_t tmp3v = vec_unpackh(temp1v);

vec_s32_t vec_pos =   (( bias_v + (tmp3v)) * (mf_v) >> vec16);
vec_s32_t vec_neg = - (( bias_v - (tmp3v)) * (mf_v) >> vec16);

vector bool int isgtzero = vec_cmplt( tmp3v, zero_s32v );
vec_s32_t result_v = vec_sel( vec_pos, vec_neg, isgtzero);

vec_s16_t results_v = (vec_s16_t) vec_u32_to_u16( (vec_u32_t) result_v );
VEC_STORE8( results_v, dct);

return vec_any_ne(result_v, zero_s32v);

}

Can you try it in your machine?

In my machines I have:

On Power9: quant_2x2_dc_c: 73 quant_2x2_dc_altivec: 66
On Power8: quant_2x2_dc_c: 56 quant_2x2_dc_altivec: 46

malvanos commented 5 years ago

After some thinking I believe that this line:

vec_s32_t tmp3v = vec_unpackh(temp1v);

will create issues in big endian machines. @msc4 Can you replace it with vec_unpackl(temp1v) and verify If it works in your machine?

msc4 commented 5 years ago

How about this?

int x264_quant_2x2_dc_altivec( int16_t dct[4], int mf, int bias ) { vec_u16_t mfv = vec_splats( (uint16_t) mf ); vec_s16_t biasv = vec_splats( (int16_t) bias ); vec_s16_t zero = vec_splats( (int16_t) 0 );

if HAVE_VSX && !defined(WORDS_BIGENDIAN)

vec_s16_t dctv = vec_xl( 0, dct );
vec_s16_t db = vec_mergeh( biasv, dctv );

else

vec_s16_t dctv = vec_ld( 0, dct );
vec_u32_t sh = vec_splats( 8 * (uint32_t)(long) dct );

ifdef WORDS_BIGENDIAN

dctv = vec_slo( dctv, (vec_u8_t) sh );
vec_s16_t db = vec_mergeh( dctv, biasv );

else

dctv = vec_sro( dctv, (vec_u8_t) sh );
vec_s16_t db = vec_mergeh( biasv, dctv );

endif

vec_u32_t sign = (vec_u32_t) vec_cmplt( db, zero );
vec_u16_t abs_db = (vec_u16_t) vec_abs( db );
vec_u32_t pr = vec_msum( abs_db, mfv, sign );
vec_s32_t coef = (vec_s32_t) vec_xor( pr, sign );
coef = vec_sra( coef, (vec_u32_t) vec_splats( 16 ) );
vec_s16_t outv = vec_packs( coef, coef );
vec_ste( (vec_s32_t) outv, 0, (int32_t *) dct );
vec_ste( (vec_s32_t) outv, 4, (int32_t *) dct );
return vec_any_ne( coef, (vec_s32_t) zero );

}

msc4 commented 5 years ago

@malvanos Your solution has remarkable ILP. The vec_unpackh() works fine on BE if dct is 16b aligned or vec_slo'ed, unfortunately no VSX here.

msc4 commented 5 years ago

OT I examined the other x264de/quant*_altivec functions and they all show some optimization potential:

The QUANT_16_U/_DC macros for quant_4x4_dc, quant_4x4, and quant_8x8 do not exploit the fact that i_qbits == 16, and the msk & one is redundant. This could eliminate 8 instructions per iteration. The macro is executed once in the 4x4 functions, four times in 8x8.

The DEQUANT_SHL macro for dequant_4x4 and dequant_8x8 computes s16 = s16 * s16 eight times using mule/mulo/merge/merge/pack. One vec_mladd would accomplish this. The macro is executed twice in 4x4, eight times in 8x8.

The DEQUANT_SHR macro computes s16 s32 eight times and discards the 16 msb of the result using mule/mulo/sh/add twice. On POWER8/9 two vmuluwm (s32 = s32 s32) would accomplish this. The macro is executed twice in 4x4, eight times in 8x8.

x264_quant_4x4x4_altivec and x264_dequant_4x4_dc_altivec seem to be missing?

malvanos commented 5 years ago

@msc4 some points from me:

Your code works, but it is much slower in Power9: quant_2x2_dc_c: 74 quant_2x2_dc_altivec: 90.

So I don't think it makes sense to use this code. Have you tried my code after you replace the vec_unpackh with vec_unpackl ?

The vec_unpackh() works fine on BE if dct is 16b aligned or vec_slo'ed, unfortunately no VSX here.

ppccommon.h has the following definition of vec_vsx_ld:

#if !HAVE_VSX
#undef vec_vsx_ld
#define vec_vsx_ld(off, src)\
     vec_perm(vec_ld(off, src), vec_ld(off + 15, src), vec_lvsl(off, src))
/* more code */
#endif

So, I believe that by using vec_vsx_ld will work even in non VSX machines.

When submitting code you should keep the same style as the other code. With PPC you should use the ppccommon.h file when possible.
For the rest of the quant functions I believe you are right. Can you create a task or issue, and then a pull request? This issue is focused on quant_2x2_dc_c.

msc4 commented 5 years ago

I configured with --disable-vsx, gcc is 8.3.0. The lu-zero/x264/master branch didn't compile:

common/ppc/ppccommon.h:153:17: warning: implicit declaration of function ‘vec_xxpermdi’; did you mean ‘vec_perm’? [-Wimplicit-function-declaration] vec_vsx_st( vec_xxpermdi( v, vec_vsx_ld( 0, p ), 1 ), 0, p ) common/ppc/ppccommon.h:292:34: note: in definition of macro ‘vec_vsx_st’ vec_u8_t _v = (vec_u8_t)(v); \ common/ppc/mc.c:69:9: note: in expansion of macro ‘VEC_STORE8’ VEC_STORE8(src1v, dst); common/ppc/mc.c:69:9: error: AltiVec argument passed to unprototyped function

So I went with an older branch which has a different VEC_STORE8 for vec_u8_t data using vec_perm. I had to modify your code to work with this VEC_STORE8, didn't change anything else. The vec_vsx_ld/st macros work indeed and the vec_unpackh still gets OK from checkasm. When I change it to vec_unpackl checkasm fails. This is a modified checkasm testing quant_2x2_dc twice with dct a multiple of 0 or 8 bytes.

malvanos commented 5 years ago

Yes vec_xxpermdi is only available with VSX. We have to update the header file.

lu-zero / x264

x264_quant_2x2_dc_altivec() fix #15

if HAVE_VSX && !defined(WORDS_BIGENDIAN)

else

ifdef WORDS_BIGENDIAN

else

endif

endif