Voodoo emulation with SSE2 instructions

ghost commented 7 years ago

MAME's Voodoo emulation uses SSE2 instructions. The current code is spread across header files and C++ classes, but the older 2015 version had not yet been fully refactored. To test the performance gain, these SSE2 instructions could be added to the inner loop of the Voodoo emulation in dosbox-x.

For initial testing, the SSE2 code could be ported to TEXTURE_PIPELINE in voodoo_data.h. The corresponding MAME functions are genTexture and combineTexture. With bilinear filtering off, it is enough to work on combineTexture alone.

The main calls are:

    rgbaint_sub(&tmpA, &tmpB);
    rgba_to_rgbaint(&tmpB, (rgb_t) c_local.u);
    rgbaint_add_imm(&tmpB, 1);
    rgba_to_rgbaint(&tmpC, (rgb_t) add_val.u);
    rgbaint_scale_channel_add_and_clamp(&tmpA, &tmpB, &tmpC);
    result = rgbaint_to_rgba(&tmpA);

And the SSE2 implementations, written with gcc intrinsics:

/*-------------------------------------------------
    rgba_to_rgbaint - converts a packed quad of RGB
    components to an rgbaint type
-------------------------------------------------*/

INLINE void rgba_to_rgbaint(rgbaint *rgb, rgb_t color)
{
    *rgb = _mm_unpacklo_epi8(_mm_cvtsi32_si128(color), _mm_setzero_si128());
}

/*-------------------------------------------------
    rgba_comp_to_rgbaint - converts a quad of RGB
    components to an rgbaint type
-------------------------------------------------*/

INLINE void rgba_comp_to_rgbaint(rgbaint *rgb, INT16 a, INT16 r, INT16 g, INT16 b)
{
    *rgb = _mm_set_epi16(0, 0, 0, 0, a, r, g, b);
}

/*-------------------------------------------------
    rgbaint_sub - subtract two rgbaint values
-------------------------------------------------*/

INLINE void rgbaint_sub(rgbaint *color1, const rgbaint *color2)
{
    *color1 = _mm_sub_epi16(*color1, *color2);
}

/*-------------------------------------------------
    rgbaint_add_imm - add immediate INT16 to rgbaint value
-------------------------------------------------*/
INLINE void rgbaint_add_imm(rgbaint *color1, const INT16 imm)
{
    __m128i temp = _mm_set_epi16(0, 0, 0, 0, imm, imm, imm, imm);
    *color1 = _mm_add_epi16(*color1, temp);
}

/*-------------------------------------------------
    rgbaint_scale_immediate_add_and_clamp - scale by
    an immediate, add a second value, and clamp
-------------------------------------------------*/

INLINE void rgbaint_scale_immediate_add_and_clamp(rgbaint *color1, INT16 colorscale, const rgbaint *color2)
{
    // color2 is passed by pointer so that *color2 below is valid
    // color2 will get multiplied by 2^8 (256) and then divided by 2^8 by the shift by 8
    __m128i mscale = _mm_unpacklo_epi16(_mm_set1_epi16(colorscale), _mm_set_epi16(0, 0, 0, 0, 256, 256, 256, 256));
    *color1 = _mm_unpacklo_epi16(*color1, *color2);
    *color1 = _mm_madd_epi16(*color1, mscale);
    *color1 = _mm_srli_epi32(*color1, 8);
    *color1 = _mm_packs_epi32(*color1, *color1);
    *color1 = _mm_min_epi16(*color1, *(__m128i *)&rgbsse_statics.maxbyte);
}

INLINE void rgbaint_scale_channel_add_and_clamp(rgbaint *color1, const rgbaint *colorscale, const rgbaint *color2)
{
    // color2 will get multiplied by 2^8 (256) and then divided by 2^8 by the shift by 8
    __m128i mscale = _mm_unpacklo_epi16(*colorscale, _mm_set_epi16(0, 0, 0, 0, 256, 256, 256, 256));
    *color1 = _mm_unpacklo_epi16(*color1, *color2);
    *color1 = _mm_madd_epi16(*color1, mscale);
    *color1 = _mm_srli_epi32(*color1, 8);
    *color1 = _mm_packs_epi32(*color1, *color1);
    *color1 = _mm_min_epi16(*color1, *(__m128i *)&rgbsse_statics.maxbyte);
}

extern const struct _rgbsse_statics
{
    __m128  dummy_for_alignment;
    INT16   maxbyte[8];
    INT16   scale_table[256][8];
} rgbsse_statics;

These are the lines from TEXTURE_PIPELINE before the SSE2 changes:

    /* do the blend */
    //tr = (tr * (blendr + 1)) >> 8;
    //tg = (tg * (blendg + 1)) >> 8;
    //tb = (tb * (blendb + 1)) >> 8;
    //ta = (ta * (blenda + 1)) >> 8;

    /* clamp */
    //result.rgb.r = (tr < 0) ? 0 : (tr > 0xff) ? 0xff : tr;
    //result.rgb.g = (tg < 0) ? 0 : (tg > 0xff) ? 0xff : tg;
    //result.rgb.b = (tb < 0) ? 0 : (tb > 0xff) ? 0xff : tb;
    //result.rgb.a = (ta < 0) ? 0 : (ta > 0xff) ? 0xff : ta;

The blend uses one SSE2 function for the addition and then another SSE2 function for the multiply, divide (shift right by 8), and clamping of the values.
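
A minimal sketch of how the scalar blend and the SSE2 sequence line up, written to be self-contained rather than taken from MAME or dosbox-x. The helper names blend_channel_scalar and blend_pixel_sse2 are hypothetical, the packed layout is assumed to be 0xAARRGGBB, and the initial subtraction of c_local is left out so that every channel is non-negative:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference for one channel: ((t * (blend + 1)) >> 8) + add, clamped to 0..255.
   Hypothetical helper, modelled on the commented-out TEXTURE_PIPELINE lines. */
static uint8_t blend_channel_scalar(int t, int blend, int add)
{
    int v = ((t * (blend + 1)) >> 8) + add;
    return (uint8_t)(v < 0 ? 0 : v > 0xff ? 0xff : v);
}

/* SSE2 version handling all four channels of a packed pixel at once.
   Assumption: inputs are packed 8-bit channels, so every lane is non-negative. */
static uint32_t blend_pixel_sse2(uint32_t t, uint32_t blend, uint32_t add)
{
    const __m128i zero = _mm_setzero_si128();

    /* widen the four 8-bit channels to 16-bit lanes (as rgba_to_rgbaint does) */
    __m128i vt = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)t), zero);
    __m128i vb = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)blend), zero);
    __m128i va = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)add), zero);

    /* blend + 1 (as rgbaint_add_imm does) */
    vb = _mm_add_epi16(vb, _mm_set_epi16(0, 0, 0, 0, 1, 1, 1, 1));

    /* pair each scale with 256 and each channel with its addend */
    __m128i scale = _mm_unpacklo_epi16(vb, _mm_set1_epi16(256));
    __m128i v = _mm_unpacklo_epi16(vt, va);

    v = _mm_madd_epi16(v, scale);              /* t * (blend + 1) + add * 256 */
    v = _mm_srli_epi32(v, 8);                  /* divide by 256 */
    v = _mm_packs_epi32(v, v);                 /* 32-bit -> 16-bit lanes */
    v = _mm_min_epi16(v, _mm_set1_epi16(255)); /* clamp to at most 255 */

    /* narrow back to four packed bytes (as rgbaint_to_rgba does) */
    return (uint32_t)_mm_cvtsi128_si32(_mm_packus_epi16(v, v));
}
```

For example, blend_pixel_sse2(0x80FF4000, 0xFF7F00FF, 0x10000020) should agree channel by channel with blend_channel_scalar. The comparison only holds for non-negative lanes; the real pipeline subtracts c_local first, which can produce negative values.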

joncampbell123 commented 7 years ago

It is possible. configure.ac needs to provide SSE2 detection, to then provide a #define in config.h to allow the voodoo emulation to use an #if block for optimized SSE2 blending. Make sure your modification includes the right headers to enable SSE2 intrinsics.
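
A rough sketch of what such an #if block could look like. VOODOO_USE_SSE2 is a hypothetical macro name standing in for whatever config.h would end up defining, and here it is simply derived from __SSE2__, which gcc predefines when -msse2 is in effect. The helper clamp_pack4 is also made up for illustration:

```c
#include <stdint.h>

/* VOODOO_USE_SSE2 is a hypothetical name; a real build would get it from config.h. */
#if defined(__SSE2__) && !defined(VOODOO_USE_SSE2)
#define VOODOO_USE_SSE2 1
#endif

#if defined(VOODOO_USE_SSE2)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Clamp four signed 16-bit channel values to 0..255 and pack them into
   one 32-bit word, with ch[0] in the low byte. */
static uint32_t clamp_pack4(const int16_t ch[4])
{
#if defined(VOODOO_USE_SSE2)
    /* load 4 x int16 into the low 64 bits, then pack with unsigned saturation */
    __m128i v = _mm_loadl_epi64((const __m128i *)ch);
    return (uint32_t)_mm_cvtsi128_si32(_mm_packus_epi16(v, v));
#else
    /* portable fallback: same result, one channel at a time */
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        int v = ch[i] < 0 ? 0 : ch[i] > 0xff ? 0xff : ch[i];
        out |= (uint32_t)v << (8 * i);
    }
    return out;
#endif
}
```

Keeping the scalar branch next to the SSE2 branch means the same source still builds on targets without SSE2, and the two paths can be compared against each other.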

ghost commented 7 years ago

Thank you. I've read that SSE2 must be enabled on the configure line, and I think -march= must also be set to an SSE2-capable CPU. The header file is needed as well.

A guide for SSE2 math: https://github.com/JaapSuter/Ten18/blob/master/lib/DirectXTex/XNAMath/xnamath.h

ghost commented 7 years ago

It may be possible to enable and verify auto-vectorization by gcc of the guetzli software. It may require inspecting the machine code and possibly applying restrict and __builtin_assume_aligned: https://locklessinc.com/articles/vectorize/
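
As an illustration of those two hints, here is a small loop written so that gcc has a fair chance of vectorizing it. The function scale_channels is hypothetical and not taken from any of the projects mentioned; restrict is standard C99 and __builtin_assume_aligned is a gcc builtin:

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = (src[i] * scale) >> 8 for n 16-bit values.
   restrict promises the compiler that dst and src do not overlap.
   __builtin_assume_aligned promises 16-byte alignment.
   The caller must guarantee both, otherwise behaviour is undefined. */
static void scale_channels(int16_t *restrict dst, const int16_t *restrict src,
                           int16_t scale, size_t n)
{
    int16_t *d = __builtin_assume_aligned(dst, 16);
    const int16_t *s = __builtin_assume_aligned(src, 16);

    for (size_t i = 0; i < n; i++)
        d[i] = (int16_t)((s[i] * scale) >> 8);
}
```

Building with -O3 -msse2 -fopt-info-vec makes gcc report which loops it vectorized, which is quicker than reading the machine code, although the disassembly remains the final check.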

ghost commented 7 years ago

First try at documenting the SSE2 instructions from the code:

INLINE void rgbaint_scale_channel_add_and_clamp(rgbaint *color1, const rgbaint *colorscale, const rgbaint *color2)
{
    // per channel: min((color1 * colorscale + color2 * 256) >> 8, 255)
    // color2 is multiplied by 2^8 (256) and then divided by 2^8 by the shift by 8, so it is added unscaled

    // interleave the 4 lower 16-bit values of *colorscale (the prior blendr/g/b/a + 1) with the constant 256:
    // mscale = { s0, 256, s1, 256, s2, 256, s3, 256 }
    // mscale is held in a 128-bit integer register
    __m128i mscale = _mm_unpacklo_epi16(*colorscale, _mm_set_epi16(0, 0, 0, 0, 256, 256, 256, 256));

    // interleave the 4 lower 16-bit values of *color1 (from the subtraction with c_local) with those of *color2:
    // *color1 = { c0, d0, c1, d1, c2, d2, c3, d3 }; total 8x 16-bit values
    *color1 = _mm_unpacklo_epi16(*color1, *color2);

    // multiply the 8 pairs of signed 16-bit values and add each adjacent pair of products:
    // r[i] = (c[i] * s[i]) + (d[i] * 256), so results in 4x 32-bit values
    // the 256 factor pre-scales color2 so that the shift below leaves it unchanged
    *color1 = _mm_madd_epi16(*color1, mscale);

    // logical shift right of each of the 4x 32-bit values by 8 (divide by 256)
    *color1 = _mm_srli_epi32(*color1, 8);

    // pack the 4x 32-bit values into 16-bit values with signed saturation; both operands are *color1,
    // so the same 4 results fill the lower and the upper half of the 8x 16-bit output
    *color1 = _mm_packs_epi32(*color1, *color1);

    // per-lane minimum of the 8x 16-bit values in *color1 and the 16-bit values in the maxbyte array (clamp to 255)
    // this should be the maxbyte array: { 255, 255, 255, 255, 255, 255, 255, 255 }
    *color1 = _mm_min_epi16(*color1, *(__m128i *)&rgbsse_statics.maxbyte);
}

I will correct this as needed.
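
To check the reading of the interleave and multiply-add steps, a small standalone sketch can store the four 32-bit sums before the shift. The helper madd_lanes is hypothetical and only mirrors the first three intrinsics of the function above:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: computes out[i] = c[i] * s[i] + d[i] * 256 for four
   lanes, using the same unpack/madd sequence as the function above,
   and stores the 32-bit sums as they are before the shift. */
static void madd_lanes(const int16_t c[4], const int16_t s[4],
                       const int16_t d[4], int32_t out[4])
{
    __m128i vc = _mm_loadl_epi64((const __m128i *)c);
    __m128i vs = _mm_loadl_epi64((const __m128i *)s);
    __m128i vd = _mm_loadl_epi64((const __m128i *)d);

    /* { s0, 256, s1, 256, s2, 256, s3, 256 } */
    __m128i scale = _mm_unpacklo_epi16(vs, _mm_set1_epi16(256));
    /* { c0, d0, c1, d1, c2, d2, c3, d3 } */
    __m128i pairs = _mm_unpacklo_epi16(vc, vd);

    /* multiply lane by lane and add each adjacent pair of products */
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(pairs, scale));
}
```

With c = { 10, -20, 255, 0 }, s = { 256, 128, 1, 256 } and d = { 1, 2, 3, 255 }, the second lane is -20 * 128 + 2 * 256 = -2048. One thing worth checking against the scalar code: _mm_srli_epi32 is a logical shift, so a negative sum like this becomes a large positive value and ends up clamped to 255, whereas the scalar clamp would give 0. _mm_srai_epi32 is the arithmetic counterpart. Whether this matters depends on whether the pipeline actually produces negative sums here.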

I also missed a function earlier:

/*-------------------------------------------------
    rgbaint_to_rgba - converts an rgbaint back to
    a packed quad of RGB values
-------------------------------------------------*/

INLINE rgb_t rgbaint_to_rgba(const rgbaint *color)
{
    return _mm_cvtsi128_si32(_mm_packus_epi16(*color, *color));
}

I also verified that -msse2 on the configure line should be sufficient; -march is not needed in addition. Thank you for the hints! And an example of including an SSEx header file: #include

Finally, some hints for adapting the code for SSE2:

Declare variables at the start of the function:
     //INT32 blendr, blendg, blendb, blenda;
     //INT32 tr, tg, tb, ta;
     UINT8 a_other = c_other.rgb.a;
     UINT8 a_local = c_local.rgb.a;
     rgb_union add_val = c_local;
     UINT8 tmp;
     rgbaint tmpA, tmpB, tmpC;

     rgb_union c_other; // these are in COLORPATH_PIPELINE already
     rgb_union c_local;

Select zero/other for RGB: change
        tr = COTHER.rgb.r; \
        tg = COTHER.rgb.g; \
        tb = COTHER.rgb.b; \
    } \
    else \
        tr = tg = tb = 0; \
to
    c_other.u &= 0xff000000; \

Select zero/other for alpha: change
        ta = COTHER.rgb.a; \
    else \
        ta = 0; \
to
    c_other.u &= 0x00ffffff;
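
The two masks can be tried out in isolation. This sketch assumes the packed value is laid out as 0xAARRGGBB, which is what the masks imply; select_rgb and select_alpha are hypothetical helpers, and zero_other stands for whatever condition the pipeline tests:

```c
#include <stdint.h>

/* Assumption: colours are packed as 0xAARRGGBB, alpha in the top byte. */

/* Zero the RGB channels and keep alpha when zero_other is set. */
static uint32_t select_rgb(uint32_t c_other, int zero_other)
{
    return zero_other ? (c_other & 0xff000000u) : c_other;
}

/* Zero the alpha channel and keep RGB when zero_other is set. */
static uint32_t select_alpha(uint32_t c_other, int zero_other)
{
    return zero_other ? (c_other & 0x00ffffffu) : c_other;
}
```

Working on the packed value replaces three or four per-channel assignments with a single AND, and it leaves the colour in the form that rgba_to_rgbaint expects.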

ghost commented 7 years ago

To mimic the effect of blending at no cost, I reduced the relevant blend and clamp calculations in TEXTURE_PIPELINE to no-ops:

    /* do the blend */                                                          
    tr = tr;                                                
    tg = tg;                                                
    tb = tb;                                                
    ta = ta;                                                

    /* clamp */                                                                 
    RESULT.rgb.r = (UINT8)tr;               
    RESULT.rgb.g = (UINT8)tg;               
    RESULT.rgb.b = (UINT8)tb;               
    RESULT.rgb.a = (UINT8)ta;

Timedemo demo1 from GLQuake1 @ 640x480 with above code (-msse2): 157 seconds. Normal 32-bit build: 158 seconds.

It seems that there is no bottleneck in that section of the inner loop of the Voodoo emulation, although the bottleneck should be somewhere in that emulation since the opengl path runs fast. From looking at TEXTURE_PIPELINE, the cost of the calculations does not appear to be concentrated in any one section either.

I did save some time by removing the mipmapping calculations, since mipmapping was disabled; the gain was perhaps 10% or more. So that section must use a fair amount of CPU time.

joncampbell123 commented 7 years ago

Whoa! box-tortoise, what happened to your account?

joncampbell123 / dosbox-x

Voodoo emulation with SSE2 instructions #250