Unroll some tight loops

fventuri commented 2 months ago

Good idea Howard! I don't know how C++ compilers work these days, but perhaps a block like this:

            *output++ = float(*input++);
            *output++ = float(*input++);
            *output++ = float(*input++);
            *output++ = float(*input++);

could be replaced by something like this:

            const int16_t *in = input + 4 * m;
            float *out = output + 4 * m;
            out[0] = float(in[0]);
            out[1] = float(in[1]);
            out[2] = float(in[2]);
            out[3] = float(in[3]);

I don't know whether, from the compiler's point of view, the second option would make it easier to use SIMD instructions, since there are no output++ and input++ around, but perhaps the compiler figures it out either way since the pattern is idiomatic.
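
For comparison, here is a minimal sketch of the same conversion written as a plain indexed loop with no manual unrolling. The function name convert_indexed is hypothetical and not part of this project. As far as I know, recent GCC and Clang builds will usually auto-vectorize a loop of this shape at -O3, but that depends on the compiler version and target flags, so it is worth checking the generated assembly rather than assuming.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from ExtIO_sddc): converts n int16_t samples
// to float using plain indexing, leaving unrolling and vectorization
// entirely to the compiler.
void convert_indexed(const int16_t* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(in[i]);
    }
}
```

Building this with optimization enabled and comparing the assembly against the unrolled variants would show whether the manual unrolling changes anything for a given compiler.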

There's also the volk library (https://www.libvolk.org/) that has a couple of functions called 'volk_16i_s32f_convert_32f' (https://www.libvolk.org/doxygen/volk_16i_s32f_convert_32f.html) and 'volk_16ic_convert_32fc' (https://www.libvolk.org/doxygen/volk_16ic_convert_32fc.html), which are optimized for different types of hardware, but I am not sure if its license (GPL v3) is compatible with this project's license.

Franco

howard0su commented 2 months ago

I got the idea from CMSIS-DSP. Ideally, hand-written SIMD instructions can do a better job, but they are hard to adapt to different SIMD instruction sets. I will run more experiments to see how much the approaches differ.
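
For reference, the unrolling pattern under discussion can be sketched as a main loop that handles four samples per iteration followed by a tail loop for the remaining 0 to 3 samples. This is only an illustration under my own assumptions: the function name convert_unrolled4 is hypothetical, and the code is not copied from CMSIS-DSP or from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: int16_t to float conversion, manually unrolled by 4.
void convert_unrolled4(const int16_t* input, float* output, std::size_t n) {
    const std::size_t blocks = n / 4;  // number of full groups of 4 samples

    for (std::size_t m = 0; m < blocks; ++m) {
        const int16_t* in = input + 4 * m;
        float* out = output + 4 * m;
        out[0] = float(in[0]);
        out[1] = float(in[1]);
        out[2] = float(in[2]);
        out[3] = float(in[3]);
    }

    // Tail: the last n % 4 samples that do not fill a whole group.
    for (std::size_t i = 4 * blocks; i < n; ++i) {
        output[i] = float(input[i]);
    }
}
```

The tail loop matters whenever the buffer length is not a multiple of 4; without it the last few samples would be left unconverted.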

Howard

cozycactus commented 2 months ago

What do you think about a fast int16_t to float conversion like this one: https://github.com/m-ou-se/floatconv ? When I profiled with perf, it showed that the conversion takes a significant percentage of CPU time.
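
As one illustration of what a software int16_t to float conversion can look like, here is a sketch of a well-known bit trick. To be clear, this is my own example and I am not claiming it is what floatconv does. It assumes float is IEEE 754 binary32, and the name i16_to_float_bits is hypothetical. Whether it beats the hardware conversion instruction on a given CPU would have to be measured.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts int16_t to float without an
// integer-to-float instruction. Assumes IEEE 754 binary32 floats.
inline float i16_to_float_bits(int16_t x) {
    // Flip the sign bit so the value is biased into 0..65535.
    const uint32_t biased = static_cast<uint16_t>(x) ^ 0x8000u;
    // 0x4B000000 is the bit pattern of 2^23. At that magnitude one unit
    // in the last place is exactly 1, so OR-ing the biased value into
    // the low mantissa bits yields the float 2^23 + biased.
    const uint32_t bits = 0x4B000000u | biased;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    // Remove both offsets: 2^23 + 32768 = 8421376.
    return f - 8421376.0f;
}
```

The subtraction is exact because both operands and the result are integers below 2^24, so the output should match a regular cast for every int16_t input.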

cozycactus commented 2 months ago

https://blog.m-ou.se/floats/

howard0su commented 2 months ago

This is an interesting blog post. I will look into it and port i16_to_float over.

Howard

ik1xpv / ExtIO_sddc

Unroll some tight loops #235