No SIMD converters - Githubissues

ast commented 5 years ago

There don't seem to be any SIMD (vectorized) format converters in SoapySDR at the moment, but drivers are able to create and register them using SoapySDR::ConverterRegistry::VECTORIZED.

It's a huge undertaking writing format converters for many platforms. Perhaps using libvolk as an optional dependency would be an easy way to significantly improve performance?

guruofquality commented 5 years ago

There don't seem to be any SIMD (vectorized) format converters in SoapySDR at the moment, but drivers are able to create and register them using SoapySDR::ConverterRegistry::VECTORIZED

It was the intention to be able to extend this later and have converter priorities like this.

It's a huge undertaking writing format converters for many platforms. Perhaps using libvolk as an optional dependency would be an easy way to significantly improve performance?

It's definitely a lot of work, but it's not nearly as huge as volk, and it depends how complete you need to be. The converters are really just a small set of conversions between float/uint16/uint8, and maybe a few for un/packing interleaved data. And that would probably be some SSE variant, probably AVX, and NEON. There's a lot of overlap amongst those x86-based implementations, and it's not clear that there's any difference between SSE variants with the same register size. So it's not something that has to always grow and be constantly maintained in any case. And I don't think the dynamic architecture selection really has to be done at all. A lot of these applications using SIMD just assume that if you are on x86, you have at least SSE2. Otherwise your processor is old or weird or virtualized and you should just rebuild with custom flags. :-P
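For reference, a minimal sketch of what one of these converters looks like in scalar form, here CS16 -> CF32. The function name is hypothetical, and the callback signature (source buffer, destination buffer, element count, scale factor) is an assumption modelled on how SoapySDR converter functions are usually described, so check it against ConverterRegistry.hpp before relying on it.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar CS16 -> CF32: interleaved I/Q int16 samples to interleaved I/Q floats.
// numElems counts complex samples, so 2 * numElems values are converted.
// Name and signature are illustrative assumptions, not copied from SoapySDR.
void convertCS16toCF32(const void *srcBuff, void *dstBuff,
                       const std::size_t numElems, const double scaler)
{
    const std::int16_t *src = static_cast<const std::int16_t *>(srcBuff);
    float *dst = static_cast<float *>(dstBuff);

    // Fold the 1/32768 normalisation and the caller's scale into one float
    // factor, computed once outside the loop.
    const float scale = static_cast<float>(scaler / 32768.0);

    for (std::size_t i = 0; i < 2 * numElems; i++)
    {
        dst[i] = static_cast<float>(src[i]) * scale;
    }
}
```

A SIMD variant would keep the same signature and only replace the loop body, which is why the set of kernels stays small.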

GCC got a lot better at offering utilities to check processor capabilities like this at runtime, so you can avoid crashing and burning: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html
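A small sketch of that runtime check, using the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins documented on the linked page. The function name `bestSimdLevel` and the returned strings are made up for illustration; on non-x86 targets or other compilers the sketch simply reports "generic".

```cpp
#include <string>

// Report the widest x86 SIMD level the running CPU supports.
// bestSimdLevel is a hypothetical helper, not part of SoapySDR or volk.
std::string bestSimdLevel()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // Initialise the CPU feature data; required when the check may run
    // before normal program start-up (e.g. from a static constructor).
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx")) return "avx";
    if (__builtin_cpu_supports("sse2")) return "sse2";
#endif
    // Unknown compiler, non-x86 target, or a CPU without SSE2.
    return "generic";
}
```

A loader could call this once and pick which converter to register, rather than assuming SSE2 at build time.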

volk is an option. It was the stated intention of the original contributors and mine as well to license volk as LGPL, but somehow the gr-leadership botched that. It sucks because we ended up just duplicating a bunch of effort for uhd stream converters at the time anyway. Maybe I should have tried harder.

-- Anyway it just means that a soapy support module providing simd converters around volk has to be a separate project, not an optional build component to avoid any license ambiguity. And it would be less useful for those commercial folks because of the licensing.

If someone volunteers a volk support module, that's great, I would link to it, package it, build it. But if it's going to take effort on my part to maintain, I would rather just get started on the first option and implement a few highly-desired converters.

Possibly, we could use some of the SIMD from liquiddsp as well.

It's like the first time anyone mentioned it, and some drivers like airspy or limesdr for example roll their own converters internally anyway. So I thought it was a lower priority thing. But what sort of converters are you looking for? If your application could take advantage of them right now, which formats would it be, and which architecture/OS?

ast commented 5 years ago

Well I wrote my own float32 <-> int32 and float32 <-> int16 for neon because they were missing in volk and I also needed them for my custom SDR soapy driver. Got about 2-4x speedup.
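To give an idea of the shape of such a kernel, here is a rough float32 -> int16 sketch with a NEON path and a scalar fallback. It is not the code from the pull request mentioned in this thread; the function name and the choice to truncate and saturate are assumptions made for the example. NaN inputs are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// F32 -> S16 with saturation. Uses NEON when the target supports it and
// plain scalar code otherwise. Illustrative only.
void convertF32toS16(const float *src, std::int16_t *dst, std::size_t n, float scale)
{
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8)
    {
        // Load and scale two vectors of four floats each.
        const float32x4_t lo = vmulq_n_f32(vld1q_f32(src + i), scale);
        const float32x4_t hi = vmulq_n_f32(vld1q_f32(src + i + 4), scale);
        // Convert to int32 (truncating), then narrow to int16 with saturation.
        const int16x4_t loS16 = vqmovn_s32(vcvtq_s32_f32(lo));
        const int16x4_t hiS16 = vqmovn_s32(vcvtq_s32_f32(hi));
        vst1q_s16(dst + i, vcombine_s16(loS16, hiS16));
    }
#endif
    // Scalar tail, and the whole loop on non-NEON builds: clamp, then truncate.
    for (; i < n; i++)
    {
        float v = src[i] * scale;
        if (v > 32767.0f) v = 32767.0f;
        if (v < -32768.0f) v = -32768.0f;
        dst[i] = static_cast<std::int16_t>(v);
    }
}
```

Calling it with a scale of 32768.0f maps floats in [-1.0, 1.0) onto the full int16 range.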

Sent pull request to volk but so far they have been completely ignored which is a shame. Not sure what's going on there....

I'd be happy to contribute them to Soapy.

Don't know SSE or AVX though..

ast commented 5 years ago

I guess the big advantage would be cleaner drivers that potentially could perform well on both ARM and x86 with minimal effort.

guruofquality commented 5 years ago

Yea, cool sounds good. Can you come up with a pull request with -- NeonConverters.cpp, we can figure out the flags or API hooks as needed.

I guess the big advantage would be cleaner drivers that potentially could perform well on both ARM and x86 with minimal effort.

SIMD would be a big encouraging factor to start using the converters API. And good momentum for developing basic unit tests for the converters, and filling in a few super common ones with SSE/AVX or something.

ncorgan commented 4 years ago

https://github.com/ncorgan/SoapyVOLKConverters

This is barebones, no testing yet, but it wraps all of <volk/volk*_convert*.h>. It should be simple to package and maintain once it's pulled into the organization.

ncorgan commented 4 years ago

The VOLK converter module I mentioned above should do the job. Included in the build is a benchmark utility whose output I included below. Keep in mind that the raw times below are on a crappy 7-year-old laptop. Pay more attention to the ratios between the times.

SoapyVOLKConverters e72e310
SoapySDR            0.8.0-ga489f3dc
VOLK                2.0

Stats:
 * Buffer size:  16384
 * # iterations: 10000
 * Scalar ratio: 10

CS16 -> CF32
Generic:    33.384 us (MAD = 2.025 us)
Vectorized: 15.295 us (MAD = 1.118 us)
2.18267x faster

S16 -> S8
Generic:    22.978 us (MAD = 0.978 us)
Vectorized: 3.492 us (MAD = 0.279 us)
6.58018x faster

CF32 -> CS16
Generic:    71.518 us (MAD = 2.514 us)
Vectorized: 14.527 us (MAD = 0.769 us)
4.92311x faster

S8 -> S16
Generic:    19.556 us (MAD = 0.559 us)
Vectorized: 3.353 us (MAD = 0.071 us)
5.83239x faster

S16 -> F32
Generic:    18.996 us (MAD = 0.767 us)
Vectorized: 5.518 us (MAD = 0.278 us)
3.44255x faster

F32 -> S16
Generic:    37.924 us (MAD = 1.607 us)
Vectorized: 25.771 us (MAD = 1.536 us)
1.47158x faster

F32 -> S8
Generic:    37.016 us (MAD = 1.327 us)
Vectorized: 26.051 us (MAD = 1.117 us)
1.42091x faster

S8 -> F32
Generic:    19.067 us (MAD = 0.768 us)
Vectorized: 4.959 us (MAD = 0.139 us)
3.84493x faster

F32 -> F64
Vectorized: 13.968 us (MAD = 0.977 us)

F64 -> F32
Vectorized: 14.667 us (MAD = 2.654 us)

F32 -> S32
Vectorized: 12.92 us (MAD = 1.396 us)

S32 -> F32
Vectorized: 8.73 us (MAD = 2.863 us)
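
For readers unfamiliar with the statistics in this output: assuming "MAD" here stands for median absolute deviation (the benchmark source would need to be checked to confirm), the reported figures can be reproduced from a list of per-iteration timings roughly like this. The helper names are made up for the sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a set of samples. The vector is taken by value because
// nth_element reorders it. Returns 0 for an empty input.
double median(std::vector<double> v)
{
    if (v.empty()) return 0.0;
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double m = v[mid];
    if (v.size() % 2 == 0)
    {
        // Even count: average the two middle values. After nth_element,
        // everything before mid is <= v[mid], so the max is the lower middle.
        const double lower = *std::max_element(v.begin(), v.begin() + mid);
        m = (m + lower) / 2.0;
    }
    return m;
}

// Median absolute deviation: the median of |x - median(x)|.
double medianAbsDev(const std::vector<double> &v)
{
    const double med = median(v);
    std::vector<double> dev;
    dev.reserve(v.size());
    for (const double x : v) dev.push_back(std::fabs(x - med));
    return median(dev);
}
```

Both statistics are robust to the occasional slow iteration, which matters on a laptop where background load produces outliers.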

zuckschwerdt commented 4 years ago

Those are interesting results. I use "naive" format conversions in some of my SDR stuff and e.g. also in SoapyPlutoSDR. I only really examined that on x86_64, but both GCC and Clang (C, not C++) already emit vectorized code. Can you examine the assembler code in your build to see if there is really no vectorization or if it's just laid out badly? (The S8/S16 result hints that the compiler really didn't vectorize.) Otherwise your results indicate this can still be optimized significantly. I need to look at e.g. the "plain" CS16 -> CF32 and the Volk tricks now to see where that 2x difference comes from.

zuckschwerdt commented 4 years ago

To clarify e.g. the volk_16i_convert_8i_u_sse2 looks mostly like volk_16i_convert_8i_generic on -O3. (For a quick check try the Compiler Explorer)

ncorgan commented 4 years ago

Sorry for the delay on my part. I’m running into an issue getting the machine up and running that will actually have decent and representative benchmarks. I’ll check what Volk machine resulted in those times to see what happened there.

ncorgan commented 4 years ago

On an actually performant machine, this is what I get. The trends are similar, just with a greater difference. It does appear to be the case for S16 -> S8 that the generic implementation wasn't vectorized, as you said.

As for CS16 -> CF32, here's the Godbolt comparison: https://godbolt.org/z/kq8LQA


SoapyVOLKConverters 88f32d6
SoapySDR            0.8.0-gf722f9ce
VOLK                2.1

Stats:
 * Buffer size:  16384
 * # iterations: 10000
 * Scalar ratio: 10

CS16 -> CF32
Generic:    13.978 us (MAD = 0.019 us)
Vectorized: 2.606 us (MAD = 0.037 us)
Machine:    u_avx2
5.36378x faster

S16 -> S8
Generic:    11.568 us (MAD = 0.015 us)
Vectorized: 0.505 us (MAD = 0.007 us)
Machine:    a_avx2
22.9069x faster

CF32 -> CS16
Generic:    30.751 us (MAD = 0.467 us)
Vectorized: 2.605 us (MAD = 0.008 us)
Machine:    a_avx2
11.8046x faster

S8 -> S16
Generic:    8.823 us (MAD = 0.018 us)
Vectorized: 0.45 us (MAD = 0.007 us)
Machine:    generic
19.6067x faster

S16 -> F32
Generic:    6.998 us (MAD = 0.014 us)
Vectorized: 1.161 us (MAD = 0.017 us)
Machine:    a_avx2
6.02756x faster

F32 -> S16
Generic:    15.394 us (MAD = 0.333 us)
Vectorized: 8.977 us (MAD = 0.303 us)
Machine:    u_avx
1.71483x faster

F32 -> S8
Generic:    15.641 us (MAD = 0.324 us)
Vectorized: 8.93 us (MAD = 0.299 us)
Machine:    a_avx2
1.75151x faster

S8 -> F32
Generic:    6.754 us (MAD = 0.021 us)
Vectorized: 1.039 us (MAD = 0.012 us)
Machine:    generic
6.50048x faster

F32 -> F64
Vectorized: 2.811 us (MAD = 0.03 us)
Machine:    u_avx

F64 -> F32
Vectorized: 1.924 us (MAD = 0.011 us)
Machine:    a_avx

F32 -> S32
Vectorized: 3.823 us (MAD = 0.172 us)
Machine:    u_avx

S32 -> F32
Vectorized: 1.256 us (MAD = 0.014 us)
Machine:    a_avx2

ncorgan commented 4 years ago

@guruofquality Putting aside this analysis, once this (https://github.com/ncorgan/SoapyVOLKConverters) is deemed portable enough, would you be willing to host and package it? This should meet the needs of this issue, unfortunate GPLv3 aside.

In terms of MSVC testing, I don't have a system up and running that can test it against an MSVC-built VOLK, so I'd need assistance on that.

guruofquality commented 4 years ago

@guruofquality Putting aside this analysis, once this (https://github.com/ncorgan/SoapyVOLKConverters) is deemed portable enough, would you be willing to host and package it? This should meet the needs of this issue, unfortunate GPLv3 aside.

Yea I'll package it, not a problem. A few other projects that get packaged are also GPL depending upon their constituent libraries.

Everyone involved in volk's inception, including me planned on using LGPL. I don't know what happened, sorry man. I should have tried harder at the time.

In terms of MSVC testing, I don't have a system up and running that can test it against an MSVC-built VOLK, so I'd need assistance on that.

I will put it in PothosSDR and see what happens

Do you want me to host the project in this organization first? I gotta put the URLs in the build scripts.

ncorgan commented 4 years ago

Sure, let’s move it over. The only real TODO at this point is making the test and benchmark portable. Currently, they assume there’s a .so next to the executable.

zuckschwerdt commented 4 years ago

I did some comparison tests with naive code vs Volk now and got two details to watch so things don't slow down badly. Volk is consistently faster still, ballpark 50%.

In F32 conversions, double constants for scaling will promote the whole operation to double. E.g. const double scaler will impact dst[i] = float(src[i]) * scaler;. We should perhaps truncate to a float scale where the precision isn't needed.
In U/S conversions, a bitshift doesn't mix well with an offset, as FMA won't be used. Looks good in SoapySDR already.
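
The first point can be illustrated with two versions of the same S16 -> F32 loop. This is a sketch with hypothetical function names, not the SoapySDR source; whether the second form is actually faster depends on the compiler and flags.

```cpp
#include <cstddef>
#include <cstdint>

// Double-precision scale: float * double is evaluated in double, so every
// sample is widened, multiplied, and narrowed back to float.
void s16ToF32DoubleScale(const std::int16_t *src, float *dst,
                         std::size_t n, const double scaler)
{
    for (std::size_t i = 0; i < n; i++)
        dst[i] = static_cast<float>(float(src[i]) * scaler);
}

// Single-precision scale: narrow the constant once, outside the loop, so
// the multiply stays in float and more lanes fit in each SIMD register.
void s16ToF32FloatScale(const std::int16_t *src, float *dst,
                        std::size_t n, const double scaler)
{
    const float scale = static_cast<float>(scaler);
    for (std::size_t i = 0; i < n; i++)
        dst[i] = float(src[i]) * scale;
}
```

For power-of-two scales such as 1/32768 the two versions give identical results; for other scales the float version may differ in the last bit.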

ncorgan commented 4 years ago

Thanks for taking a look. To be clear, you're saying you observed VOLK being 50% faster, even with the concerns you listed?

zuckschwerdt commented 4 years ago

I made some minimal code to explore how well GCC and Clang vectorize (-O3) naive loops. There are some pitfalls. In addition to those two above, using pointers wouldn't vectorize (as opposed to array access). I still need to run this on more platforms and compilers, but I'd say that AVX is a ballpark 50% improvement only available with Volk. In summary I'd say that simple generic loops work ok if we optimize with -O3 and are a little bit smart. But Volk is more flexible with the runtime choice of kernels and noticeably faster.
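
To make the pointer versus array remark concrete, here are the two loop shapes for an S16 -> S8 conversion (keep the high byte). The function names are made up for this sketch, and it makes no claim about which form a given compiler vectorizes; comparing the assembly of both at -O3 is the way to find out.

```cpp
#include <cstddef>
#include <cstdint>

// Array-indexed form: a counted loop with a simple induction variable.
void s16ToS8Indexed(const std::int16_t *src, std::int8_t *dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        dst[i] = static_cast<std::int8_t>(src[i] >> 8);
}

// Pointer-walking form: same result, but the trip count is implicit in the
// pointer comparison and both pointers are modified inside the loop.
void s16ToS8Pointer(const std::int16_t *src, std::int8_t *dst, std::size_t n)
{
    const std::int16_t *end = src + n;
    while (src != end)
        *dst++ = static_cast<std::int8_t>(*src++ >> 8);
}
```

Both rely on the right shift of a negative value being arithmetic, which is what mainstream compilers do and what C++20 guarantees.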

ncorgan commented 4 years ago

@guruofquality Per our discussion on moving stuff over, I've transferred ownership of https://github.com/ncorgan/SoapyVOLKConverters to you so you can move it into PothosWare.

nickoe commented 3 years ago

@ast

Sent pull request to volk but so far they have been completely ignored which is a shame. Not sure what's going on there....

Where is your pull request?

pothosware / SoapySDR

No SIMD converters #203