parallella / pal

An optimized C library for math, parallel processing and data movement
Apache License 2.0
302 stars 110 forks source link

dsp: p_firsym: API issue and implementation #158

Open lfochamon opened 9 years ago

lfochamon commented 9 years ago

Although I think having a separate implementation specific for filters with symmetric coefficients is a great idea, there is the issue of dealing with the different types of symmetric FIR filters. For reference:

The way the function description is written today, I guess it would imply a type II filter. The implementation of all these FIR are very similar, but it should be defined whether it would make more sense to have (a) a single function that support all types of filters or (b) different functions.

For (a), the function interface needs to be updated to either pass symmetric/anti-symmetric and odd/even length parameters or the filter type. Option (b) would mean less branching and more straightforward code, although it could increase code footprint if someone used all 4 kinds of filters in a project (something really unlikely in practice).

I've written the code, but I guess I'll wait for a consensus before submitting a pull request.

aolofsson commented 9 years ago

Take a look at the "firs" function from TI and let me know what you think. Some of the usage restrictions in that function seem reasonable.

http://www.ti.com/lit/ug/spru422j/spru422j.pdf

lfochamon commented 9 years ago

OK, so firs constraints the filter to be type II. It seems a little restrictive, given the forced zero at -1. Personally, if I were to choose only one symmetric FIR to implement, it would be the type I. It does not constrain the transfer function and the implementation cost is virtually the same (there is 1 additional MAC). But I guess this is a designer judgement call.

Another possible compromise solution would be to make nh the full length of the filter to let the function know which type (I or II) the filter actually is. It alleviates the API at the cost of a little more testing inside the function. However, I would lean towards eliminating all checks inside these functions since filters usually sit inside the main loop.

TL;DR: make 4 functions (fir1, fir2, fir3, and fir4) or say p_firsym is for type I filters and nh = (L+1)/2+1, for a filter of length L odd. [my 2 cents]

BTW, now I get the idea behind passing *dbuf (or for the current p_fir, *dl). Although I guess all the arrays should be in local memory, otherwise it's gonna lead to more RAM accesses.

aolofsson commented 9 years ago

Is there even a point? I know symmetrical filters are good for ASICs and FPGAs, but in most microprocessors, if you have one op per clock cycle and (MUL, ADD, or MAC), then there might be a big gain in doing the symmetrical fir versus the standard fir?

Is the function essential?

lfochamon commented 9 years ago

Hmmm... That's true. I get stuck on an FPGA thinking, didn't even thought of that. But I've written function the p_firsym way for DSPs and I've seen them on libraries so that got me wondering.

Since it's the only way to really, I just coded p_firsym (with type II like TI's library, that doesn't matter for now) and tested it againt p_fir for differen nx and nh on my PC and the BeagleBone Black. I only ran tests with nh <= nx. I timed 1e7 runs on the PC and 1e6 runs on the BBB. The results are presented in the plots below.

On the PC, both function are roughly the same for small nh (no idea what happened with that oulier in the middle). For large nh, p_firsym does get run around 200 ns faster than p_fir.

On the ARM core, p_firsym is much faster than p_fir. Note that the functions run around 10x slower on the BBB, so the relative gains are more or less the same. I stopped the test in the middle because it was taking forever and the trend was already clear.

I cannot explain why, but aparrently p_firsym is worth implementing (as soon as the type issue is decided upon).

i7

bench_fir_firsym_i7 bench_fir_firsym_hist_i7

AM3558

bench_fir_firsym_bbb

bench_fir_firsym_hist_bbb

aolofsson commented 9 years ago

Great data. Not sure I completely understand the diff between type 1 and type 2 and why the TI on eis type2?

Are you ok with the restrictions on the number coefficients being 3?

lfochamon commented 9 years ago

I should've given an example from the start!

I think TI implemented type 2 because it is easier code. I lean towards type 1 because it has no restriction. For the example above, the function parameters would be: h = {1,2,3,4,5} and nh = 5. Then p_firsym would do its magic.

At this moment, the the code is written so that there are no restrictions on nh. For small nh, however, you might be better off using p_fir directly. Depends on the platform, though...