Performance optimizations for BeagleBone Black/ARM Neon

ArcEye commented 6 years ago

Issue by machinekoder Sun May 10 17:11:28 2015 Originally opened as https://github.com/machinekit/machinekit/issues/629

Found some additional compiler options and also specific math functions for the ARM Neon fpu. It may not improve performance dramatically, but it does not cost much either: http://www.eliteraspberries.com/blog/2013/09/cflags-for-numerical-computing-on-the-beaglebone-black.html

and if https://github.com/machinekit/machinekit/issues/412 is related to a bug in the GCC math functions this may help too: http://gruntthepeon.free.fr/ssemath/neon_mathfun.html

ArcEye commented 6 years ago

Comment by RunningLight Mon May 11 17:56:03 2015

Regarding neon_mathfun_test, see https://gist.github.com/RunningLight/91246d5d1cc224cd008f

gcc 4.6.3 didn't like the -ffast-math flag so I left it off.

ArcEye commented 6 years ago

Comment by machinekoder Mon May 11 18:48:37 2015

Wow, that's a factor of 10 compared to the gcc math functions. Definitely worth trying.

ArcEye commented 6 years ago

Comment by RunningLight Mon May 11 22:23:17 2015

I agree. At the same time, I'm aware that the trajectory planner uses the posemath library and I haven't looked yet to see if there's any problem related to it.

ArcEye commented 6 years ago

Comment by mhaberler Tue May 12 04:05:59 2015

well, since there is some evidence that the delay hike is located in cos() alone, inspecting cos() for spikes would be my first priority. By analogy, the other transcendental functions, as far as they are used in the code base, warrant a look as well

taking a step back, I guess issues like this one are likely to reappear, so the question arises - what is the best strategy to deal with libm?

1. rely on compiler options only (might be hairy when switching compilers, e.g. to clang)
2. wrap libm functions in static inlines, which could be platform and/or library dependent

NB linuxcnc math support has seen some work which might be useful to adopt

ArcEye commented 6 years ago

Comment by machinekoder Tue May 12 07:43:34 2015

Option 3: use our own math library (e.g. cephes, as used in the mathfun link). This might cost performance on some platforms, but it is way more deterministic than the glibc math functions since it does the same on all platforms.

ArcEye commented 6 years ago

Comment by mhaberler Tue May 12 07:51:53 2015

yes, forgot that option (in fact for kthreads flavors we do already have our own math library, so it would make sense to inspect the header magic which makes that happen and build on it)

now.. an evidence-based approach to this issue would be to write up pre-configure-time tests which automatically select the best option for a platform

this test would not be part of configure since this makes no sense when cross-building, but rather a test program which suggests the best possible configure options for a given platform - one would have to run that on target

this could well be a separate repo/project

ArcEye commented 6 years ago

Comment by machinekoder Wed May 13 20:35:30 2015

@RunningLight Tried on my BBB and got slightly better results. Using the latest gcc on Debian Wheezy. The cortex-a9 option has to be replaced by cortex-a8 for the BBB.

The cos_ps functions calculate 4 values at the same time, so to take full advantage of them one would need to modify the code accordingly. An SSE/x86 version is also available. The cephes functions, however, also perform slightly better than the glibc functions, and they would require no additional work to be used. I tested them with the TP and the RT delay did not happen anymore. @mhaberler Maybe we should choose option 3?

ArcEye commented 6 years ago

Comment by RunningLight Wed May 13 21:01:54 2015

@strahlex I should have said I changed the flag.

I haven't taken on rtapi_get_clocks() yet. Too much end-of-school activity with grandkids:)

However, after some rather confusing results with a test styled on the quick test you posted, which iterated over a large domain (-100., +100) of input values, I cobbled up a test where, for a single fixed value, I count the number of times cos(value) can be computed between those 10000us 'ticks' of clock(). I don't feel confident enough in the weird outcome to share yet. For a start, I gather 10000 such counts. Roughly, the difference between the maximum and minimum counts on the i5 is about 4:1; on the BBB, 1000:1. In the case of the BBB, the lowest number of counts corresponds to cos() costing on the order of 1ms. I want to check my work before I post plots of the distributions.

All that aside, I think it very reasonable to choose option 3 rather than deal with the glibc functions.

ArcEye commented 6 years ago

Comment by mhaberler Tue May 19 12:19:21 2015

fyi @strahlex and I are planning to link math functions as needed into the userland rtapi RT module (rtapi_main.c ff), and likely into the ulapi library as well, so both sides use the same math library

ArcEye commented 6 years ago

Comment by machinekoder Tue May 26 07:53:44 2015

The rtapi_math patch is here https://github.com/machinekit/machinekit/pull/652

ArcEye commented 6 years ago

Comment by mhaberler Wed Mar 16 08:21:07 2016

I assume this can be closed?

machinekit / machinekit-hal

Performance optimizations for BeagleBone Black/ARM Neon #134