Closed alaingdl closed 6 months ago
I confirm that -fsanitize=address makes gdl 100 times faster for code related to memory transfer (copy from variable to variable) on a Mac mini with M1.
The code to be tested is simple:
GDL> tic & for i=1L,600000 do a=1 & toc
% Time elapsed : 4.5299740 seconds.
The same loop takes 0.057317972 seconds on my Intel Linux laptop (gcc compiler, no Eigen). As
GDL> tic & for i=1L,600000 do a=a & toc
% Time elapsed : 0.019397974 seconds.
is internally optimised to do nothing (a=a !!!), 0.019397974 seconds measures the empty loop speed, which is OK.
This restricts the area of the problem to a very tiny number of code lines, essentially what happens in "a=1".
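To put the two "a=1" timings on a per-assignment footing, here is a small sketch that divides a total elapsed time by the iteration count. The helper name PerIterationNs is made up for this illustration and is not part of GDL.

```cpp
#include <cassert>

// Hypothetical helper (not from the GDL sources): average cost of one
// loop iteration in nanoseconds, given the total time reported by toc.
inline double PerIterationNs(double totalSeconds, long iterations) {
    assert(iterations > 0);  // a zero-length loop has no per-iteration cost
    return totalSeconds / static_cast<double>(iterations) * 1e9;
}
```

With the figures quoted above, PerIterationNs(4.5299740, 600000) comes to roughly 7550 ns per assignment on the M1, against roughly 96 ns for PerIterationNs(0.057317972, 600000) on the Intel Linux laptop.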
@alaingdl the multiply defined symbols have already been encountered (#677, #734), and should indeed be avoided. However, there have always been compiler options to circumvent that problem, which arises only on a limited number of platforms.
CULPRIT FOUND!!!
On OSX, for obscure historical reasons, and given that the system defines HAVE_MALLOC_ZONE_STATISTICS and HAVE_MALLOC_MALLOC_H, the innermost code for destruction of variables would call the obscure UpdateCurrent() function to report precise memory usage. The loss of time is tremendous, and would have shown up in a profiler as an enormous number of calls to strange functions like malloc_zone_statistics() etc.
Making UpdateCurrent() just return solves the speed problem; time_test4 drops to 1 sec.
Just committed the one-liner that is supposed to do wonders.
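For readers who have not looked at that code path, the pattern described above can be sketched roughly as follows. This is not the GDL source: only the name UpdateCurrent() comes from this thread, while every other type, variable and function here is a hypothetical stand-in, and the allocator query is replaced by a simple call counter so the sketch runs anywhere.

```cpp
#include <cstddef>

// Hypothetical stand-in for an expensive allocator query such as
// malloc_zone_statistics(); here it only counts how often it is called.
static std::size_t g_allocatorQueries = 0;
static std::size_t FakeBytesInUse() {
    ++g_allocatorQueries;
    return 0;
}

// Hypothetical memory-statistics holder (assumed layout, not GDL's).
static std::size_t g_currentBytes = 0;
static bool g_trackEveryFree = true;  // true models the old OSX behaviour

// Refresh the cached memory figure by asking the allocator.
static void UpdateCurrent() {
    if (!g_trackEveryFree) return;    // models the "just return" fix
    g_currentBytes = FakeBytesInUse();
}

// Hypothetical variable type whose destructor triggers the bookkeeping.
struct VarSketch {
    ~VarSketch() { UpdateCurrent(); }
};

// Create and destroy n variables; return how many allocator queries resulted.
static std::size_t QueriesFor(std::size_t n, bool trackEveryFree) {
    g_trackEveryFree = trackEveryFree;
    g_allocatorQueries = 0;
    for (std::size_t i = 0; i < n; ++i) {
        VarSketch v;  // destroyed at the end of each iteration
    }
    return g_allocatorQueries;
}
```

In this sketch QueriesFor(600000, true) issues one allocator query per destroyed variable, while QueriesFor(600000, false) issues none, which is the shape of the saving reported for the FOR loop.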
@GillesDuvert : brilliant ! Thanks
Tested on an Intel OSX machine, using the script ...
GDL> time_test4
[...]
1.10098=Total Time, 0.021701576=Geometric mean, 25 tests.
GDL> TEST_LOOPS
% Time elapsed : 0.0098431110 seconds.
% Time elapsed : 0.010197878 seconds.
% Time elapsed : 0.0053970814 seconds.
% Time elapsed : 0.0092120171 seconds.
Congrats!!!!
OK, the performance issues within FOR loops, first detected on Mac M2/M3, are in fact also present on x86_64.
The OSX versions here were compiled with the script, and OpenMP is declared as ON. Tests 4, 5, 16 and 25 are all bad, and test 2 has also regressed since clang 17 :(
OSX gdl-1.0.2git230313      : clang 15.0.7_1  time_test4 : 21.8063=Total Time
OSX gdl-1.0.2git230420      : clang 16.0.1    time_test4 : 21.9463=Total Time (case 2 : 0.206096 Foreach, 6000000 elements)
OSX gdl-1.0.3git231123CMake : clang 17.0.4    time_test4 : 70.3176=Total Time (case 2 : 49.0947 Foreach, 6000000 elements)
OSX gdl-1.0.4git240222CMake : clang 17.0.6_1  time_test4 : 69.5246=Total Time
Unfortunately I cannot finish the compilation with GCC 13 because of duplicate symbols; datatypes.cpp.o is always involved ...