gogins / csound-extended

Extensions for Csound including algorithmic composition, Android app, and WebAssembly.
GNU Lesser General Public License v2.1

Investigate Csound multi-threading #164

Closed gogins closed 3 years ago

gogins commented 3 years ago

Document and research current speedups and what might be done to improve them. May become a pull request to Csound proper.

gogins commented 3 years ago
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 28.720s, CPU: 28.591s
512 2048 sample blks of floats written to test.wav (WAV)

mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav -j4
Elapsed time at end of performance: real: 236.495s, CPU: 617.696s
512 2048 sample blks of floats written to test.wav (WAV)

This is certainly atrocious.

Elapsed time at end of performance: real: 29.181s, CPU: 29.049s
512 2048 sample blks of floats written to test.wav (WAV)
mkg@xenakis:~/csound-examples/csd$ csound cloud-strata.csd -otest.wav

mkg@xenakis:~/csound-examples/csd$ csound cloud-strata.csd -otest.wav -j4
Elapsed time at end of performance: real: 16.318s, CPU: 47.446s
512 2048 sample blks of floats written to test.wav (WAV)

This is roughly what the Wikipedia article claims.
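
The speedups implied by the timings above can be checked with a short Python sketch. It only parses the "Elapsed time" lines quoted in this comment; the helper names (`parse_elapsed`, `speedup`, `busy_cores`) are illustrative assumptions and not part of Csound.

```python
import re

# Matches the summary line Csound prints at the end of a render.
ELAPSED = re.compile(r"real:\s*([0-9.]+)s,\s*CPU:\s*([0-9.]+)s")

def parse_elapsed(line):
    """Return (real_seconds, cpu_seconds) from an 'Elapsed time' line."""
    match = ELAPSED.search(line)
    if match is None:
        raise ValueError("no timing found in: " + line)
    return float(match.group(1)), float(match.group(2))

def speedup(single_line, multi_line):
    """Wall-clock speedup of the -j run over the single-threaded run."""
    single_real, _ = parse_elapsed(single_line)
    multi_real, _ = parse_elapsed(multi_line)
    return single_real / multi_real

def busy_cores(line):
    """Average number of cores kept busy: CPU time / wall-clock time."""
    real, cpu = parse_elapsed(line)
    return cpu / real

# Timing lines copied from the runs above.
trapped_1 = "Elapsed time at end of performance: real: 28.720s, CPU: 28.591s"
trapped_4 = "Elapsed time at end of performance: real: 236.495s, CPU: 617.696s"
cloud_1 = "Elapsed time at end of performance: real: 29.181s, CPU: 29.049s"
cloud_4 = "Elapsed time at end of performance: real: 16.318s, CPU: 47.446s"

print("trapped speedup with -j4:", round(speedup(trapped_1, trapped_4), 2))
print("cloud-strata speedup with -j4:", round(speedup(cloud_1, cloud_4), 2))
print("cloud-strata cores kept busy:", round(busy_cores(cloud_4), 2))
```

A value below 1 means the -j4 run was slower than the single-threaded run, which is the case for trapped-high-resolution.csd here, while cloud-strata.csd comes out somewhat under 2x.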

gogins commented 3 years ago

My CFLAGS:

mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 28.310s, CPU: 28.228s
512 2048 sample blks of floats written to test.wav (WAV)

Standard CFLAGS:

mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 52.184s, CPU: 52.056s
512 2048 sample blks of floats written to test.wav (WAV)
gogins commented 3 years ago

I'm going to profile this again and see what is happening.

gogins commented 3 years ago

I have profiled. The dag_get_task function is the main source of overhead (roughly half!) in multi-threaded rendering, in particular its calls to ATOMIC_READ, ATOMIC_WRITE, and ATOMIC_CAS. I do not (yet?) understand this code.

gogins commented 3 years ago

I have rewritten dag_get_task in my preferred textbook style:

Elapsed time at end of performance: real: 226.502s, CPU: 602.506s
512 2048 sample blks of floats written to temp.wav (WAV)
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -j4 -otemp.wav

This provides a modest speedup: real time drops from 236.495s to 226.502s, which is still far slower than the single-threaded run.

gogins commented 3 years ago

Not sure what happened here but:

Elapsed time at end of performance: real: 1.377s, CPU: 3.826s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest.wav

versus

Elapsed time at end of performance: real: 2.232s, CPU: 2.228s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav
tjingboem commented 3 years ago

I have never had better results with -j4 or -j8, so I never tell Csound to use multiple cores. There are a few papers on the subject, and I collected them on my webpage: http://www.geluidsmanvanhetnoorden.nl/index.php?p=1_7_Sound-Matters

It seems that gluing the data streams back together after multicore processing takes a lot of time...

gogins commented 3 years ago

As you see I am now getting much more favorable results in my tests. I will try to find out why. I don't see anything obvious in recent changes to Csound code.

gogins commented 3 years ago
Elapsed time at end of performance: real: 28.412s, CPU: 28.316s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/cloud-strata.csd -otest.wav

versus

Elapsed time at end of performance: real: 15.196s, CPU: 43.758s
512 2048 sample blks of floats written to test.wavq (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/cloud-strata.csd -j4 -otest.wav
tjingboem commented 3 years ago

That looks interesting. Your CFLAGS seem to do something for realtime use.

gogins commented 3 years ago

Without -march=native -Ofast:

Elapsed time at end of performance: real: 3.327s, CPU: 10.916s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest.wav

versus:

Elapsed time at end of performance: real: 7.745s, CPU: 7.726s
512 2048 sample blks of floats written to test.wavq (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wavq

This is still far better than we originally saw. I will try git bisect to find out what commit improved things so much.

gogins commented 3 years ago

I have done the git bisect without learning anything. I started with commit 670d093eb1d04bd7eb510af83d597473300546fd, which is from May 2; my first tests were on May 5.

Performance is consistently good in all commits and again far better than my first tests.

I must be missing something. It is logically possible that the cause is some improvement in the operating system or the C library but that does not seem likely to me. I have recently updated Ubuntu to version 20.04.2 and also updated the entire "C" toolchain.

In any event I can no longer reproduce the problem. The -j option appears to be performing as advertised.

gogins commented 3 years ago

One more thing to check. I had made this change:

diff --git a/include/csoundCore.h b/include/csoundCore.h
index d18318419..3312524e4 100644
--- a/include/csoundCore.h
+++ b/include/csoundCore.h
@@ -1774,7 +1774,7 @@ typedef struct _message_queue_t_ {
     int           dag_changed;
     int           dag_num_active;
     INSDS         **dag_task_map;
-    volatile stateWithPadding    *dag_task_status;
+    stateWithPadding    *dag_task_status;
     watchList     * volatile *dag_task_watch;
     watchList     *dag_wlmm;
     char          **dag_task_dep;

I am reverting that now... makes no difference.

tjingboem commented 3 years ago

--Csound version 6.16 (double samples) Jun 1 2021

csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 41.762s, CPU: 40.291s
csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -j8 -otest2.wav
Elapsed time at end of performance: real: 102.826s, CPU: 432.621s

menno@mennoASUSZ170 ~ $ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

gogins commented 3 years ago

Thanks for the information. I'm really mystified.

tjingboem commented 3 years ago

Your computer is 10x faster... can I buy it? :-)

gogins commented 3 years ago

If you do not have more than 1 core, your results make sense.

My computer is an Intel NUC, a small computer about the size of a paperback book. It has 4 cores and a solid state disk, and plenty of RAM. It is somewhat faster than average for a contemporary personal computer, but gaming computers and workstations can be significantly faster.

What is your operating system vendor and version, and what is the model of your computer?

tjingboem commented 3 years ago

I thought I had a decent desktop, and it is not very old...

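[image: Screenshot from 2021-06-02 14-46-54] https://user-images.githubusercontent.com/6670911/120482652-9d436700-c3b1-11eb-943c-690dcb8a3c2f.png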

gogins commented 3 years ago

Your machine should indeed be faster than mine. Do you have a solid state disk?

Also, try again with -j4, not -j8. You really only have 4 cores. Intel has this feature called "hyper-threading" where to some extent one core can run two threads, but I don't think it helps us here.
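
To tell physical cores from hyper-threaded logical CPUs before choosing a -j value, something like the following Python sketch may help. It assumes the x86 Linux /proc/cpuinfo layout, where each processor entry lists a "physical id" followed by a "core id"; the function name and the sample text are hypothetical.

```python
import os

def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "physical id":
            # Remember which CPU package the following core id belongs to.
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)

# Hypothetical excerpt: 4 logical CPUs that share 2 hyper-threaded cores.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

print("physical cores in sample:", count_physical_cores(SAMPLE))
# os.cpu_count() reports logical CPUs, so it includes hyper-threads.
print("logical CPUs on this machine:", os.cpu_count())
```

On a real Linux machine the text would come from reading /proc/cpuinfo; the physical core count is then the natural upper bound to try for -j.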


tjingboem commented 3 years ago
Elapsed time at end of performance: real: 63.538s, CPU: 194.477s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test4.wav (WAV)
menno@mennoASUSZ170 ~ $ csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -j4 -o test4.wav

This was done on my hard disk.

Elapsed time at end of performance: real: 63.394s, CPU: 193.310s
512 2048 sample blks of floats written to /home/menno/test5.wav (WAV)
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -j4 -o /home/menno/test5.wav

And this is from the SSD. There is no difference, so hard disk vs. SSD is not a factor. (Stupid question, but did you check that your blazing fast machine actually produces the wav?)

gogins commented 3 years ago

It is not a stupid question. I just did a fresh clone of the csound repository, and a fresh build, and got this:

Elapsed time at end of performance: real: 2.272s, CPU: 2.256s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav

versus

Elapsed time at end of performance: real: 1.378s, CPU: 3.844s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav

The file is indeed produced and sounds fine.

tjingboem commented 3 years ago

Your results are 50x faster... maybe we should have a poll so more Csound users can send in their results with xanadu-high-resolution.

gogins commented 3 years ago

I repeated the fresh build with the default CFLAGS and now get:

Elapsed time at end of performance: real: 3.120s, CPU: 10.195s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav

versus

Elapsed time at end of performance: real: 7.811s, CPU: 7.779s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav

This is indeed very strange.

gogins commented 3 years ago

By the way, when I do these tests I run this script to ensure consistent results. It removes the installed Csound first of all, and removes all soundfiles in the build directory where I do my tests.

#!/bin/bash
sudo -k make uninstall
rm -f CMakeCache.txt
cmake .. -DCMAKE_C_FLAGS="-march='native' -Ofast -Wno-error=stringop-truncation -Wno-error=format-truncation -g -Wno-error=deprecated-declarations" -DCMAKE_BUILD_TYPE=RelWithDebug -DFLUIDSYNTH_LIBRARIES:FILEPATH=/usr/local/lib64/libfluidsynth.so 
# cmake .. -DCMAKE_C_FLAGS="-Wno-error=stringop-truncation -Wno-error=format-truncation -g -Wno-error=deprecated-declarations" -DCMAKE_BUILD_TYPE=RelWithDebug -DFLUIDSYNTH_LIBRARIES:FILEPATH=/usr/local/lib64/libfluidsynth.so 
make clean
rm *.wav
rm *.jar
echo "Before make..."
ls -ll
make VERBOSE=1 -j6
echo "After make..."
ls -ll *.so
ls -ll csound
sudo -k make install
sudo -k ldconfig
gogins commented 3 years ago

I'm going to go a year back and see what happens. If there is a performance difference I will try git bisect again.

tjingboem commented 3 years ago
Elapsed time at end of performance: real: 40.104s, CPU: 39.905s
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -o /home/menno/test7.wav

and

Elapsed time at end of performance: real: 61.919s, CPU: 186.789s
512 2048 sample blks of floats written to /home/menno/test8.wav (WAV)
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -j4 -o /home/menno/test8.wav

I used your script to uninstall the old Csound (from yesterday) and it installed the new Csound.

It is a bit faster now without -j4; with -j4 it is slower.

I must add that it is quite busy when allocating instruments:

new alloc for instr 1:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
B  0.000 ..  0.100 T  0.100 TT  0.100 M:   2017.5   2013.5
new alloc for instr 1:
B  0.100 ..  0.200 T  0.200 TT  0.200 M:   5897.1   3925.8
new alloc for instr 1:
B  0.200 ..  0.300 T  0.300 TT  0.300 M:   8129.4   5157.1
new alloc for instr 1:
B  0.300 ..  0.400 T  0.400 TT  0.400 M:  11220.9   7520.4
new alloc for instr 1:
B  0.400 ..  0.500 T  0.500 TT  0.500 M:  11970.9   8076.2
new alloc for instr 1:
B  0.500 ..  7.500 T  7.500 TT  7.500 M:  12928.3  16287.0
new alloc for instr 2:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
B  7.500 ..  7.600 T  7.600 TT  7.600 M:   6159.6   6422.8
new alloc for instr 2:

It pauses for a few seconds each time 'new alloc for instr x' appears... perhaps that is a hint.

gogins commented 3 years ago

Thanks for the information! If it pauses in any noticeable way at all, that sure is a hint. I will investigate.

tjingboem commented 3 years ago

https://user-images.githubusercontent.com/6670911/120529454-d7762e00-c3dc-11eb-9bed-756ac88a1617.mp4

gogins commented 3 years ago

I just repeated my test on my Asus standalone:

Elapsed time at end of performance: real: 2.136s, CPU: 6.462s
512 2048 sample blks of floats written to test-j4.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav

versus

Elapsed time at end of performance: real: 3.336s, CPU: 3.333s
512 2048 sample blks of floats written to test.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav

Not as good because the Asus has only 2 cores, but still, a speedup and not a slowdown with -j4.

gogins commented 3 years ago

Now I have reverted on the Asus to v6.15.0 (commit 18c2c7897425f462b9a7743cee157cb410c88198).

Elapsed time at end of performance: real: 2.180s, CPU: 6.453s
512 2048 sample blks of floats written to test-j4.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav

Same ballpark.

tjingboem commented 3 years ago

I think this is crazy. How can my faster machine produce much slower results than both of yours? And we both use Linux and gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. This is not acceptable. I'm happy for you that you're having great results, and that it IS possible; the proof is there. I want that too. And I think these results should be more or less equal for all users, across the different platforms.

A bit of a rant now: this could be a great selling point for Csound, which has been around for so long and is still contemporary. It would show that Csound being 'old' is not the same as being sluggish and outdated. We need Csound to run equally fast on the same machine under different platforms. This should be prioritized as a community effort.

Perhaps there is a process running in the background on my machine that I don't know about? O mystery of mysteries...

gogins commented 3 years ago

I have attached my version of xanadu-high-resolution.csd (xanadu-high-resolution.csd.txt). Please note that in this version, ksmps is set to 128.

Also check your .csoundrc. I have removed .csoundrc from my system; you should move or remove this file before testing.

Changing ksmps to 1 as in the original version of xanadu-high-resolution.csd gives:

Elapsed time at end of performance: real: 37.775s, CPU: 114.917s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -j4 -otest-j4.wav

versus

Elapsed time at end of performance: real: 17.625s, CPU: 17.577s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -otest.wav

I should have mentioned earlier that ksmps should be 128, sorry about that!

This behavior is to be expected. There is 128 times as much threading overhead with ksmps = 1. Please try again with ksmps = 128.
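
The factor of 128 follows from the number of control periods per second, which is sr / ksmps. A small Python sketch, assuming sr = 48000 (the rate of my version of the file) and assuming the thread synchronization cost is paid once per control period; the function name is illustrative only:

```python
def k_cycles_per_second(sr, ksmps):
    """Control periods (k-cycles) rendered per second of audio."""
    return sr / ksmps

sr = 48000  # assumed sample rate of the ksmps = 128 version of the csd

cycles_k1 = k_cycles_per_second(sr, 1)      # ksmps = 1
cycles_k128 = k_cycles_per_second(sr, 128)  # ksmps = 128

print("k-cycles per second at ksmps = 1:  ", cycles_k1)
print("k-cycles per second at ksmps = 128:", cycles_k128)
# If the threads are dispatched once per k-cycle, this ratio is how many
# times more often the synchronization cost is paid with ksmps = 1.
print("ratio:", cycles_k1 / cycles_k128)
```

With ksmps = 1 each dispatch also covers only one sample of work per instrument instance, so there is little computation to amortize the overhead against.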

tjingboem commented 3 years ago

Ha! That sure is different from the csd I was using. Mine had

sr          =           88200
ksmps       =           1

and the csd you propose to test is not high resolution...hence the confusion.

This changes things, now that we are on the same page:

Elapsed time at end of performance: real: 5.470s, CPU: 5.409s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test12.csd (WAV)
menno@mennoASUSZ170 /media/menno/datae/onderzoek/multicore $ csound xanaduHR.csd -o test12.csd

and

Elapsed time at end of performance: real: 1.999s, CPU: 6.782s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test13.csd (WAV)
menno@mennoASUSZ170 /media/menno/datae/onderzoek/multicore $ csound xanaduHR.csd -j4 -o test13.csd

I also see that higher resolution and/or lower ksmps values have a big impact on whether multicore or single-core is the better choice... on my machine anyway.

gogins commented 3 years ago

OK, this is really good to know, I think we have got to the bottom of this. I will just take a look at changing only the sample rate in my example and see if that does anything. If there are no further surprises I will close the issue.

In this piece "high resolution" also includes other things like using bigger wavetables, more precise oscillators, and a-rate envelopes.


gogins commented 3 years ago

Changing only sr (from 48000 to 96000):

Elapsed time at end of performance: real: 4.099s, CPU: 4.093s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -otest.wav

versus

Elapsed time at end of performance: real: 2.449s, CPU: 6.812s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -j4 -otest-j4.wav

There is still a decent speedup.

gogins commented 3 years ago

I am closing this issue. It seems that the -j option is behaving as specified, and in accordance with general experience for complex multithreaded code, i.e. speed increases roughly 50% for every doubling of cores.
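
As a rough check of that rule of thumb against the runs in this thread, here is a Python sketch. The model (a factor of 1.5 per doubling of cores) is only the approximation stated above, not a property of Csound, and `expected_speedup` is an illustrative name.

```python
import math

def expected_speedup(cores, gain_per_doubling=1.5):
    """Rule-of-thumb speedup: multiply by gain_per_doubling per doubling of cores."""
    return gain_per_doubling ** math.log2(cores)

# (label, single-threaded real seconds, -j4 real seconds), taken from this thread.
observed = [
    ("cloud-strata, 4 cores", 28.412, 15.196),
    ("xanadu, -march=native -Ofast, 4 cores", 2.272, 1.378),
    ("xanadu, default CFLAGS, 4 cores", 7.811, 3.120),
    ("xanadu, sr = 96000, 4 cores", 4.099, 2.449),
]

print("rule of thumb for 4 cores:", expected_speedup(4))
for label, single, multi in observed:
    print(label, "->", round(single / multi, 2))
```

The measured ratios scatter around the rule-of-thumb value rather than matching it exactly, which fits "roughly".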