Closed gogins closed 3 years ago
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 28.720s, CPU: 28.591s
512 2048 sample blks of floats written to test.wav (WAV)
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav -j4
Elapsed time at end of performance: real: 236.495s, CPU: 617.696s
512 2048 sample blks of floats written to test.wav (WAV)
This is certainly atrocious.
Elapsed time at end of performance: real: 29.181s, CPU: 29.049s
512 2048 sample blks of floats written to test.wav (WAV)
mkg@xenakis:~/csound-examples/csd$ csound cloud-strata.csd -otest.wav
mkg@xenakis:~/csound-examples/csd$ csound cloud-strata.csd -otest.wav -j4
Elapsed time at end of performance: real: 16.318s, CPU: 47.446s
512 2048 sample blks of floats written to test.wav (WAV)
This is roughly what the Wikipedia article claims.
My CFLAGS:
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 28.310s, CPU: 28.228s
512 2048 sample blks of floats written to test.wav (WAV)
Standard CFLAGS:
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 52.184s, CPU: 52.056s
512 2048 sample blks of floats written to test.wav (WAV)
I'm going to profile this again and see what is happening.
I have profiled. The dag_get_task function is the main source of overhead (roughly half!) in multi-threaded rendering, and within that function in particular the calls to ATOMIC_READ, ATOMIC_WRITE, and ATOMIC_CAS. I do not (yet?) understand this code.
I have rewritten dag_get_task in my preferred textbook style:
Elapsed time at end of performance: real: 226.502s, CPU: 602.506s
512 2048 sample blks of floats written to temp.wav (WAV)
mkg@xenakis:~/csound-examples/csd$ csound trapped-high-resolution.csd -j4 -otemp.wav
This provides a modest speedup.
Not sure what happened here but:
Elapsed time at end of performance: real: 1.377s, CPU: 3.826s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest.wav
versus
Elapsed time at end of performance: real: 2.232s, CPU: 2.228s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav
I have never had better results with -j4 or -j8, so I never tell the computer to use multiple cores with Csound. There are a few papers on the subject, and I have collected them on my webpage: http://www.geluidsmanvanhetnoorden.nl/index.php?p=1_7_Sound-Matters
It seems that gluing the data streams back together after they have been processed on multiple cores takes a lot of time...
As you see I am now getting much more favorable results in my tests. I will try to find out why. I don't see anything obvious in recent changes to Csound code.
Elapsed time at end of performance: real: 28.412s, CPU: 28.316s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/cloud-strata.csd -otest.wav
versus
Elapsed time at end of performance: real: 15.196s, CPU: 43.758s
512 2048 sample blks of floats written to test.wavq (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/cloud-strata.csd -j4 -otest.wav
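The two cloud-strata.csd runs above can be compared numerically. What follows is a minimal Python sketch, not part of Csound: the helper names `parse_elapsed` and `speedup` are made up for illustration, and the only thing assumed about Csound is the format of the "Elapsed time at end of performance" line as it appears in these logs.

```python
import re

# Matches the timing line as it appears in the logs above.
ELAPSED = re.compile(r"real:\s*([\d.]+)s,\s*CPU:\s*([\d.]+)s")

def parse_elapsed(line):
    """Return (real_seconds, cpu_seconds) parsed from one 'Elapsed time' line."""
    match = ELAPSED.search(line)
    if match is None:
        raise ValueError("no timing found in: " + line)
    return float(match.group(1)), float(match.group(2))

def speedup(single_line, threaded_line, threads):
    """Wall-clock speedup and per-thread efficiency of the threaded run."""
    real_single, _ = parse_elapsed(single_line)
    real_threaded, _ = parse_elapsed(threaded_line)
    ratio = real_single / real_threaded
    return ratio, ratio / threads

# Timing lines copied from the cloud-strata.csd runs above.
single = "Elapsed time at end of performance: real: 28.412s, CPU: 28.316s"
threaded = "Elapsed time at end of performance: real: 15.196s, CPU: 43.758s"

ratio, efficiency = speedup(single, threaded, threads=4)
print(f"speedup {ratio:.2f}x, efficiency {efficiency:.0%}")
```

For these two lines the wall-clock speedup comes out at about 1.87x, which is a little under half of the ideal 4x for four threads. The CPU figure rising from about 28 s to about 44 s is the extra work spent on threading.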
That looks interesting. Your CFLAGS seem to do something for realtime use.
Without -march=native -Ofast:
Elapsed time at end of performance: real: 3.327s, CPU: 10.916s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest.wav
versus:
Elapsed time at end of performance: real: 7.745s, CPU: 7.726s
512 2048 sample blks of floats written to test.wavq (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wavq
This is still far better than we originally saw. I will try git bisect to find out what commit improved things so much.
I have done the git bisect without learning anything. I started with commit 670d093eb1d04bd7eb510af83d597473300546fd, which is from May 2; my first tests were on May 5.
Performance is consistently good in all commits and again far better than my first tests.
I must be missing something. It is logically possible that the cause is some improvement in the operating system or the C library but that does not seem likely to me. I have recently updated Ubuntu to version 20.04.2 and also updated the entire "C" toolchain.
In any event I can no longer reproduce the problem. The -j option appears to be performing as advertised.
One more thing to check, I had done this:
diff --git a/include/csoundCore.h b/include/csoundCore.h
index d18318419..3312524e4 100644
--- a/include/csoundCore.h
+++ b/include/csoundCore.h
@@ -1774,7 +1774,7 @@ typedef struct _message_queue_t_ {
     int dag_changed;
     int dag_num_active;
     INSDS **dag_task_map;
-    volatile stateWithPadding *dag_task_status;
+    stateWithPadding *dag_task_status;
     watchList * volatile *dag_task_watch;
     watchList *dag_wlmm;
     char **dag_task_dep;
I am reverting that now... makes no difference.
--Csound version 6.16 (double samples) Jun 1 2021
csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -otest.wav
Elapsed time at end of performance: real: 41.762s, CPU: 40.291s
csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -j8 -otest2.wav
Elapsed time at end of performance: real: 102.826s, CPU: 432.621s
menno@mennoASUSZ170 ~ $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Thanks for the information. I'm really mystified.
Your computer is 10x faster... can I buy it :-) ?
If you do not have more than 1 core, your results make sense.
My computer is an Intel NUC, a small computer about the size of a paperback book. It has 4 cores and a solid state disk, and plenty of RAM. It is somewhat faster than average for a contemporary personal computer, but gaming computers and workstations can be significantly faster.
What is your operating system vendor and version, what is the model of your computer?
I thought I had a decent desktop, and it is not very old...
Your machine should indeed be faster than mine, do you have a solid state disk?
Also, try again with -j4, not -j8. You really only have 4 cores. Intel has this feature called "hyper-threading" where to some extent one core can run two threads, but I don't think it helps us here.
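As a rough illustration of that advice, here is a small Python sketch for choosing a thread count. It rests on assumptions: `suggest_j` is a hypothetical helper, and the default of two hardware threads per core only holds on machines with hyper-threading enabled. `os.cpu_count()` reports logical CPUs, not physical cores, which is why the division is needed.

```python
import os

def suggest_j(logical_cpus, threads_per_core=2):
    """Suggest a -j value equal to the estimated number of physical cores.

    threads_per_core=2 is an assumption (hyper-threading enabled);
    pass 1 for a CPU without it.
    """
    return max(1, logical_cpus // threads_per_core)

# os.cpu_count() counts logical CPUs and may return None on some systems.
logical = os.cpu_count() or 1
print(f"logical CPUs: {logical}, suggested option: -j{suggest_j(logical)}")
```

On a machine that reports 8 logical CPUs with hyper-threading, this suggests -j4 rather than -j8.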
[image: Screenshot from 2021-06-02 14-46-54] https://user-images.githubusercontent.com/6670911/120482652-9d436700-c3b1-11eb-943c-690dcb8a3c2f.png
Elapsed time at end of performance: real: 63.538s, CPU: 194.477s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test4.wav (WAV)
menno@mennoASUSZ170 ~ $ csound /media/menno/datae/Downloads/csound-extended-develop/test-examples/csound/xanadu-high-resolution.csd -j4 -o test4.wav
This was done on my hard disk.
Elapsed time at end of performance: real: 63.394s, CPU: 193.310s
512 2048 sample blks of floats written to /home/menno/test5.wav (WAV)
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -j4 -o /home/menno/test5.wav
And this was done on the SSD. There is no difference, so hard disk versus SSD has no influence. (Stupid question, but did you check whether your blazing fast machine actually produces the WAV file?)
It is not a stupid question. I just did a fresh clone of the csound repository, and a fresh build, and got this:
Elapsed time at end of performance: real: 2.272s, CPU: 2.256s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav
versus
Elapsed time at end of performance: real: 1.378s, CPU: 3.844s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav
The file is indeed produced and sounds fine.
Your results are 50x faster... maybe we should have a poll so that more Csound users can send in their results with xanadu-high-resolution.csd.
I repeated the fresh build with the default CFLAGS and now get:
Elapsed time at end of performance: real: 3.120s, CPU: 10.195s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav
versus
Elapsed time at end of performance: real: 7.811s, CPU: 7.779s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav
This is indeed very strange.
By the way, when I do these tests I run this script to ensure consistent results. It removes the installed Csound first of all, and removes all soundfiles in the build directory where I do my tests.
#!/bin/bash
# Remove the currently installed Csound so that stale binaries are not tested.
sudo -k make uninstall
# Force CMake to reconfigure from scratch.
rm -f CMakeCache.txt
# Configure with my CFLAGS (-march=native -Ofast).
cmake .. -DCMAKE_C_FLAGS="-march='native' -Ofast -Wno-error=stringop-truncation -Wno-error=format-truncation -g -Wno-error=deprecated-declarations" -DCMAKE_BUILD_TYPE=RelWithDebug -DFLUIDSYNTH_LIBRARIES:FILEPATH=/usr/local/lib64/libfluidsynth.so
# Alternative: configure with the default CFLAGS.
# cmake .. -DCMAKE_C_FLAGS="-Wno-error=stringop-truncation -Wno-error=format-truncation -g -Wno-error=deprecated-declarations" -DCMAKE_BUILD_TYPE=RelWithDebug -DFLUIDSYNTH_LIBRARIES:FILEPATH=/usr/local/lib64/libfluidsynth.so
make clean
# Remove old soundfiles and jars; -f avoids an error when there are none.
rm -f *.wav
rm -f *.jar
echo "Before make..."
ls -ll
make VERBOSE=1 -j6
echo "After make..."
ls -ll *.so
ls -ll csound
sudo -k make install
sudo -k ldconfig
I'm going to go a year back and see what happens. If there is a performance difference I will try git bisect again.
Elapsed time at end of performance: real: 40.104s, CPU: 39.905s
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -o /home/menno/test7.wav
and
Elapsed time at end of performance: real: 61.919s, CPU: 186.789s
512 2048 sample blks of floats written to /home/menno/test8.wav (WAV)
menno@mennoASUSZ170 ~ $ csound xanadu-high-resolution.csd -j4 -o /home/menno/test8.wav
I used your script to uninstall the old Csound (from yesterday) and install the new Csound.
It is a bit faster now without -j4; with -j4 it is slower.
And I must add that it is quite busy when allocating instruments:
new alloc for instr 1:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
B 0.000 .. 0.100 T 0.100 TT 0.100 M: 2017.5 2013.5
new alloc for instr 1:
B 0.100 .. 0.200 T 0.200 TT 0.200 M: 5897.1 3925.8
new alloc for instr 1:
B 0.200 .. 0.300 T 0.300 TT 0.300 M: 8129.4 5157.1
new alloc for instr 1:
B 0.300 .. 0.400 T 0.400 TT 0.400 M: 11220.9 7520.4
new alloc for instr 1:
B 0.400 .. 0.500 T 0.500 TT 0.500 M: 11970.9 8076.2
new alloc for instr 1:
B 0.500 .. 7.500 T 7.500 TT 7.500 M: 12928.3 16287.0
new alloc for instr 2:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
new alloc for instr 3:
B 7.500 .. 7.600 T 7.600 TT 7.600 M: 6159.6 6422.8
new alloc for instr 2:
It just pauses for a few seconds each time 'new alloc for instr x' appears... perhaps that is a hint.
Thanks for the information! If it pauses in any noticeable way at all, that sure is a hint. I will investigate.
I just repeated my test on my Asus standalone:
Elapsed time at end of performance: real: 2.136s, CPU: 6.462s
512 2048 sample blks of floats written to test-j4.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav
versus
Elapsed time at end of performance: real: 3.336s, CPU: 3.333s
512 2048 sample blks of floats written to test.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -otest.wav
Not as good because the Asus has only 2 cores, but still, a speedup and not a slowdown with -j4.
Now I have reverted on the Asus to v6.15.0 (commit 18c2c7897425f462b9a7743cee157cb410c88198).
Elapsed time at end of performance: real: 2.180s, CPU: 6.453s
512 2048 sample blks of floats written to test-j4.wav (WAV)
mkg@Sun-Yuong:~/csound/build$ csound ~/csound-examples/csd/xanadu-high-resolution.csd -j4 -otest-j4.wav
Same ballpark.
I think this is crazy. How can my faster machine produce much slower results than both of yours? And we both use Linux and gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. This is not acceptable. I'm happy for you that you're having great results, and that it IS possible: the proof is there. I want that too. And I think these results should be more or less equal for all users, across the different platforms.
A bit of a rant now: this could be a great selling point for Csound, which has been around for so long and is still contemporary. It would show that Csound being old does not mean it is sluggish and outdated. We need Csound to run equally fast on the same machine under different platforms. This should be prioritized as a community effort.
Perhaps there is a process running in the background on my machine that I don't know about? O mystery of mysteries...
I have attached my version of xanadu-high-resolution.csd. Please note that in this version, ksmps is set to 128. xanadu-high-resolution.csd.txt
Also check your .csoundrc. I have removed .csoundrc from my system, you should move or remove this file before testing.
Changing ksmps to 1 as in the original version of xanadu-high-resolution.csd gives:
Elapsed time at end of performance: real: 37.775s, CPU: 114.917s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -j4 -otest-j4.wav
versus
Elapsed time at end of performance: real: 17.625s, CPU: 17.577s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -otest.wav
I should have mentioned ksmps should be 128 before, sorry about that!
This behavior is to be expected. There is 128 times as much threading overhead with ksmps = 1. Please try again with ksmps = 128.
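The factor of 128 follows from the number of control periods (k-cycles) per second of audio, which is sr / ksmps. A small Python sketch of that arithmetic follows. It assumes that threads synchronize once per k-cycle at a roughly constant cost, which is a simplification; `k_cycles_per_second` is a made-up helper name.

```python
def k_cycles_per_second(sr, ksmps):
    """Number of control periods (k-cycles) per second of rendered audio."""
    return sr / ksmps

sr = 48000  # sample rate of the ksmps = 128 version discussed in this thread

fine = k_cycles_per_second(sr, ksmps=1)      # one sync point per sample
coarse = k_cycles_per_second(sr, ksmps=128)  # one sync point per 128 samples

print(f"ksmps = 1:   {fine:.0f} k-cycles per second")
print(f"ksmps = 128: {coarse:.0f} k-cycles per second")
print(f"ratio:       {fine / coarse:.0f}x as many synchronization points")
```

The audio work per second is the same in both cases; only the number of times the threads have to be coordinated changes.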
Ha! That sure is different from the csd I was using. Mine had
sr = 88200
ksmps = 1
and the csd you propose to test is not high resolution...hence the confusion.
This changes things, now that we are on the same page:
Elapsed time at end of performance: real: 5.470s, CPU: 5.409s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test12.csd (WAV)
menno@mennoASUSZ170 /media/menno/datae/onderzoek/multicore $ csound xanaduHR.csd -o test12.csd
and
Elapsed time at end of performance: real: 1.999s, CPU: 6.782s
512 2048 sample blks of floats written to /media/menno/datae/csounddata/Output/test13.csd (WAV)
menno@mennoASUSZ170 /media/menno/datae/onderzoek/multicore $ csound xanaduHR.csd -j4 -o test13.csd
I also see that higher resolution and/or lower ksmps values have a big impact on whether multicore or single-core is the better choice... on my machine anyway.
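One way to picture this trade-off is a toy cost model in which the per-sample work divides across threads while a fixed synchronization cost is paid once per k-cycle. The Python sketch below is only an illustration: the function names are hypothetical, and the two cost constants are invented numbers, not measurements from Csound.

```python
def render_time(ksmps, threads, work_per_sample, sync_cost, total_samples):
    """Toy model: sample work divides across threads, per-cycle sync cost does not."""
    cycles = total_samples / ksmps
    compute = total_samples * work_per_sample / threads
    overhead = cycles * sync_cost if threads > 1 else 0.0
    return compute + overhead

def break_even_ksmps(threads, work_per_sample, sync_cost):
    """ksmps at which the threaded and single-threaded model times are equal."""
    return sync_cost / (work_per_sample * (1.0 - 1.0 / threads))

# Hypothetical constants, chosen only to make the shape of the curve visible.
WORK = 1e-6        # seconds of computation per sample
SYNC = 20e-6       # seconds of thread coordination per k-cycle
SAMPLES = 48000 * 60

for ksmps in (1, 128):
    single = render_time(ksmps, 1, WORK, SYNC, SAMPLES)
    multi = render_time(ksmps, 4, WORK, SYNC, SAMPLES)
    better = "multicore" if multi < single else "single core"
    print(f"ksmps = {ksmps}: {better} is faster in this model")

print(f"break-even ksmps: {break_even_ksmps(4, WORK, SYNC):.1f}")
```

In this model, heavier per-sample work (for example a more demanding orchestra) lowers the break-even ksmps, and a higher synchronization cost raises it.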
OK, this is really good to know, I think we have got to the bottom of this. I will just take a look at changing only the sample rate in my example and see if that does anything. If there are no further surprises I will close the issue.
In this piece "high resolution" also includes other things, such as bigger wavetables, more precise oscillators, and a-rate envelopes.
Changing only sr (from 48000 to 96000):
Elapsed time at end of performance: real: 4.099s, CPU: 4.093s
512 2048 sample blks of floats written to test.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -otest.wav
versus
Elapsed time at end of performance: real: 2.449s, CPU: 6.812s
512 2048 sample blks of floats written to test-j4.wav (WAV)
vst3_host_t::~vst3_host_t.
mkg@xenakis:~/csound/build$ csound xanadu-high-resolution.csd -j4 -otest-j4.wav
There is still a decent speedup.
I am closing this issue. It seems that the -j option is behaving as specified, and in accordance with general experience for complex multithreaded code, i.e. speed increases roughly 50% for every doubling of cores.
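The rule of thumb can be checked against the last pair of timings. The Python sketch below does that and, as an additional reading that was not used in this thread, applies Amdahl's law to estimate what fraction of the work would have to run in parallel to give the observed speedup. Both function names are made up for illustration.

```python
import math

def rule_of_thumb_speedup(cores, gain_per_doubling=1.5):
    """Speedup predicted by 'roughly 50% faster for every doubling of cores'."""
    return gain_per_doubling ** math.log2(cores)

def amdahl_parallel_fraction(speedup, cores):
    """Parallel fraction p for which Amdahl's law, S = 1 / ((1 - p) + p / n),
    yields the given speedup S on n cores."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)

# Real times of the sr = 96000 runs above, without and with -j4.
observed = 4.099 / 2.449

print(f"rule of thumb for 4 cores: {rule_of_thumb_speedup(4):.2f}x")
print(f"observed with -j4:         {observed:.2f}x")
print(f"implied parallel fraction: {amdahl_parallel_fraction(observed, 4):.2f}")
```

The rule of thumb predicts 2.25x on four cores, while the observed figure is about 1.67x, so the rule is somewhat optimistic for this piece at these settings.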
Document and research current speedups and what might be done to improve them. May become a pull request to Csound proper.