Closed colluca closed 4 months ago
I tested this PR in our downstream docker image and testsuite (https://github.com/opencompl/Quidditch/blob/main/runtime/toolchain/Dockerfile) in part to evaluate whether this would then also close https://github.com/pulp-platform/snitch_cluster/pull/148.
Few things sadly happened as soon as I used the newer SHA with the newer verilator release:
snitch_cluster
are failing without yet using VLT_NUM_THREADS
. I haven not yet further looked into why. I might also be doing something wrong though as it seems to work on our executables. I also noticed that the log files have different names now (trace_hart_0000x.dasm
) so maybe this is also the cause.So the good news is that the new verilator model is a lot faster running a whole NN that leverages all 9 cores of a snitch cluster. Previously it took 5076 seconds; which has now been reduced to 2200 seconds, i.e. more than 50%.
I used VLT_NUM_THREADS=9
, thinking that is a reasonable inherent parallism of the verilator model but a more optimal one may exist (our machine have 16 threads to play with). For GitHub actions we would more likely scale this back to 4 at most probably.
The correctness issue in https://github.com/pulp-platform/snitch_cluster/pull/148 seems to have also been fixed
@zero9178 thanks for the detailed testing and report!
The last results look promising, so I would close https://github.com/pulp-platform/snitch_cluster/pull/148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?
Regarding the log names this should be due to https://github.com/pulp-platform/snitch_cluster/pull/58, but anyways the traces shouldn't affect the correctness of any of the software tests.
@zero9178 thanks for the detailed testing and report!
The last results look promising, so I would close #148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?
I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible. (Hopefully by next week when we're in Zurich :slightly_smiling_face:)
I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible. (Hopefully by next week when we're in Zurich 🙂)
Looking forward to meeting you and we can surely discuss this in that setting as well :)
I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seemed to have been a stack overflow after all. Culprit are the files/functions that verilator designates as Slow and therefore compiled with -O0
. Since neither GCC nor Clang performaned any optimizations, code in these files had excessive stack usage. Adding OPT_SLOW="-O1"
to my make
invocation when building bin/snitch_cluster.vlt
fixes it and made the clang built verilator also pass every test in the repo.
I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seemed to have been a stack overflow after all. Culprit are the files/functions that verilator designates as Slow and therefore compiled with
-O0
. Since neither GCC nor Clang performaned any optimizations, code in these files had excessive stack usage. AddingOPT_SLOW="-O1"
to mymake
invocation when buildingbin/snitch_cluster.vlt
fixes it and made the clang built verilator also pass every test in the repo.
Great news! :tada: I will proceed with the merge then :)
General
Details
--timing
to support timing constructs e.g. as will be present in iDMA tracer.VLT_JOBS
option to control the number of threads, useful to keep memory usage under control.github-ci.hjson
config with a reduced core count (4 cores) for use in the Github CI.default.hjson
, and not withfdiv.hjson
. In any case it's best to align all configurations to the default, so this is included as part of the PR.