Update Verilator to version 5.006

colluca commented 4 months ago

General

Addresses issue https://github.com/pulp-platform/snitch_cluster/issues/75.
Replaces https://github.com/pulp-platform/snitch_cluster/pull/76 after previously updating our Docker container to Ubuntu 22.04 in https://github.com/pulp-platform/snitch_cluster/pull/157.
Also attempts to build a multi-threaded Verilator model, fixing the issues observed in https://github.com/pulp-platform/snitch_cluster/pull/148.

Details

Add --timing to support timing constructs, e.g. those that will be present in the iDMA tracer.
Add VLT_JOBS option to control the number of threads, useful to keep memory usage under control.
Verilator 5.006 seems to require more memory than 4.110, specifically >16GB even with a single job, causing the GitHub CI to fail. We attempted hierarchical Verilation on the Snitch core complex, but this fails due to missing support for type parameters (see https://github.com/verilator/verilator/issues/5309). To solve this issue we provide the github-ci.hjson config with a reduced core count (4 cores) for use in the GitHub CI.
The out-of-memory failure occurred only with the default HW config default.hjson, and not with fdiv.hjson. In any case it's best to align all configurations to the default, so this is included as part of the PR.
Generalize all tests to work with an arbitrary number of cores (>4).
Temporarily disable ATAX test, to be fixed in a separate commit.

zero9178 commented 4 months ago

I tested this PR in our downstream docker image and testsuite (https://github.com/opencompl/Quidditch/blob/main/runtime/toolchain/Dockerfile) in part to evaluate whether this would then also close https://github.com/pulp-platform/snitch_cluster/pull/148.

A few things sadly happened as soon as I used the newer SHA with the newer Verilator release:

The verilated model seems to crash immediately if compiled with clang, even on the newest clang versions (tested 14 and 18.1). I suspected a stack overflow initially, but it also crashes with an unlimited stack, so I am not sure. It would be a shame to lose clang, as using clang previously cut our simulation time by 1/5. Applying PGO to the verilator build reduced simulation time by a further 50% (though this could also be done with GCC). I see the same behaviour in Verilator 5.026.
Even when building with GCC (we are running GCC 12.1, a tad newer than the GCC 11.1 in Ubuntu 22.04), half the tests in snitch_cluster are failing, without yet using VLT_NUM_THREADS. I have not yet looked further into why. I might also be doing something wrong, though, as it seems to work on our executables. I also noticed that the log files have different names now (trace_hart_0000x.dasm), so maybe this is also the cause.
Compiling it with 9 threads causes a bit of a slowdown on single-threaded executables (not unexpected). I am now running tests that are mostly multi-threaded workloads (as in, using all cores of the cluster).

zero9178 commented 4 months ago

So the good news is that the new Verilator model is a lot faster when running a whole NN that leverages all 9 cores of a Snitch cluster. Previously it took 5076 seconds, which has now been reduced to 2200 seconds, i.e. a reduction of more than 50%.
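For reference, the two runtimes quoted above can be turned into a relative reduction and a speedup factor. This is just a small sketch of the arithmetic; the helper name `reduction_and_speedup` is made up for illustration and is not part of the repository.

```python
def reduction_and_speedup(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (fractional runtime reduction, speedup factor) for two runtimes."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("runtimes must be positive")
    reduction = (before_s - after_s) / before_s  # share of the old runtime that was saved
    speedup = before_s / after_s                 # how many times faster the new run is
    return reduction, speedup


# Runtimes in seconds, as reported in this comment.
reduction, speedup = reduction_and_speedup(5076, 2200)
print(f"runtime reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

With the numbers from this thread this works out to a reduction of roughly 57%, or a speedup of about 2.3x.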

I used VLT_NUM_THREADS=9, thinking that it reasonably matches the inherent parallelism of the Verilator model, but a better value may exist (our machines have 16 threads to play with). For GitHub Actions we would probably scale this back to 4 at most.

The correctness issue in https://github.com/pulp-platform/snitch_cluster/pull/148 seems to have been fixed as well.

colluca commented 4 months ago

@zero9178 thanks for the detailed testing and report!

The last results look promising, so I would close https://github.com/pulp-platform/snitch_cluster/pull/148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?

Regarding the log names, this should be due to https://github.com/pulp-platform/snitch_cluster/pull/58, but in any case the traces shouldn't affect the correctness of any of the software tests.

zero9178 commented 4 months ago

> @zero9178 thanks for the detailed testing and report!
>
> The last results look promising, so I would close #148 in favour of this. Did you manage to test this with GCC 11.1 in the end? Or were you able to fix the previous errors?

I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible. (Hopefully by next week when we're in Zurich :slightly_smiling_face:)

colluca commented 4 months ago

> I sadly haven't gotten around to doing this yet and I am currently short on time, but will do so as soon as possible. (Hopefully by next week when we're in Zurich 🙂)

Looking forward to meeting you and we can surely discuss this in that setting as well :)

zero9178 commented 4 months ago

I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seems to have been a stack overflow after all. The culprits are the files/functions that Verilator designates as Slow and therefore compiles with -O0. Since neither GCC nor Clang performed any optimizations, code in these files had excessive stack usage. Adding OPT_SLOW="-O1" to my make invocation when building bin/snitch_cluster.vlt fixed it, and the Clang-built Verilator model now also passes every test in the repo.
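As a rough illustration of the workaround, the make invocation could be assembled like this. This is only a sketch: the helper `make_command` is hypothetical, and passing VLT_NUM_THREADS on the same command line is an assumption based on how it is used earlier in this thread. Only OPT_SLOW="-O1" and the bin/snitch_cluster.vlt target come from the description above.

```python
import shlex


def make_command(target="bin/snitch_cluster.vlt", opt_slow="-O1", num_threads=None):
    """Build the argv list for the make invocation (hypothetical helper)."""
    # OPT_SLOW overrides the optimization level used for the files marked as Slow.
    argv = ["make", target, f"OPT_SLOW={opt_slow}"]
    if num_threads is not None:
        # Assumption: the thread count is passed as a make variable.
        argv.append(f"VLT_NUM_THREADS={num_threads}")
    return argv


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(make_command(num_threads=9)))
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the directory containing the Makefile once the printed command looks right.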

colluca commented 4 months ago

> I believe I have found and fixed the issue that affected Clang and potentially the newer GCC versions as well. It seems to have been a stack overflow after all. The culprits are the files/functions that Verilator designates as Slow and therefore compiles with -O0. Since neither GCC nor Clang performed any optimizations, code in these files had excessive stack usage. Adding OPT_SLOW="-O1" to my make invocation when building bin/snitch_cluster.vlt fixed it, and the Clang-built Verilator model now also passes every test in the repo.

Great news! :tada: I will proceed with the merge then :)

pulp-platform / snitch_cluster

Update Verilator to version 5.006 #158
