Closed cheatfate closed 2 months ago
The specific error currently appears to be the Linux max stack checking:
gcc @nimcache/release/all_tests/all_tests.makefile.linkerArgs
/github-runner/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-chronos/chronos/internal/asyncmacro.nim: In function ‘_ZN16readChunkPayload61readChunkPayload__OOZbeacon95chainZsyncZsync95protocol_u13995E3refIN7futures26FutureBasecolonObjectType_EE’:
/github-runner/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-chronos/chronos/internal/asyncmacro.nim:439:15: error: stack usage might be 1321040 bytes [-Werror=stack-usage=]
439 | iteratorNameSym = genSym(nskIterator, $prcName)
| ^
lto1: some warnings being treated as errors
Presumably one of the submodule bumps, or other changes, is triggering this; might be it be reasonable to bump the submodules not triggering said error to get fleet testing sooner and allow the specific causes of this to be narrowed down, or are there deeper dependencies between/across, say, nim-nat-traversal
and nim-blscurve
so it's important to bump those together?
Specific gcc 14 motivation aside, having clean submodule bumps is often better anyway.
Of course its important because all this packages had gcc-14
issues and all of them was fixed.
This problem is very interesting i have installed Ubuntu 22.04.4 VM and tried to run our test suite in there with same gcc
version and i have not got this error. Probably its because we assume that we will have default Linux stack size which is 8192 KBytes, and did not expect it to have only 1,321,040 bytes.
One more note, i have set stack size to 1024KBytes
and was able to compile and run tests on Ubuntu 22.04.4 with gcc-11
.
See https://github.com/status-im/nimbus-eth2/blob/d2a07514541ffe6ee02a2ec7272ce7a315131e04/config.nims#L68 which enables compile-time stack size checking
Of course its important because all this packages had
gcc-14
issues and all of them was fixed.
Yes, everything in this PR is important, for that reason. But the question is different. It's that there are some different orders in which the parts of this PR could be merged:
(a) all at once, e.g., this PR;
(b) nim-nat-traversal
, then separately other things;
(c) nim-blscurve
, then separately other things;
etc
One possibility is that the max stack usage calculation is being incrementally incremented by each bump or nimbus-eth2
change, so it'd end up being sort random, i.e. it wouldn't matter which submodule bump order was used, it'd be the first 3 which would look okay, and after that it'd trigger the CI build issue, but it could also be that one particular change in a submodule or in nimbus-eth2
is triggering this.
The point is, how detachable/separable are these changes. Of course it (to use one of the two arbitrary example submodules) nim-nat-traversal
or nim-blscurve
on its own won't resolve the gcc 14 issue in this repo, that's not the point. It's whether there specific harm in doing certain submodule bumps cleanly, separately from this PR, even if they don't in themselves 100% solve the gcc 14 issue, i.e. the "deeper dependencies" question/prompt.
They look mostly independent to me, and there have been situations in the past where we tried these large combined bumps, they caused problems, and then had to be selectively rolled back.
This problem is very interesting i have installed Ubuntu 22.04.4 VM and tried to run our test suite in there with same
gcc
version and i have not got this error. Probably its because we assume that we will have default Linux stack size which is 8192 KBytes, and did not expect it to have only 1,321,040 bytes.
See @arnetheduck 's comment, the stack size check is a deliberate CI thing, below what Linux by default supports. It's still a canary/signal worth noting.
That's interesting that we are doing this
if defined(limitStackUsage):
# This limits stack usage of each individual function to 1MB - the option is
# available on some GCC versions but not all - run with `-d:limitStackUsage`
# and look for .su files in "./build/", "./nimcache/" or $TMPDIR that list the
# stack size of each function.
switch("passC", "-fstack-usage -Werror=stack-usage=1048576")
switch("passL", "-fstack-usage -Werror=stack-usage=1048576")
To limit our stack usage to 1024KB
but for some reason in next line for Windows we doing this
switch("passL", "-Wl,--stack,8388608")
Could somebody explain me our plan about stack size?
They're distinct things, as the slightly broader context elucidates: https://github.com/status-im/nimbus-eth2/blob/d2a07514541ffe6ee02a2ec7272ce7a315131e04/config.nims#L68-L80
The 1MB limit comes from https://github.com/status-im/nimbus-eth2/pull/3055 which was a response to multi-MB maximum static stack sizes (https://github.com/status-im/nimbus-eth2/pull/3049) and sometimes resulting stack overflows at runtime.
So the 1MB limit is when specifically limitStackUsage
is defined for CI purposes, while the 8MB limit on Windows (https://learn.microsoft.com/en-us/windows/win32/procthread/thread-stack-size says the default is 1MB) comes a couple of years earlier from https://github.com/status-im/nimbus-eth2/pull/248 to fix https://github.com/status-im/nimbus-eth2/issues/78 and https://github.com/status-im/nimbus-eth2/issues/79 where this had proven empirically to be a problem. Why 8MB specifically was chosen, not sure, would guess it would have been to match Linux.
In general, these were done separately for separate reasons. I'd reconstruct this as probably:
ref
object usage, etc), along with adding to CI some way of checking this with some engineering margin so that the static CI limit is notably less than the actual limit-fstack-usage
is a per-function limit so we need more space for the full stack
Is this PR still relevant?
This PR was split into parts and merged.
Why 8MB specifically was chosen, not sure, would guess it would have been to match Linux.
Yes, it was easier if somehow Nimbus overflowed 8MB to have the error happening consistently on all platforms.
Unit Test Results
6 files 880 suites 18m 42s :stopwatch: 4 986 tests 4 638 :heavy_check_mark: 348 :zzz: 0 :x: 13 859 runs 13 503 :heavy_check_mark: 356 :zzz: 0 :x:
Results for commit ead7a0e8.