status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide
Other
523 stars 227 forks source link

GCC-14 issues fix #6332

Closed cheatfate closed 2 months ago

github-actions[bot] commented 3 months ago

Unit Test Results

         6 files       880 suites   18m 42s :stopwatch:   4 986 tests   4 638 :heavy_check_mark: 348 :zzz: 0 :x: 13 859 runs  13 503 :heavy_check_mark: 356 :zzz: 0 :x:

Results for commit ead7a0e8.

tersec commented 3 months ago

The specific error currently appears to be the Linux max stack checking:

gcc  @nimcache/release/all_tests/all_tests.makefile.linkerArgs
/github-runner/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-chronos/chronos/internal/asyncmacro.nim: In function ‘_ZN16readChunkPayload61readChunkPayload__OOZbeacon95chainZsyncZsync95protocol_u13995E3refIN7futures26FutureBasecolonObjectType_EE’:
/github-runner/workspace/nimbus-eth2/nimbus-eth2/vendor/nim-chronos/chronos/internal/asyncmacro.nim:439:15: error: stack usage might be 1321040 bytes [-Werror=stack-usage=]
  439 |       iteratorNameSym = genSym(nskIterator, $prcName)
      |               ^
lto1: some warnings being treated as errors

Presumably one of the submodule bumps, or other changes, is triggering this; might be it be reasonable to bump the submodules not triggering said error to get fleet testing sooner and allow the specific causes of this to be narrowed down, or are there deeper dependencies between/across, say, nim-nat-traversal and nim-blscurve so it's important to bump those together?

Specific gcc 14 motivation aside, having clean submodule bumps is often better anyway.

cheatfate commented 3 months ago

Of course its important because all this packages had gcc-14 issues and all of them was fixed. This problem is very interesting i have installed Ubuntu 22.04.4 VM and tried to run our test suite in there with same gcc version and i have not got this error. Probably its because we assume that we will have default Linux stack size which is 8192 KBytes, and did not expect it to have only 1,321,040 bytes.

cheatfate commented 3 months ago

One more note, i have set stack size to 1024KBytes and was able to compile and run tests on Ubuntu 22.04.4 with gcc-11.

arnetheduck commented 3 months ago

See https://github.com/status-im/nimbus-eth2/blob/d2a07514541ffe6ee02a2ec7272ce7a315131e04/config.nims#L68 which enables compile-time stack size checking

tersec commented 3 months ago

Of course its important because all this packages had gcc-14 issues and all of them was fixed.

Yes, everything in this PR is important, for that reason. But the question is different. It's that there are some different orders in which the parts of this PR could be merged: (a) all at once, e.g., this PR; (b) nim-nat-traversal, then separately other things; (c) nim-blscurve, then separately other things; etc

One possibility is that the max stack usage calculation is being incrementally incremented by each bump or nimbus-eth2 change, so it'd end up being sort random, i.e. it wouldn't matter which submodule bump order was used, it'd be the first 3 which would look okay, and after that it'd trigger the CI build issue, but it could also be that one particular change in a submodule or in nimbus-eth2 is triggering this.

The point is, how detachable/separable are these changes. Of course it (to use one of the two arbitrary example submodules) nim-nat-traversal or nim-blscurve on its own won't resolve the gcc 14 issue in this repo, that's not the point. It's whether there specific harm in doing certain submodule bumps cleanly, separately from this PR, even if they don't in themselves 100% solve the gcc 14 issue, i.e. the "deeper dependencies" question/prompt.

They look mostly independent to me, and there have been situations in the past where we tried these large combined bumps, they caused problems, and then had to be selectively rolled back.

This problem is very interesting i have installed Ubuntu 22.04.4 VM and tried to run our test suite in there with same gcc version and i have not got this error. Probably its because we assume that we will have default Linux stack size which is 8192 KBytes, and did not expect it to have only 1,321,040 bytes.

See @arnetheduck 's comment, the stack size check is a deliberate CI thing, below what Linux by default supports. It's still a canary/signal worth noting.

cheatfate commented 3 months ago

That's interesting that we are doing this

if defined(limitStackUsage):
  # This limits stack usage of each individual function to 1MB - the option is
  # available on some GCC versions but not all - run with `-d:limitStackUsage`
  # and look for .su files in "./build/", "./nimcache/" or $TMPDIR that list the
  # stack size of each function.
  switch("passC", "-fstack-usage -Werror=stack-usage=1048576")
  switch("passL", "-fstack-usage -Werror=stack-usage=1048576")

To limit our stack usage to 1024KB but for some reason in next line for Windows we doing this

switch("passL", "-Wl,--stack,8388608")

Could somebody explain me our plan about stack size?

tersec commented 3 months ago

They're distinct things, as the slightly broader context elucidates: https://github.com/status-im/nimbus-eth2/blob/d2a07514541ffe6ee02a2ec7272ce7a315131e04/config.nims#L68-L80

The 1MB limit comes from https://github.com/status-im/nimbus-eth2/pull/3055 which was a response to multi-MB maximum static stack sizes (https://github.com/status-im/nimbus-eth2/pull/3049) and sometimes resulting stack overflows at runtime.

So the 1MB limit is when specifically limitStackUsage is defined for CI purposes, while the 8MB limit on Windows (https://learn.microsoft.com/en-us/windows/win32/procthread/thread-stack-size says the default is 1MB) comes a couple of years earlier from https://github.com/status-im/nimbus-eth2/pull/248 to fix https://github.com/status-im/nimbus-eth2/issues/78 and https://github.com/status-im/nimbus-eth2/issues/79 where this had proven empirically to be a problem. Why 8MB specifically was chosen, not sure, would guess it would have been to match Linux.

In general, these were done separately for separate reasons. I'd reconstruct this as probably:

arnetheduck commented 3 months ago

-fstack-usage is a per-function limit so we need more space for the full stack

tersec commented 2 months ago

Is this PR still relevant?

cheatfate commented 2 months ago

This PR was split into parts and merged.

mratsim commented 2 months ago

Why 8MB specifically was chosen, not sure, would guess it would have been to match Linux.

Yes, it was easier if somehow Nimbus overflowed 8MB to have the error happening consistently on all platforms.