iTwin / itwinjs-core

Monorepo for iTwin.js Library
https://www.itwinjs.org
MIT License

Tests segfault on startup #5663

Closed pmconne closed 12 months ago

pmconne commented 1 year ago

Describe the bug

`rush cover` in CI jobs is producing segmentation faults on macOS and Linux.

To Reproduce

Steps to reproduce the behavior:

  1. Consult CI job results for #5661
  2. Observe multiple test suites producing seg faults on mac and linux.

Screenshots

`rush cover` summary (Linux)
```
==[ FAILURE: 7 operations ]====================================================
--[ FAILURE: @itwin/analytical-backend ]---------------------[ 5.70 seconds ]--
Segmentation fault (core dumped)
--[ FAILURE: @itwin/core-backend ]--------------------------[ 13.29 seconds ]--
Segmentation fault (core dumped)
--[ FAILURE: @itwin/linear-referencing-backend ]-------------[ 5.74 seconds ]--
Segmentation fault (core dumped)
--[ FAILURE: @itwin/physical-material-backend ]--------------[ 4.57 seconds ]--
Segmentation fault (core dumped)
--[ FAILURE: core-full-stack-tests ]------------------------[ 37.85 seconds ]--
WARNING: Tests attempted to load missing asset: "/locales/en-US/iModelJs.json"
WARNING: Tests attempted to load missing asset: "/locales/en-US/CoreTools.json"
WARNING: Tests attempted to load missing asset: "/locales/en-US/Editor.json"
WARNING: Tests attempted to load missing asset: "/locales/en/Editor.json"
WARNING: Tests attempted to load missing asset: "/locales/en-US/TestApp.json"
WARNING: Tests attempted to load missing asset: "/locales/en/TestApp.json"
Segmentation fault (core dumped)
--[ FAILURE: example-code-snippets ]-------------------------[ 4.01 seconds ]--
Segmentation fault (core dumped)
--[ FAILURE: presentation-full-stack-tests ]-----------------[ 5.19 seconds ]--
Invoking: npm run -s test
Backend PID: 19201
Default supplemental rules
[2023-06-21T01:35:05.890Z] Tests initialized
Content modifiers
bis.Element
Related properties
Operations failed.
rush cover (4 minutes 51.9 seconds)
##[error]Bash exited with code '1'.
Finishing: rush cover
```
`rush cover` summary (macOS)
```
==[ FAILURE: 8 operations ]====================================================
--[ FAILURE: @itwin/analytical-backend ]---------------------[ 3.68 seconds ]--
Invoking: nyc npm -s test
AnalyticalSchema
=============================== Coverage summary ===============================
Statements : Unknown% ( 0/0 )
Branches : Unknown% ( 0/0 )
Functions : Unknown% ( 0/0 )
Lines : Unknown% ( 0/0 )
================================================================================
--[ FAILURE: @itwin/core-backend ]---------------------------[ 8.20 seconds ]--
Invoking: nyc npm -s test
Category
=============================== Coverage summary ===============================
Statements : Unknown% ( 0/0 )
Branches : Unknown% ( 0/0 )
Functions : Unknown% ( 0/0 )
Lines : Unknown% ( 0/0 )
================================================================================
--[ FAILURE: @itwin/linear-referencing-backend ]-------------[ 2.67 seconds ]--
Invoking: nyc npm -s test
LinearReferencing Domain
=============================== Coverage summary ===============================
Statements : Unknown% ( 0/0 )
Branches : Unknown% ( 0/0 )
Functions : Unknown% ( 0/0 )
Lines : Unknown% ( 0/0 )
================================================================================
--[ FAILURE: @itwin/physical-material-backend ]--------------[ 2.78 seconds ]--
Invoking: nyc npm -s test
PhysicalMaterialSchema
=============================== Coverage summary ===============================
Statements : Unknown% ( 0/0 )
Branches : Unknown% ( 0/0 )
Functions : Unknown% ( 0/0 )
Lines : Unknown% ( 0/0 )
================================================================================
--[ FAILURE: core-full-stack-tests ]------------------------[ 20.89 seconds ]--
WARNING: Tests attempted to load missing asset: "/locales/en-US/iModelJs.json"
WARNING: Tests attempted to load missing asset: "/locales/en-US/CoreTools.json"
sh: line 1: 299 Segmentation fault: 11 npm run -s test:chrome
--[ FAILURE: example-code-app ]------------------------------[ 4.88 seconds ]--
sh: line 1: 710 Segmentation fault: 11 mocha --no-config
rush cover (3 minutes 45.5 seconds)
##[error]Bash exited with code '1'.
Finishing: rush cover
```

Additional context

Failing build pipeline. First observed in #5660. Occurred on all 3 runs of the pipeline.

#5661 produces similar results, with no code changes vs. master.

Each test suite crashes without completing a single test. Only test suites that use @itwin/core-backend are affected. The most recent addon included upgrades of several third-party libraries. These failures were not observed at the time the new addon was integrated. I was unable to reproduce the problem running `rush cover` on Ubuntu 22.04.

pmconne commented 1 year ago

@chuckkir @nick4598 any theories?

pmconne commented 1 year ago

Not observed in the nightly job for 4.1.0-dev.40. Observed in the nightly job for 4.1.0-dev.41. There were no commits to master in between those two.

pmconne commented 1 year ago

I just observed the seg fault running core-backend tests on Ubuntu on a branch newly-created from master. I'll see if I can repro with a locally-built (debuggable) addon.

nick4598 commented 1 year ago

Dev.41 is using Node 18.16.1 and dev.40 is using 18.16.0; not sure if that would be enough to cause the issue. 18.16.1 also came out within the last 24 hours, I believe.

Chuck also ran into segfaults when updating c-ares, and 18.16.1 appears to have included some c-ares vulnerability fixes. The fix in Chuck's case was to hide the symbols from the global namespace (similar to what Affan did to fix our segfaulting OpenSSL), because Node was stepping on the symbols from our version of c-ares in libsrc and causing segfaults. My guess is this is the cause.

We haven't produced a new addon with the fix, but it is already in master. https://github.com/iTwin/imodel-native/pull/297/files#diff-88ba601cab81905cdeda950a5c1189da911b5b755ac49e752ab75f19c5031eab:~:text=ifdef%20__unix,%25endif

We could possibly pin our node dependency down to 18.16.0 and get around this until we have an addon out.
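
To illustrate the symbol-hiding idea described above, here is a minimal sketch, assuming a GCC or Clang toolchain. It is not the actual imodel-native change (that lives in the linked PR); the names `ADDON_LOCAL`, `ADDON_EXPORT`, `vendored_timeout`, and `addon_poll` are hypothetical stand-ins for a vendored library function and the addon's public entry point.

```cpp
// Sketch only: symbol visibility in a shared library built with GCC/Clang.
// All names below are hypothetical and do not come from imodel-native.
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#  define ADDON_LOCAL  __attribute__((visibility("hidden")))
#  define ADDON_EXPORT __attribute__((visibility("default")))
#else
#  define ADDON_LOCAL
#  define ADDON_EXPORT
#endif

// Stand-in for a function from a vendored third-party library (c-ares in
// this thread). With hidden visibility it is left out of the shared
// object's dynamic symbol table, so calls to it from inside the library
// bind to this definition rather than to a same-named symbol exported by
// the host process.
extern "C" ADDON_LOCAL int vendored_timeout(int requested_ms) {
    return requested_ms < 0 ? 0 : requested_ms;
}

// The only symbol intended to be visible to the host process.
extern "C" ADDON_EXPORT int addon_poll(int requested_ms) {
    const int timeout = vendored_timeout(requested_ms);
    assert(timeout >= 0);  // the helper clamps negative requests to zero
    return timeout;
}
```

If this were built as a shared object (for example with `g++ -shared -fPIC`), listing its dynamic symbols with `nm -D` would be expected to show `addon_poll` but not `vendored_timeout`. Compiling the whole library with `-fvisibility=hidden` and marking only the public entry points as default is a common way to get the same effect without annotating every function.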

pmconne commented 1 year ago

> Maybe a stretch, but dev.41 is using node 18.16.1 and dev.40 is using 18.16.0

I repro'ed with 18.16.0. Only that once though. I suppose it's possible the crash is sporadic but more likely to occur with 18.16.1? I'll update to that.

chuckkir commented 1 year ago

It still could be the same problem with a different library; the symbols are weak objects, which means that the linker will choose one and it can change if one of the libraries changes. This article talks about the symbols pretty well although in the context of a different problem.

https://developers.redhat.com/articles/2021/10/27/compiler-option-hidden-visibility-and-weak-symbol-walk-bar#disabling_runtime_type_information
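
As a rough illustration of why weak symbols make the outcome depend on what else is linked in, here is a minimal sketch assuming GCC or Clang. The function names `resolver_timeout_ms` and `effective_timeout_ms` are hypothetical and unrelated to c-ares or the addon.

```cpp
// Sketch only: a weak definition with GCC/Clang. All names are hypothetical.
#include <cassert>

// Weak definition: the static linker keeps it only if no strong definition
// of the same symbol appears elsewhere in the link. Adding an object file
// that defines resolver_timeout_ms() normally would switch every caller
// to that definition without any source change here.
extern "C" __attribute__((weak)) int resolver_timeout_ms() {
    return 1000;
}

// Callers cannot tell from this source file which definition was chosen.
int effective_timeout_ms() {
    const int ms = resolver_timeout_ms();
    assert(ms > 0);  // sanity check on whichever definition was linked
    return ms;
}
```

In this single-file form the weak definition is the only one, so it is the one used. The point of the sketch is that the choice is made at link time, which is why changing a library version can change which copy of a symbol ends up being called.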

pmconne commented 1 year ago

Tests passed on macOS after @nick4598 forced them to use 18.16.0. Linux still running - no seg faults yet. I repro'ed immediately locally after switching to 18.16.1. Debugging...

pmconne commented 1 year ago

imodeljs.node!ares_timeout (Unknown Source:0)
imodeljs.node!Curl_resolver_getsock (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
imodeljs.node!curl_multi_wait (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
imodeljs.node!BentleyM0200::BeSQLite::CloudContainer::PollManifest() (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) (Unknown Source:0)
v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (Unknown Source:0)
v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) (Unknown Source:0)
Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit (Unknown Source:0)
Builtins_InterpreterEntryTrampoline (Unknown Source:0)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)
Builtins_JSConstructStubGeneric (Unknown Source:0)

nick4598 commented 1 year ago

Yep, that's the same call stack we got when debugging Chuck's branch, which also updated c-ares. The fix is in master on imodel-native, but not in an addon yet.

pmconne commented 1 year ago

Build to publish new addon keeps hanging on Linux. It stops producing output while running the following parts (same parts both times):

['ECPresentation:UnitTests-NonPublished', 'ECDb:RunGtest', 'iModelPlatform:UnitTests-NonPublished', 'iModelPlatform:BuildIModelEvolutionTests', 'Visualization:UnitTests']

I repro'ed locally on Ubuntu (it freezes my shell), but I failed to note the list of parts that were running when it hung. I rebuilt single-threaded (`bb -N1 b`) and again using 4 threads; both builds succeeded.

chuckkir commented 1 year ago

The build is taking the Linux boxes offline. I keep rerunning the one build and watch another machine go down. I feel like possibly the tests use a lot more memory than they did previously? What seems to be happening is that the box stops contacting the server so it is offline'd. My suspicion is that it is memory starved. I'm going to look at more logs to see if I can learn anything additional.

nick4598 commented 1 year ago

Reopening issue until we have a node addon for 3.x and 4.x which resolves the segfault. Fix is in both main branch and release/3.x branch already.

nick4598 commented 1 year ago

> The build is taking the Linux boxes offline. I keep rerunning the one build and watch another machine go down. I feel like possibly the tests use a lot more memory than they did previously? What seems to be happening is that the box stops contacting the server so it is offline'd. My suspicion is that it is memory starved. I'm going to look at more logs to see if I can learn anything additional.

Were you able to get any information from the logs? I noticed there are a few new 'Prepare' tests added shortly before we attempted the new addon. Maybe those tests aren't at fault but were just enough to push our memory usage too high and make it more likely that the Linux boxes would crash? https://github.com/iTwin/imodel-native/commit/5b9169c494bad9869bc1e307789eff708ce8b072

chuckkir commented 1 year ago

Nothing new. The logs say that the connection was lost. The boxes don't crash, but they stop talking to the server so they get listed as "offline".

tm-zub commented 12 months ago

@nick4598 you mentioned that the fix was already in the branches and that you reopened this issue so it could be closed once the fix was verified in 4.x. Can you check and mark this issue accordingly?

nick4598 commented 12 months ago

The 3.x fix is in itwinjs-core versions 3.7.11 and greater. Fix is also in all versions of itwinjs-core 4.1.x, and master as well.