pmconne closed this issue 1 year ago.
@chuckkir @nick4598 any theories?
Not observed in the nightly job for 4.1.0-dev.40. Observed in the nightly job for 4.1.0-dev.41. There were no commits to master in between those two.
I just observed the seg fault running core-backend tests on Ubuntu on a branch newly-created from master. I'll see if I can repro with a locally-built (debuggable) addon.
Dev.41 is using Node 18.16.1 and dev.40 is using 18.16.0; not sure if that would be enough to cause the issue. 18.16.1 also just came out within the last 24 hours, I believe.
Chuck also ran into segfaults when updating c-ares, and 18.16.1 appears to have included some c-ares vulnerability fixes. The fix in Chuck's case was to hide the symbols from the global symbol space (this is similar to what Affan did to fix our segfaulting OpenSSL), as Node was stepping on the symbols from our version of c-ares in libsrc and causing segfaults. My guess is this is the cause.
We haven't produced a new addon with the fix, but it is already in master. https://github.com/iTwin/imodel-native/pull/297/files#diff-88ba601cab81905cdeda950a5c1189da911b5b755ac49e752ab75f19c5031eab:~:text=ifdef%20__unix,%25endif
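For anyone reading along, a minimal sketch of the symbol-hiding approach, assuming a GCC/Clang toolchain on Linux; the function names and flags here are illustrative, not the actual imodel-native change:

```cpp
// Illustration only (simplified signature, hypothetical file names): compile
// the bundled c-ares objects with hidden visibility so calls like
// ares_timeout() resolve inside imodeljs.node instead of binding to the copy
// exported by the node executable.
//
// Build sketch:
//   g++ -shared -fPIC -fvisibility=hidden bundled_cares.cpp -o addon_piece.so

#include <cstdio>

// Hidden by -fvisibility=hidden: never enters the process-wide dynamic symbol
// table, so it cannot collide with (or be preempted by) node's own c-ares.
extern "C" void ares_timeout()
{
    std::puts("resolved inside the addon's bundled c-ares");
}

// Only the entry points the addon actually has to expose keep default visibility.
extern "C" __attribute__((visibility("default"))) void addon_entry()
{
    ares_timeout(); // binds locally within this shared object
}
```

On ELF platforms the executable's exported symbols are searched before a dlopen'ed addon's own definitions, which, as I understand it, is why the addon's libcurl ended up calling Node's c-ares.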
We could possibly pin our node dependency down to 18.16.0 and get around this until we have an addon out.
Maybe a stretch, but dev.41 is using node 18.16.1 and dev.40 is using 18.16.0
I repro'ed with 18.16.0. Only that once though. I suppose it's possible the crash is sporadic but more likely to occur with 18.16.1? I'll update to that.
It still could be the same problem with a different library; the symbols are weak objects, which means that the linker will choose one and it can change if one of the libraries changes. This article talks about the symbols pretty well although in the context of a different problem.
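A rough self-contained illustration of that behaviour (hypothetical library names, not our actual binaries): build this one file into two shared objects, and every caller in a process that loads both ends up on whichever definition the dynamic linker sees first.

```cpp
// weak_copy.cpp -- both liba.so and libb.so define shared_helper() as a weak
// symbol; the dynamic linker binds the whole process to the first definition
// in its lookup order, so upgrading either library can silently change which
// copy wins.
//
//   g++ -shared -fPIC -DVERSION=1 -o liba.so weak_copy.cpp
//   g++ -shared -fPIC -DVERSION=2 -o libb.so weak_copy.cpp

extern "C" __attribute__((weak)) int shared_helper()
{
    return VERSION; // the value callers observe depends on load order
}
```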
Tests passed on macOS after @nick4598 forced them to use 18.16.0. Linux still running - no seg faults yet. I repro'ed immediately locally after switching to 18.16.1. Debugging...
imodeljs.node!ares_timeout (Unknown Source:0)
imodeljs.node!Curl_resolver_getsock (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
imodeljs.node!curl_multi_wait (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
imodeljs.node!BentleyM0200::BeSQLite::CloudContainer::PollManifest() (Unknown Source:0)
imodeljs.node![Unknown/Just-In-Time compiled code] (Unknown Source:0)
v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) (Unknown Source:0)
v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (Unknown Source:0)
v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) (Unknown Source:0)
Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit (Unknown Source:0)
Builtins_InterpreterEntryTrampoline (Unknown Source:0)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)
Builtins_JSConstructStubGeneric (Unknown Source:0)
Yep, that's the same callstack we got when debugging Chuck's branch, which also updated c-ares. The fix is in master on imodel-native, but not in an addon yet.
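If it's useful later, here is a hedged sketch of how to check which library a symbol actually bound to. The dlsym/dladdr calls would normally run inside the addon (or any code loaded into the node process); the standalone main() below just shows the technique:

```cpp
// which_ares.cpp -- ask the dynamic loader where ares_timeout resolved in the
// global scope of this process. In the failing setup it would report the node
// executable rather than imodeljs.node. (g++ defines _GNU_SOURCE by default,
// which provides RTLD_DEFAULT and dladdr on glibc.)
//
// Build: g++ which_ares.cpp -ldl -o which_ares

#include <dlfcn.h>
#include <cstdio>

int main()
{
    void* sym = dlsym(RTLD_DEFAULT, "ares_timeout");
    if (sym == nullptr)
    {
        std::puts("ares_timeout is not visible in this process's global scope");
        return 1;
    }

    Dl_info info{};
    if (dladdr(sym, &info) && info.dli_fname != nullptr)
        std::printf("ares_timeout resolved from %s\n", info.dli_fname);

    return 0;
}
```

Running the failing test process with LD_DEBUG=bindings and grepping the output for ares_ gives the same answer without any code changes.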
The build to publish a new addon keeps hanging on Linux. It stops producing output while running the following parts (the same parts both times):
['ECPresentation:UnitTests-NonPublished', 'ECDb:RunGtest', 'iModelPlatform:UnitTests-NonPublished', 'iModelPlatform:BuildIModelEvolutionTests', 'Visualization:UnitTests']
I repro'ed locally on Ubuntu (freezes my shell). I failed to note the list of parts that were running when it hung. I rebuilt single-threaded (bb -N1 b) and again using 4 threads - both builds succeeded.
The build is taking the Linux boxes offline. I keep rerunning the one build and watching another machine go down. I feel like the tests possibly use a lot more memory than they did previously? What seems to be happening is that the box stops contacting the server, so it gets taken offline. My suspicion is that it is memory-starved. I'm going to look at more logs to see if I can learn anything additional.
Reopening issue until we have a node addon for 3.x and 4.x which resolves the segfault. Fix is in both main branch and release/3.x branch already.
Were you able to get any information from the logs? I noticed there are a few new 'Prepare' tests added shortly before we attempted the new addon. Maybe those tests aren't at fault but were just enough to push our memory usage too high and make it more likely that the Linux boxes would crash? https://github.com/iTwin/imodel-native/commit/5b9169c494bad9869bc1e307789eff708ce8b072
Nothing new. The logs say that the connection was lost. The boxes don't crash, but they stop talking to the server so they get listed as "offline".
@nick4598 you mentioned the fix was already in the branch and that you reopened this so it could be closed after verifying in 4.x. Can you check and mark this issue accordingly?
The 3.x fix is in itwinjs-core versions 3.7.11 and greater. Fix is also in all versions of itwinjs-core 4.1.x, and master as well.
Describe the bug
rush cover in CI jobs is producing segmentation faults on Mac and Linux.
Additional context
Failing build pipeline. First observed in #5660. Occurred on all 3 runs of the pipeline.
#5661 produces similar results, with no code changes vs master.
Each test suite crashes without completing a single test. Only test suites that use @itwin/core-backend are affected. The most recent addon included upgrades of several third-party libraries. These failures were not observed at the time the new addon was integrated. I failed to reproduce the problem running rush cover on Ubuntu 22.04.