nodejs / node-addon-api

Module for using Node-API from C++
MIT License
2.15k stars 460 forks source link

Napi::Value is not properly initialized the first time a program is run #1463

Closed martenrichter closed 5 months ago

martenrichter commented 6 months ago

I am maintaining a node.js addon that heavily switches between C++ and node.js code. Today, I found an issue that happens only once at every program execution. I have the following code:

      auto val =  Napi::Value::From(alarmjs_->Env(), timedelay);
      if (val.ToNumber().DoubleValue() != timedelay) { 
        printf("val %lg %lg\n", val.ToNumber().DoubleValue(), timedelay);
        val =  Napi::Value::From(alarmjs_->Env(), timedelay);
        printf("val2 %lg %lg\n", val.ToNumber().DoubleValue(), timedelay);
    }

The if clause checks if the correct number is assigned to the Napi::Value. The first time this code is called, the assignment is not correct. Here are some examples:

Server process
val 9.61433e-312 200000
val2 200000 200000
Client process
val 1.3559e-311 3999.79
val2 3999.79 3999.79
Server process second run
val 1.21757e-311 200000
val2 200000 200000
Client process second run
val 6.54648e-312 3999.79
val2 3999.79 3999.79

The value changes, so it smells like uninitialized memory.

So far, I have tested it on a Windows build with node version v18.15.0 built by clang version 14.0.4 and "node-addon-api": "^7.0.0". The JS part calls C++, and the JS part then calls C++, which might again call JS, etc. It is a very involved network add-on.

The problem may not be in node-addon-api but in node.js itself or v8, or I may be doing something stupid.

martenrichter commented 6 months ago

It does not occur on Linux with node version "v18.18.0." Also, it failed in a GitHub action on Windows, which uses node version v18.18.0.

KevinEady commented 6 months ago

Hi @martenrichter ,

You mentioned a GitHub Actions run. Do you have a repository with this code we can look at?

martenrichter commented 6 months ago

Sure: https://github.com/fails-components/webtransport the code in question is here: https://github.com/fails-components/webtransport/blob/9e2309936d7f4273f1c903873c30789305037572/transports/http3-quiche/src/napialarmfactory.h#L137 You can run the tests in the root directory with:

npm install
npm dobuild
npm run test:node

make sure to uncomment the diagnosis code above. It is in the package https://github.com/fails-components/webtransport/tree/master/transports/http3-quiche which is called by the main package https://github.com/fails-components/webtransport/tree/master/main It may be that it only occurs with the windows-2022 image and not the windows-2019 image, as I had to change it. You find the cmake.txt here: https://github.com/fails-components/webtransport/blob/master/transports/http3-quiche/CMakeLists.txt So, it has a different clang version. But it is a very complex project. I am sorry that I could not distill it to a small sample, but yesterday, at the end of the day, I was very happy to locate the source of the problem.

KevinEady commented 6 months ago

Hi @martenrichter ,

I broke your issue down into a simple Node API method:

Napi::Value Method(const Napi::CallbackInfo &info) {
  Napi::Env env = info.Env();
  double timedelay = 200000;
  auto val = Napi::Value::From(env, timedelay);
  if (val.ToNumber().DoubleValue() != timedelay) {
    Napi::Error::New(env, "Val does not match! " +
                              std::to_string(val.ToNumber().DoubleValue()))
        .ThrowAsJavaScriptException();
    return Napi::Value();
  }
  return Napi::Boolean::New(env, true);
}

I cannot get this to throw the error on both Node 18.15.0 and 18.18.0 with GitHub Actions using windows-2019, windows-2022, and ubuntu-20.04.

Is your issue also present on other node versions? Specifically the latest 18.20.1 and 20.12.1?

martenrichter commented 6 months ago

I have tested it to 18.20.1 on windows, and it fails as with the earlier versions. (For linux it was never an issue). I have also copied the code directly to the library initialization method, and it fails also there immediately. So, an influence of the calling sequence can be ruled out. I would expect that it depends on the compiler. I use cmake-js with -t', 'ClangCL' (in version 14.0.4). The choice is caused by some libraries that do not build with the standard MSVC compiler.

martenrichter commented 6 months ago

Okay, I have tried to make a simple example, but it did not fail (but I think node-addon-api and/or cmake-js triggered a reinstall of build tools). My main project continued to fail in the initial Javascript call. After a complete rebuild, it does not fail anymore. So, I believe there must have been a version update in the background that fixed it. I will now try to see if it is also fixed in GitHub actions.

martenrichter commented 6 months ago

Here is the pull request, with removed workaround: https://github.com/fails-components/webtransport/pull/287 let's see if it fails.

martenrichter commented 6 months ago

Ok, it fails inside the github environment.

martenrichter commented 6 months ago

It worked because I accidentally switched back to the MS Visual Studio 2019 generator, but the problem only occurred for MS Visual Studio 2022. Anyway, here is a repo with a minimal example, including a GitHub action, that fails reproducible: https://github.com/martenrichter/bugnodeaddonapi and of course, it is unclear if it is a bug in node-addon-api, node, v8, or the compiler.

martenrichter commented 6 months ago

Here: https://github.com/nodejs/node-addon-api/blob/2f490a3b5fa3633820ad4a61bfac0697fa84a9b4/napi-inl.h#L900 The value is not updated, and the code never updates the result; I have placed some printfs around.

martenrichter commented 6 months ago

I have tested it against a freshly build node (git master branch from the weekend), and it also occurs. My v8 knowledge is limited but maybe I can add some debugging code into the calls to see what is happening.

KevinEady commented 5 months ago

Hi @martenrichter ,

Thanks for the repro code! Using windows-2022 + ClangCL definitely is the trigger for the error. I was able to install the necessary reproduction deps on my Windows dev machine here and can trigger the error as well.

This failure can be seen with the very simple Node-API C code:

#include <assert.h>
#include <node_api.h>

constexpr double val = 200000;

static napi_value Init(napi_env env, napi_value exports) {
  napi_status status;
  napi_value result;

  status = napi_create_double(env, val, &result);
  assert(status == napi_ok);

  return result;
}

NAPI_MODULE(test, Init)

This module exports a double, and we can see that running it returns garbage:

PS > node -pe "require('./build/Release/test.node')"
7.87488889988e-312
PS > node -pe "require('./build/Release/test.node')"
8.548248570993e-312
PS > node -pe "require('./build/Release/test.node')"
8.59998422933e-312
PS > node -pe "require('./build/Release/test.node')"
1.208837098321e-311

As you pointed out, duplicating the napi_create_double call results in the expected 200000 returned.

Here's a table with some condensed findings:

Visual Studio Clang Success? Notes
Visual Studio 16 2019 Clang 12.0.0 GitHub Action
Visual Studio 17 2022 Clang 16.0.5 Personal dev machine
Visual Studio 17 2022 Clang 17.0.3 GitHub Action

I'll bring this up in the next Node API weekly meeting. In the meantime, I'll try to run the test on different Clang versions and see at what point it breaks.

cc: @mhdawson @gabrielschulhof @vmoroz

martenrichter commented 5 months ago

Hi @KevinEady, I am not sure if the clang version is the cause or if it just uncovers the problem. I would argue that all the new compiler does is pass the arguments to the node part. I wonder if printing the value inside napi_create_double in a custom node build may help determine whether the compiler is the cause. If the value is correct, and it also fails, it may just surface because the compiler arranges data differently in memory, which is accessed by a wrong pointer or something.

martenrichter commented 5 months ago

I have modifed node.js with:

napi_status NAPI_CDECL napi_create_double(napi_env env,
                                          double value,
                                          napi_value* result) {
  printf("napi_create_double %lg\n", value);
  CHECK_ENV_NOT_IN_GC(env);
  CHECK_ARG(env, result);

  *result =
      v8impl::JsValueFromV8LocalValue(v8::Number::New(env->isolate, value));

  return napi_clear_last_error(env);
}

And it prints out the same garbage, which is later inside the result. So it seems like the compiler is not passing value properly. What I do not understand is that it works in a second call.

KevinEady commented 5 months ago

So I did a little more digging... Adding another piece to the puzzle:

It doesn't seem like it's the compiler version, but maybe how it's being invoked within cmake-js or the underlying MS ClangCL toolset?

When building the native module with cmake-js and ClangCL, we see the error... but if we switch to a "manual" compilation with clang++, it seems to work fine. I was unable to get clang++ nor clangcl to work with Ninja, so opted for a simple powershell script.

We see the proper value returned for the clang++ module (GitHub Action) but not for the clangcl module (GitHub Action).

FWIW, cmake and/or cmake-js uses a bunch of flags with ClangCL that i'm not sure of. When I build the generated project manually with msbuild, we see:

ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\Llvm\x64\bin\clang-cl.exe /c /I"C:\Users\pc\.cmake-js\node-x64\v20.8.0\include\node" /I"C:\Users\pc\Documents\Projects\vcpkg\installed\x64-windows\include" /nologo
   /W1 /WX- /diagnostics:column /O2 /Ob2 /D _WINDLL /D _MBCS /D WIN32 /D _WINDOWS /D NDEBUG /D "CMAKE_INTDIR=\"Release\"" /D test_clangcl_EXPORTS /EHsc /MT /GS /fp:precise /GR /Fo"test-clangcl.dir\Release\\" /Gd /TP -m64  D:\tmp\bugnode
  addonapi\main.cc "D:\tmp\bugnodeaddonapi\node_modules\cmake-js\lib\cpp\win_delay_load_hook.cc"
Link:
  C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\Llvm\x64\bin\lld-link.exe /OUT:"D:\tmp\bugnodeaddonapi\build\Release\test-clangcl.node" /INCREMENTAL:NO /LIBPATH:"C:\Users\pc\Documents\Projects\vcpkg\installed\x6
  4-windows\lib" /LIBPATH:"C:\Users\pc\Documents\Projects\vcpkg\installed\x64-windows\lib\manual-link" "C:\Users\pc\.cmake-js\node-x64\v20.8.0\win-x64\node.lib" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut
  32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /PDB:"D:/tmp/bugnodeaddonapi/build/Release/test-clangcl.pdb" /SUBSYSTEM:CONSOLE /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:/t
  mp/bugnodeaddonapi/build/Release/test-clangcl.lib"  /DLL "test-clangcl.dir\Release\main.obj"
  "test-clangcl.dir\Release\win_delay_load_hook.obj"
  test-clangcl.vcxproj -> D:\tmp\bugnodeaddonapi\build\Release\test-clangcl.node

Curious, very curious... 🤔

martenrichter commented 5 months ago

May be a stupid question, but does the clang++ build also use the msvc stdlib?

martenrichter commented 5 months ago

I have asked chatgpt to explain the compile options (so handle the output with care, often it is wrong...):

/MT: This specifies the use of the multithreaded, static version of the run-time library.

/GS: This option enables buffer security checks (stack buffer overrun detection).

/fp:precise: This specifies precise floating-point semantics.

/GR: This enables RTTI (Run-Time Type Information) support.

/Fo"test-clangcl.dir\Release\\": This specifies the name and location of the object file to be produced.

/Gd: This specifies the calling convention. /Gd specifies cdecl calling convention

I would test the options "/MT" using static multthreading runtime library, the /GS option may be also something, maybe /fp:precise: or /Gd the calling convention.

martenrichter commented 5 months ago

I have turned on the debugger and disassembly. I did not come far, since I have to stop now, but I have seen, that before the napi call ""D:\tmp\bugnodeaddonapi\node_modules\cmake-js\lib\cpp\win_delay_load_hook.cc" is invoked. May be this is bad guy, since it is missing in the other compile.

KevinEady commented 5 months ago

Hi @martenrichter,

I have been able to compile the native addon with command-line using cl, clang-cl, and clang++.

So you are on the right path: using the delay load import feature is causing the undefined behavior.

The linker links with this DELAYIMP.LIB file in order to use the /DELAYLOAD:NODE.EXE linker argument. Removing the /DELAYLOAD:NODE.EXE argument fixes the undefined behavior.

It makes sense that we wouldn't see it with clang++ since there is no similar concept of this.

However, I can't figure out why we see the problem with clang-cl and not cl. Passing the exact same arguments to cl is successful, but to clang-cl is a failure.

martenrichter commented 5 months ago

I have found the following bug report: https://github.com/llvm/llvm-project/issues/51941 It seems to match,, but it is already present in the 2019 release. Anyway, it's a good start for diagnosis.

KevinEady commented 5 months ago

Nice find @martenrichter ! I've written a bit on that issue. We'll see if it'll gain any traction.

So, I guess you have a few options:

  1. Not use the ClangCL toolset (which i suppose is not an option, since you mentioned "some libraries that do not build with the standard MSVC compiler")
  2. Use an in-code workaround (eg. like you did, but maybe place the napi_create_double() inside your module registration function? so it's "fixed" globally for your addon?)
  3. Remove the delayload from cmake-js. It looks like cmake-js added the delayload to be compatible with Electron 4.0 on Apr 2019: PR. This might be an option if you do not need to support Electron. You can either fork the module, or see if the maintainer would add (or accept a PR that adds) an option to bypass using delayload.

Thanks for all your help troubleshooting this week! Since this is not an issue with node-addon-api / Node-API itself, I believe we can close this issue.

martenrichter commented 5 months ago

@KevinEady Thanks for your help! Yes, I also think we can close the issue, and I hope it gets some attention, as this is unexpected behavior. I will recheck 1, but probably one of Google libs, such as quiche, boringssl, or abseil, does not work. So I will go 2 and see if I can do it more elegantly (I checked the library registration option earlier in the morning). 3 does not seem to be nice. I do not need it, but since I provide a new browser network protocol, there may be others who do (electron is not the only one; in principle, all alternatives to node.js also may need it).

And yes we can close the issue, node-addon-api/ Node-API is innocent, but it is really annoying that clang breaks the stable API.