napi-rs / napi-rs

A framework for building compiled Node.js add-ons in Rust via Node-API
https://napi.rs

NAPI custom gc segfault #2555

Open BlobMaster41 opened 2 months ago

BlobMaster41 commented 2 months ago

My program segfaults after ~30 minutes with a reference-deletion problem. Here is the full backtrace from gdb:

0x00000000013ea173 in v8::internal::GlobalHandles::Destroy(unsigned long*) ()
(gdb) bt
#0  0x00000000013ea173 in v8::internal::GlobalHandles::Destroy(unsigned long*) ()
#1  0x0000000000f06292 in v8impl::Reference::~Reference() ()
#2  0x0000000000f1158b in napi_delete_reference ()
#3  0x00007ffed4c5bfdf in napi::bindgen_runtime::module_register::custom_gc (env=0x7ffe2820f580, _js_callback=0x7ffe2806f4a8, _context=0x0, data=0x7ffc5aaebcc0) at src/bindgen_runtime/module_register.rs:634
#4  0x0000000000f2cbc9 in v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) ()
#5  0x0000000001d2bc43 in uv__async_io (loop=0x7ffe3fbff998, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:176
#6  0x0000000001d40974 in uv__io_poll (loop=loop@entry=0x7ffe3fbff998, timeout=<optimized out>) at ../deps/uv/src/unix/linux.c:1528
#7  0x0000000001d2c967 in uv_run (loop=0x7ffe3fbff998, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:448
#8  0x0000000000e726d6 in node::SpinEventLoopInternal(node::Environment*) ()
#9  0x00000000010aa387 in node::worker::Worker::Run() ()
#10 0x00000000010aa539 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#11 0x00007ffff789caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#12 0x00007ffff7929c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

It seems there is a problem with an async function cleaning up a buffer object in custom_gc:

https://github.com/napi-rs/napi-rs/blob/napi%402.16.17/crates/napi/src/bindgen_runtime/module_register.rs#L602

I have no easy way to replicate it, but it does happen.
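
For context, my rough understanding of the mechanism, as a simplified sketch of the idea rather than the actual napi-rs source (the names here are illustrative): a napi_ref can only be deleted on the thread that owns the env, so when a Rust value like Buffer is dropped off the JS thread, its ref is shipped back through the custom_gc threadsafe function and deleted there.

// Simplified sketch, not the real napi-rs internals: the custom_gc callback
// runs on the JS thread and deletes refs queued by Drop impls on other threads.
use napi::sys;

struct PendingRef {
  env: sys::napi_env,
  reference: sys::napi_ref,
}

// Invoked on the JS thread by the custom_gc threadsafe function.
unsafe fn delete_reference_on_js_thread(pending: PendingRef) {
  // This is the napi_delete_reference visible in frame #2 of the backtrace;
  // it destroys a V8 global handle, which can segfault if the worker's env
  // is already being torn down.
  sys::napi_delete_reference(pending.env, pending.reference);
}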

BlobMaster41 commented 1 month ago

After some more investigation, I can confirm that this is the problem: I recompiled napi from source with the entire custom_gc function removed, and the program no longer segfaults.

After ~2 million calls, my program now leaks about 53 GB of memory. Not perfect, but at least it prevents the fatal segfault.

This is a critical issue, and I cannot identify the cause of this behavior.

You may take a look at my project source:

https://github.com/btc-vision/op-vm

Brooooooklyn commented 1 month ago

It should be fixed in 3.0.0-alpha, can you upgrade and test again? @BlobMaster41
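
Something along these lines in Cargo.toml should pull it in (a sketch only; the exact prerelease pin and feature flags depend on your project):

# Hypothetical dependency bump; adjust the features to whatever you use today.
[dependencies]
napi = { version = "3.0.0-alpha", features = ["async"] }
napi-derive = "3.0.0-alpha"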

BlobMaster41 commented 1 month ago

Will try it asap, I need to make it compatible with NAPI 3.

BlobMaster41 commented 4 weeks ago

Hey @Brooooooklyn, I have converted my VM to napi3 and I'm hitting a problem I don't understand. It's coming from napi3.

When calling a threadsafe function from Rust, I get the following result no matter what I do:

Error { status: "Ok", reason: "" } Error calling tsfn function: Ok

You may take a look at my implementation:

https://github.com/btc-vision/op-vm/pull/118

It is critical that I find the problem before going into production.

Thanks for your help!

Brooooooklyn commented 4 weeks ago

@BlobMaster41 how can I reproduce it in your project? Can you give me the reproduction steps?

BlobMaster41 commented 3 weeks ago

Hey @Brooooooklyn yes.

You can easily replicate the problem by following these specific steps:

  1. Create a folder
  2. Clone https://github.com/btc-vision/op-vm/tree/merge/napi3 and make sure you are on the merge/napi3 branch
  3. Run npm i && npm run build in the cloned project
  4. Go back to the root folder you created earlier
  5. Clone https://github.com/btc-vision/unit-test-framework/tree/upgrade/napi3 and make sure you are on the upgrade/napi3 branch
  6. Run npm i && npm run build in that folder
  7. Run npm run build:test-contract
  8. Run npm run test:memory or npm run test:call-depth

If one of the tests fails with:

[OPNetUnit DEBUG]: Running test: Call depth tests - should fail to do more nested calls than the maximum allowed
ExitData::to_napi_value
[GenericExternalFunction] Executing with data: [0, 0, 0, 0, 1, 248, 240, 149, 0, 0, 0, 1, 119, 29, 114, 31, 164, 121, 175, 78, 70, 168, 44, 87, 112, 196, 115, 227, 100, 118, 243, 46, 77, 231, 51, 215, 18, 218, 214, 95, 143, 146, 48, 208, 0, 0, 0, 8, 97, 245, 7, 2, 0, 0, 0, 199]
Error { status: "Ok", reason: "" }
Error calling tsfn function: Ok
Error: RuntimeError:
    at <unnamed> (<module>[79]:0x1b32)
    at <unnamed> (<module>[81]:0x1e17)
    at <unnamed> (<module>[69]:0x187d)

Caused by:

Root cause: RuntimeStringError { details: "" }

then that's the problem. I added println! calls that log the result returned by tsfn.call_async.

You can find the relevant code in the op-vm project:

let fut = async move {
    println!(
        "[GenericExternalFunction] Executing with data: {:?}",
        request.buffer
    );

    // Call the threadsafe function; this is where the bogus Ok error shows up.
    let promise = tsfn.call_async(Ok(request)).await;
    let promise = match promise {
        Ok(promise) => promise,
        Err(e) => {
            println!("{:?}", e);
            println!("Error calling tsfn function: {}", e);
            return Err(RuntimeError::new(e.reason));
        }
    };

    // Await the JS promise returned by the callback.
    let buffer = promise.await.map_err(|e| {
        println!("Error awaiting promise: {}", e);
        RuntimeError::new(e.reason)
    })?;

    Ok(buffer.to_vec())
};

runtime.block_on(fut)

If another problem comes up after this issue is fixed, let me know; I don't know whether the tests will still pass under napi3.

BlobMaster41 commented 3 weeks ago

Good news: I located the cause of the Ok status.

The problem comes from threadsafe_function.rs.

Here is the problematic code:

      if let ThreadsafeFunctionCallVariant::WithCallback = call_variant {
        // throw Error in JavaScript callback
        let callback_arg = if status == sys::Status::napi_pending_exception {
          let mut exception = ptr::null_mut();
          status = unsafe { sys::napi_get_and_clear_last_exception(raw_env, &mut exception) };
          let mut error_reference = ptr::null_mut();
          unsafe { sys::napi_create_reference(raw_env, exception, 1, &mut error_reference) };
          Err(Error {
            maybe_raw: error_reference,
            maybe_env: raw_env,
            raw: true,
            status: Status::from(status),
            reason: "".to_owned(),
          })
        } else {
          unsafe { Return::from_napi_value(raw_env, return_value) }
        };
        if let Err(err) = callback(callback_arg, Env::from_raw(raw_env)) {
          unsafe { sys::napi_fatal_exception(raw_env, JsError::from(err).into_value(raw_env)) };
        }
      }
      status
    }

If you log

Error {
            maybe_raw: error_reference,
            maybe_env: raw_env,
            raw: true,
            status: Status::from(status),
            reason: "".to_owned(),
          }

It will always log "Ok", because by this point status has been overwritten with the return code of napi_get_and_clear_last_exception (which succeeds), rather than the original napi_pending_exception. But if you do:

if let ThreadsafeFunctionCallVariant::WithCallback = call_variant {
        // throw Error in JavaScript callback
        let callback_arg = if status == sys::Status::napi_pending_exception {
          let mut exception = ptr::null_mut();
          unsafe { sys::napi_get_and_clear_last_exception(raw_env, &mut exception) };

          let mut error_ref = ptr::null_mut();
          status = unsafe { sys::napi_create_reference(raw_env, exception, 1, &mut error_ref) };

          let err: Error = unsafe {
            JsUnknown::from_raw_unchecked(raw_env, exception)
          }.into();

          println!("callback error: {}", err);

          let err = Error {
            maybe_raw: error_ref,
            maybe_env: raw_env,
            raw: true,
            status: Status::from(status),
            reason: String::new(),
          };

          Err(err)
        } else {
          unsafe { Return::from_napi_value(raw_env, return_value) }
        };

        if let Err(err) = callback(callback_arg, Env::from_raw(raw_env)) {
          println!("callback returned error: {:?}", err);
          unsafe { sys::napi_fatal_exception(raw_env, JsError::from(err).into_value(raw_env)) };
        }
      }
      status
    }

Now you can see this in the console:

callback error: GenericFailure, TypeError: Cannot read properties of null (reading 'buffer')

This makes me think that something is wrong in the snippet I sent for napi 3. Is there a problem in the error management?
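
A minimal sketch of the direction I would expect a fix to take (my guess only, not an actual upstream patch): capture the original failure status before it gets overwritten by the follow-up napi calls:

// Hypothetical fix sketch: remember the original napi_pending_exception
// status, because the assignment below overwrites `status` with the return
// code of the cleanup call, which is usually Ok.
let original_status = status;
let mut exception = ptr::null_mut();
status = unsafe { sys::napi_get_and_clear_last_exception(raw_env, &mut exception) };
let mut error_reference = ptr::null_mut();
unsafe { sys::napi_create_reference(raw_env, exception, 1, &mut error_reference) };
Err(Error {
  maybe_raw: error_reference,
  maybe_env: raw_env,
  raw: true,
  // Report the original failure instead of the Ok from the call above.
  status: Status::from(original_status),
  reason: "".to_owned(),
})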

BlobMaster41 commented 3 weeks ago

If I do

if let ThreadsafeFunctionCallVariant::WithCallback = call_variant {
        // throw Error in JavaScript callback
        let callback_arg = if status == sys::Status::napi_pending_exception {
          let mut exception = ptr::null_mut();
          unsafe { sys::napi_get_and_clear_last_exception(raw_env, &mut exception) };

          let mut error_ref = ptr::null_mut();
          status = unsafe { sys::napi_create_reference(raw_env, exception, 1, &mut error_ref) };

          let err: Error = unsafe {
            JsUnknown::from_raw_unchecked(raw_env, exception)
          }.into();

          Err(err)
        } else {
          unsafe { Return::from_napi_value(raw_env, return_value) }
        };

        if let Err(err) = callback(callback_arg, Env::from_raw(raw_env)) {
          println!("callback returned error: {:?}", err);
          unsafe { sys::napi_fatal_exception(raw_env, JsError::from(err).into_value(raw_env)) };
        }
      }
      status
    }

I get the error in JS as well, but I don't know whether this is a valid fix, since I didn't write napi; a different error could slip through if something else goes wrong.

BlobMaster41 commented 3 weeks ago

On another note, once I applied this patch and corrected the JS error, I can run my unit tests.

One observation so far:

It hangs for a couple of seconds after execution completes.

I didn't have this issue on napi 2.7.

I will check whether it still segfaults.

BlobMaster41 commented 3 weeks ago

I can confirm it still segfaults.

I will investigate with gdb and post the new trace.

Brooooooklyn commented 3 weeks ago

@BlobMaster41 I cannot get npm run test-contract to work:

[screenshot]

BlobMaster41 commented 3 weeks ago

Hey @Brooooooklyn, sorry! I had another local dependency I forgot to change. I pushed to the same branch again for the unit-test-framework repo. It should work now; please run npm i and try again!

Please note that you will run into the napi3 error-handling issue explained in my previous comments.

I patched this problem locally as described in my comments. I don't know if that is the correct solution, but napi3 hangs for some reason. You will see what I mean once you have it working.

--- ANOTHER ISSUE, UNRELATED TO THE NAPI3 ERROR ISSUE (it still segfaults under napi3 as well) ---

To check the hanging, please switch to the following branches. (I fixed the problem the merge/napi3 branch had; a JS bug on my side was what blocked the napi3 upgrade, and chasing it surfaced another problem in napi3.)

Switch to: op-vm -> error/fix-napi3-error-handling, unit-test-framework -> napi3-fix-test

You should see the hang after running the same tests. For some reason it hangs longer on Intel than on AMD.

BlobMaster41 commented 3 weeks ago

I collected the backtrace on NAPI3:

Thread 19 "node" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe87fff700 (LWP 1707879)]
0x0000000001363f23 in v8::internal::GlobalHandles::Destroy(unsigned long*) ()
(gdb) bt
#0  0x0000000001363f23 in v8::internal::GlobalHandles::Destroy(unsigned long*) ()
#1  0x0000000000ee9f32 in v8impl::Reference::~Reference() ()
#2  0x0000000000ef511f in napi_delete_reference ()
#3  0x00007ffdbd5eecf5 in ?? () from /root/op-vm/op-vm.linux-x64-gnu.node
#4  0x0000000000f111c9 in v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) ()
#5  0x0000000001ca72f3 in uv__async_io (loop=0x7ffe87ffe9c8, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:176
#6  0x0000000001cbce64 in uv__io_poll (loop=loop@entry=0x7ffe87ffe9c8, timeout=<optimized out>) at ../deps/uv/src/unix/linux.c:1564
#7  0x0000000001ca8017 in uv_run (loop=0x7ffe87ffe9c8, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:458
#8  0x0000000000e526d6 in node::SpinEventLoopInternal(node::Environment*) ()
#9  0x000000000109ab47 in node::worker::Worker::Run() ()
#10 0x000000000109acf9 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#11 0x00007ffff7c51609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007ffff7b76353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Brooooooklyn commented 3 weeks ago

@BlobMaster41 I just remembered: this error is not related to NAPI-RS; it occurs when a hard-linked .node file is loaded in Node.js.

https://stackoverflow.com/questions/45954861/how-to-circumvent-dlopen-caching#:~:text=5-,POSIX%20says%3A,-Only%20a%20single

Brooooooklyn commented 3 weeks ago

I've upgraded your project to the NAPI-RS beta; here is the PR: https://github.com/btc-vision/op-vm/pull/121

beta.2 improves the error statuses and messages:

[screenshot]

BlobMaster41 commented 3 weeks ago

Hey @Brooooooklyn, thanks for the error patch. I have a working branch that you can try:

Switch to: op-vm -> error/fix-napi3-error-handling, unit-test-framework -> napi3-fix-test

I noticed that after switching from napi 2.7 to napi3 the program hangs: when it is done running, it takes about ~10 seconds to fully stop. I had the same issue on napi 2.7 and resolved it by calling abort on the tsfn, but on NAPI3 I think the drop is now automatic?

Either way, with napi3 it now hangs.

After running intensive tests, it still segfaults.

What I am wondering is: what if two libraries use napi? Could that be the cause of the issue?

Our program uses op-vm and https://github.com/btc-vision/rust-merkle-tree to generate merkle trees, and that library also uses napi.

The segfault happens in op-vm, but could something from rust-merkle-tree break op-vm's GC handling?

I will provide a new gdb dump using napi3 beta.2.

Brooooooklyn commented 3 weeks ago

@BlobMaster41 If you don't want the ThreadsafeFunction to keep your program alive, you can declare the ThreadsafeFunction as Weak; change Weak to true here:

[screenshot]

> Our program uses op-vm and https://github.com/btc-vision/rust-merkle-tree to generate merkle trees, and that library also uses napi.

No, that shouldn't be a problem; we have several servers that depend on 4 to 5 NAPI-RS libraries, and they have been running stably for years.

Your segfault is caused by hard links and the dlopen cache, as described in https://stackoverflow.com/questions/45954861/how-to-circumvent-dlopen-caching#:~:text=5-,POSIX%20says%3A,-Only%20a%20single

When you declare a dependency like this, npm creates a hard link to the project inside node_modules:

[screenshot]

If you want to avoid the segfault, you can copy the dist into node_modules under unit-test-framework manually rather than declaring it as a file:.. protocol dependency.

BlobMaster41 commented 3 weeks ago

> If you want to avoid the segfault, you can copy the dist into node_modules under unit-test-framework manually rather than declaring it as a file:.. protocol dependency.

Hey, thanks for responding. The ../op-vm is only for dev; in production it's set to @btc-vision/op-vm and it still segfaults.

I don't think that's the issue.

BlobMaster41 commented 3 weeks ago

Could it be caused by our use of worker threads?

Brooooooklyn commented 3 weeks ago

> Could it be caused by our use of worker threads?

Yes, it could be. The biggest problem is that I can't reliably reproduce this segfault, so I can't debug it.
Also, which Node version is the program that hit the segfault running on, and what operating system and CPU model are you using?

BlobMaster41 commented 3 weeks ago

Hey @Brooooooklyn, the segfault happens on my PC and on my servers, which use quite different CPUs; I'm on Intel and my servers run various AMD models.

To reproduce the segfault, I would have to walk you through it. It's in the opnet-node project, but it requires some specific configuration. Do you have a chat like Telegram or Discord where I can reach you?

Brooooooklyn commented 2 weeks ago

@BlobMaster41 I can confirm it is caused by worker_threads; I can reproduce it in the unit test here: https://github.com/napi-rs/napi-rs/blob/main/examples/napi/__tests__/worker-thread.spec.ts#L55

If I change the unit test from worker_threads to normal Node.js main-thread code, the issue goes away.

Maybe related: https://github.com/nodejs/node/issues/55706

BlobMaster41 commented 2 weeks ago

@Brooooooklyn What's the next step from here? I need worker threads.

BlobMaster41 commented 2 weeks ago

OK, I tried a few things and noticed that if I change everything from Buffer to Uint8Array it no longer segfaults, but Node.js crashes instead:

#
# Fatal error in , line 0
# Check failed: node->IsInUse().
#
#
#
#FailureMessage Object: 0x7fe477fba1e0
----- Native stack trace -----

 1: 0xfe3191  [node]
 2: 0x279da3b V8_Fatal(char const*, ...) [node]
 3: 0x1363ff9 v8::internal::GlobalHandles::Destroy(unsigned long*) [node]
 4: 0xe51672 node::CallbackScope::~CallbackScope() [node]
 5: 0xf1123a  [node]
 6: 0x1ca72f3  [node]
 7: 0x1cbce64  [node]
 8: 0x1ca8017 uv_run [node]
 9: 0xe526d6 node::SpinEventLoopInternal(node::Environment*) [node]
10: 0x109ab47 node::worker::Worker::Run() [node]
11: 0x109acf9  [node]
12: 0x7fe9818a9609  [/lib/x86_64-linux-gnu/libpthread.so.0]
13: 0x7fe9817ce353 clone [/lib/x86_64-linux-gnu/libc.so.6]
Trace/breakpoint trap (core dumped)

BlobMaster41 commented 2 weeks ago

Even converting everything to strings, sending strings from Node.js to napi and from napi to Node.js, results in a fatal segfault after a while.

BlobMaster41 commented 2 weeks ago

Another, older alternative to worker_threads is cluster. Do you think cluster could work instead? I could switch my code from worker_threads to cluster temporarily until this is resolved.

Brooooooklyn commented 2 weeks ago

@BlobMaster41 I created a simple example to demonstrate how to maintain API consistency while avoiding the use of ThreadsafeFunction in worker_threads: https://github.com/Brooooooklyn/threadsafe_function_in_woker_threads_workaround

Brooooooklyn commented 1 week ago

@BlobMaster41 can you try napi@3.0.0-beta.4? I've made some workarounds for the ThreadsafeFunction usage.

BlobMaster41 commented 1 week ago

> @BlobMaster41 can you try napi@3.0.0-beta.4? I've made some workarounds for the ThreadsafeFunction usage.

Thanks! Give me an hour and I'll try that. Hopefully it resolves the issue :D

BlobMaster41 commented 1 week ago

Give me a bit, I have so many errors...

Brooooooklyn commented 1 week ago

@BlobMaster41 there is a breaking change here: https://github.com/napi-rs/napi-rs/pull/2672. It changes the ThreadsafeFunction signature.

BlobMaster41 commented 1 week ago

> @BlobMaster41 there is a breaking change here: #2672. It changes the ThreadsafeFunction signature.

Yeah, just fixed everything. I'm trying it now.

BlobMaster41 commented 1 week ago

Hey @Brooooooklyn, do you know why this is failing? https://github.com/btc-vision/op-vm/actions/runs/15458050738/job/43514025151?pr=122

BlobMaster41 commented 1 week ago

@Brooooooklyn Still segfaults, sadly...

This time I got a super long backtrace.

#0  0x0000000001363f23 in v8::internal::GlobalHandles::Destroy(unsigned long*) ()
#1  0x0000000000ee9f32 in v8impl::Reference::~Reference() ()
#2  0x0000000000ef511f in napi_delete_reference ()
#3  0x00007ffd6f259dac in <napi::bindgen_runtime::js_values::buffer::Buffer as core::ops::drop::Drop>::drop (self=0x7ffcb7ff9ad0) at src/bindgen_runtime/js_values/buffer.rs:346
#4  0x00007ffd6f25906b in core::ptr::drop_in_place<napi::bindgen_runtime::js_values::buffer::Buffer> () at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ptr/mod.rs:523
#5  0x00007ffd6f156cb3 in core::ptr::drop_in_place<core::option::Option<napi::bindgen_runtime::js_values::buffer::Buffer>> () at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ptr/mod.rs:523
#6  0x00007ffd6f1f284e in op_vm::interfaces::napi::js_contract_manager::ContractManager::instantiate (self=0x7ed939e54d40, reserved_id=..., address=..., bytecode=..., used_gas=..., max_gas=..., memory_pages_used=...,
    network=op_vm::interfaces::napi::bitcoin_network_request::BitcoinNetworkRequest::Testnet, is_debug_mode=false, return_proofs=false) at src/interfaces/napi/js_contract_manager.rs:393
#7  0x00007ffd6f0de07b in op_vm::interfaces::napi::js_contract_manager::__napi_impl_helper_ContractManager_0::_napi_internal_register_instantiate::{{closure}} (cb=...) at src/interfaces/napi/js_contract_manager.rs:208
#8  0x00007ffd6f0cf9ed in core::result::Result<T,E>::and_then (self=..., op=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/result.rs:1353
#9  0x00007ffd6f1f6874 in op_vm::interfaces::napi::js_contract_manager::__napi_impl_helper_ContractManager_0::_napi_internal_register_instantiate (env=0x7ffca02a3df0, cb=0x7ffcb7ff9e00) at src/interfaces/napi/js_contract_manager.rs:208
#10 0x0000000000ee9da5 in v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) ()
#11 0x00007ffc9fe0f745 in ?? ()
#12 0x00007ffcb7ff9e80 in ?? ()
#13 0x00007ffcb7ff9eb8 in ?? ()
#14 0x0000000000000009 in ?? ()
#15 0x0000000000000080 in ?? ()
#16 0x00007ffcb7ff9e40 in ?? ()
#17 0x0000000000000006 in ?? ()
#18 0x00007ffcb7ff9f60 in ?? ()
#19 0x00007ffc8040c39f in ?? ()
#20 0x000039cfea2a8819 in ?? ()
#21 0x00007ffca0002000 in ?? ()
#22 0x00003e9da8d80069 in ?? ()
#23 0x00003e9da8d80069 in ?? ()
#24 0x000014769610e4e9 in ?? ()
#25 0x00003e9da8d80069 in ?? ()
#26 0x000039cfea2a8819 in ?? ()
#27 0x0000370477847539 in ?? ()
#28 0x0000370477845891 in ?? ()
#29 0x0000370477847609 in ?? ()
#30 0x0000370477847689 in ?? ()
#31 0x00003704778476c1 in ?? ()
#32 0x00003704778476f1 in ?? ()
#33 0x0000000100000000 in ?? ()
#34 0x00003e9da8d800d9 in ?? ()
#35 0x00003e9da8d800d9 in ?? ()
#36 0x0000370477847689 in ?? ()
#37 0x00003704778476c1 in ?? ()
#38 0x0000370477847609 in ?? ()
#39 0x0000370477845891 in ?? ()
#40 0x00003704778476f1 in ?? ()
#41 0x0000370477847539 in ?? ()
#42 0x0000000100000000 in ?? ()
#43 0x000039cfea2a8819 in ?? ()
#44 0x000035b75c5c2ab1 in ?? ()
#45 0x0000000000000001 in ?? ()
#46 0x000014769610efd1 in ?? ()
#47 0x000035b75c5c2ab1 in ?? ()
#48 0x00007ffcb7ff9fc8 in ?? ()
#49 0x00007ffc804db824 in ?? ()
#50 0x0000370477847411 in ?? ()
#51 0x00003e9da8d80109 in ?? ()
#52 0x00007ffc9fe0a702 in ?? ()
#53 0x000019172f762d19 in ?? ()
#54 0x000039cfea2a8819 in ?? ()
#55 0x000039cfea2a8819 in ?? ()
#56 0x0000370477847411 in ?? ()
#57 0x00002d839551da39 in ?? ()
#58 0x0000000000000002 in ?? ()
#59 0x000035b75c5d4019 in ?? ()
#60 0x00002d839551da39 in ?? ()
#61 0x00007ffcb7ffa050 in ?? ()
#62 0x00007ffc804cd9b3 in ?? ()
#63 0x00003704778466d1 in ?? ()
#64 0x0000370477846db1 in ?? ()
#65 0x0000000000000022 in ?? ()
#66 0x0000370477846bd1 in ?? ()
#67 0x00003704778466d1 in ?? ()
#68 0x00002d839551e109 in ?? ()
#69 0x0000370477846db1 in ?? ()
#70 0x00002d839551e109 in ?? ()
#71 0x0000370477845a51 in ?? ()
#72 0x0000370477845b71 in ?? ()
#73 0x0000370477846d59 in ?? ()
#74 0x000035b75c5c3db9 in ?? ()
#75 0x0000000000000002 in ?? ()
#76 0x000035b75c5c3df1 in ?? ()
#77 0x00002d839551e109 in ?? ()
#78 0x00007ffcb7ffa098 in ?? ()
#79 0x00007ffc9fe4c9c3 in ?? ()
#80 0x000021d3fb382501 in ?? ()
#81 0x000019172f7792c1 in ?? ()
#82 0x00007ffc9ff2bc8b in ?? ()
#83 0x00007ffcd80a8220 in ?? ()
#84 0x0000000000000002 in ?? ()
#85 0x0000370477845fd1 in ?? ()
#86 0x0000370477845fa9 in ?? ()
#87 0x00007ffcb7ffa0d0 in ?? ()
#88 0x00007ffc9ff2b275 in ?? ()
#89 0x000030689c5c1189 in ?? ()
#90 0x00003704778466d1 in ?? ()
#91 0x0000370477845fa9 in ?? ()
#92 0x00003e9da8d80069 in ?? ()
#93 0x0000000000000022 in ?? ()
#94 0x00007ffcb7ffa138 in ?? ()
#95 0x00007ffc9fe3c919 in ?? ()
#96 0x00007ff9b0026050 in ?? ()
#97 0x00007ffca02a3df0 in ?? ()
#98 0x0000000000000054 in ?? ()
#99 0x00007ffcd80ff260 in ?? ()
#100 0x0000000000000054 in ?? ()
#101 0x00003e9da8d80069 in ?? ()
#102 0x0000370477845fa9 in ?? ()
#103 0x0000000000000001 in ?? ()
#104 0x000030689c5c1231 in ?? ()
#105 0x00007ffca0016120 in ?? ()
#106 0x0000000000000022 in ?? ()
#107 0x00007ffcb7ffa1a0 in ?? ()
#108 0x00007ffc9fe0b403 in ?? ()
#109 0x0000000000000000 in ?? ()

Interestingly enough, it seems to come from:

#3  0x00007ffd6f259dac in <napi::bindgen_runtime::js_values::buffer::Buffer as core::ops::drop::Drop>::drop (self=0x7ffcb7ff9ad0) at src/bindgen_runtime/js_values/buffer.rs:346

Do we have the actual cause of the segfault now?

BlobMaster41 commented 1 week ago

@Brooooooklyn I think I'm onto something:

https://github.com/btc-vision/napi3-rs/commit/cac0a906b043d368b7f929433ca9d1b92a025caa

It hasn't crashed in ~30 minutes while I've been spamming a bunch of drops and refs across threads. Please take a look.

BlobMaster41 commented 1 week ago

anddddddddddd boom, after a LOT of patching it now PANICS instead of SEGFAULTING.

Panic occurred: PanicHookInfo { payload: Any { .. }, location: Location { file: "/root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/bindgen_runtime/js_values/buffer.rs", line: 29, col: 7 }, can_unwind: true, force_no_backtrace: false }

here:

#[cfg(all(debug_assertions, not(windows)))]
#[inline]
pub fn register_backing_ptr(ptr: *mut u8) {
  if ptr.is_null() {
    return;
  } // 0-length buffers use NULL
  BUFFER_DATA.with(|buffer_data| {
    let mut set = buffer_data.lock().unwrap();
    if !set.insert(ptr) {
      panic!(
        "Share the same data between different buffers is not allowed, \
                    see: https://github.com/nodejs/node/issues/32463#issuecomment-631974747"
      );
    }
  });
}

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7a79859 in __GI_abort () at abort.c:79
#2  0x00007ffd4da0e3fa in std::sys::pal::unix::abort_internal () at library/std/src/sys/pal/unix/mod.rs:367
#3  0x00007ffd4da0b21f in std::panicking::rust_panic () at library/std/src/rt.rs:50
#4  0x00007ffd4da0afd2 in std::panicking::rust_panic_with_hook () at library/std/src/panicking.rs:856
#5  0x00007ffd4da0ac46 in std::panicking::begin_panic_handler::{{closure}} () at library/std/src/panicking.rs:697
#6  0x00007ffd4da09959 in std::sys::backtrace::__rust_end_short_backtrace () at library/std/src/sys/backtrace.rs:168
#7  0x00007ffd4da0a90d in rust_begin_unwind () at library/std/src/panicking.rs:695
#8  0x00007ffd4ce9be90 in core::panicking::panic_fmt () at library/core/src/panicking.rs:75
#9  0x00007ffd4d0585fc in napi::bindgen_runtime::js_values::buffer::register_backing_ptr::{{closure}} (buffer_data=0x7ffcd8152a90) at src/bindgen_runtime/js_values/buffer.rs:29
#10 0x00007ffd4d05057c in std::thread::local::LocalKey<T>::try_with (self=0x7ffd4dd7fbc0, f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/thread/local.rs:310
#11 0x00007ffd4d050234 in std::thread::local::LocalKey<T>::with (self=0x7ffd4dd7fbc0, f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/thread/local.rs:274
#12 0x00007ffd4cf12e73 in napi::bindgen_runtime::js_values::buffer::register_backing_ptr (ptr=0x7ffa04008b50 "cs\235\\") at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/bindgen_runtime/js_values/buffer.rs:26
#13 0x00007ffd4ce9cc0b in napi::bindgen_runtime::js_values::buffer::BufferSlice::copy_from (env=0x7ffcefff9448, data=...) at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/bindgen_runtime/js_values/buffer.rs:233
#14 0x00007ffd4cf24bdc in <op_vm::domain::runner::exit_data::ExitData as napi::bindgen_runtime::js_values::ToNapiValue>::to_napi_value (env_raw=0x7ffcd808ed00, val=...) at src/domain/runner/exit_data.rs:40
#15 0x00007ffd4cf484b5 in napi::env::Env::spawn_future::{{closure}} (env=0x7ffcd808ed00, val=<error reading variable: Cannot access memory at address 0x0>) at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/env.rs:1166
#16 0x00007ffd4cf59ed0 in napi::tokio_runtime::SendableResolver<Data,R>::resolve (self=..., env=0x7ffcd808ed00, data=...) at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/tokio_runtime.rs:193
#17 0x00007ffd4cf59c0b in napi::tokio_runtime::execute_tokio_future::{{closure}}::{{closure}} (env=...) at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/tokio_runtime.rs:234
#18 0x00007ffd4cee2e68 in napi::js_values::deferred::napi_resolve_deferred::{{closure}} (resolver=...) at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/js_values/deferred.rs:247
#19 0x00007ffd4cfc180b in core::result::Result<T,E>::and_then (self=..., op=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/result.rs:1353
#20 0x00007ffd4cee20f1 in napi::js_values::deferred::napi_resolve_deferred (env=0x7ffcd808ed00, _js_callback=0x0, context=0x7ffcd8e3b690, data=0x7fc784002200)
    at /root/.cargo/git/checkouts/napi3-rs-870ea236a9e912ac/c196918/crates/napi/src/js_values/deferred.rs:245
#21 0x0000000000f111c9 in v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) ()
#22 0x0000000001ca72f3 in uv__async_io (loop=0x7ffcefffe9c8, w=<optimized out>, events=<optimized out>) at ../deps/uv/src/unix/async.c:176
#23 0x0000000001cbce64 in uv__io_poll (loop=loop@entry=0x7ffcefffe9c8, timeout=<optimized out>) at ../deps/uv/src/unix/linux.c:1564
#24 0x0000000001ca8017 in uv_run (loop=0x7ffcefffe9c8, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:458
#25 0x0000000000e526d6 in node::SpinEventLoopInternal(node::Environment*) ()
#26 0x000000000109ab47 in node::worker::Worker::Run() ()
#27 0x000000000109acf9 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#28 0x00007ffff7c51609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007ffff7b76353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

BlobMaster41 commented 1 week ago

New segfault again... it's so random.

   #0  0x0000000000ef9fdf in std::pair<std::__detail::_Node_iterator<v8impl::RefTracker*, true, false>, bool> std::_Hashtable<v8impl::RefTracker*, v8impl::RefTracker*, std::allocator<v8impl::RefTracker*>, std::__detail::_Identity, std::equal_to<v8impl::RefTracker*>, std::hash<v8impl::RefTracker*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_emplace<v8impl::RefTracker*&>(std::integral_constant<bool, true>, v8impl::RefTracker*&) ()
#1  0x0000000000f16d4f in node_napi_env__::EnqueueFinalizer(v8impl::RefTracker*) ()
#2  0x0000000000ef95b6 in node_api_post_finalizer ()
#3  0x00007ffd4f25ab7f in <napi::bindgen_runtime::js_values::buffer::Buffer as core::ops::drop::Drop>::drop (self=0x7ed4e49df910) at src/bindgen_runtime/js_values/buffer.rs:404
#4  0x00007ffd4f277a6b in core::ptr::drop_in_place<napi::bindgen_runtime::js_values::buffer::Buffer> () at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ptr/mod.rs:523
#5  0x00007ffd4f13ee46 in <op_vm::interfaces::napi::external_functions::generic_external_function::GenericExternalFunction<napi::bindgen_runtime::js_values::promise::Promise<napi::bindgen_runtime::js_values::buffer::Buffer>> as op_vm::interfaces::napi::external_functions::external_function::ExternalFunction>::execute::{{closure}} () at src/interfaces/napi/external_functions/generic_external_function.rs:76
#6  0x00007ffd4f174bae in tokio::runtime::park::CachedParkThread::block_on::{{closure}} () at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284
#7  0x00007ffd4f16ef97 in tokio::task::coop::with_budget (budget=..., f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:167
#8  tokio::task::coop::budget (f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:133
#9  tokio::runtime::park::CachedParkThread::block_on (self=0x7ed4e49dfd87, f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284
#10 0x00007ffd4f1a9c82 in tokio::runtime::context::blocking::BlockingRegionGuard::block_on (self=0x7ed4e49e0000, f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/blocking.rs:66
#11 0x00007ffd4f13e2a1 in tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} (blocking=0x7ed4e49e0000) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:87
#12 0x00007ffd4f1af819 in tokio::runtime::context::runtime::enter_runtime (handle=0x7ffcd887d550, allow_block_in_place=true, f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/runtime.rs:65
#13 0x00007ffd4f13df50 in tokio::runtime::scheduler::multi_thread::MultiThread::block_on (self=0x7ffcd887d528, handle=0x7ffcd887d550, future=...)
    at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:86
#14 0x00007ffd4f10bde7 in tokio::runtime::runtime::Runtime::block_on_inner (self=0x7ffcd887d520, future=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:358
#15 0x00007ffd4f10d8ae in tokio::runtime::runtime::Runtime::block_on (self=0x7ffcd887d520, future=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:330
#16 0x00007ffd4f13d0bf in <op_vm::interfaces::napi::external_functions::generic_external_function::GenericExternalFunction<napi::bindgen_runtime::js_values::promise::Promise<napi::bindgen_runtime::js_values::buffer::Buffer>> as op_vm::interfaces::napi::external_functions::external_function::ExternalFunction>::execute (self=0x7ffcd8156dc8, data=..., runtime=0x7ffcd887d520) at src/interfaces/napi/external_functions/generic_external_function.rs:78
#17 0x00007ffd4f119654 in <op_vm::interfaces::napi::external_functions::storage_load_external_function::StorageLoadExternalFunction as op_vm::interfaces::napi::external_functions::external_function::ExternalFunction>::execute (self=0x7ffcd8156dc8,
    data=..., runtime=0x7ffcd887d520) at src/interfaces/napi/external_functions/storage_load_external_function.rs:45
#18 0x00007ffd4f1b3351 in op_vm::domain::runner::import_functions::storage_load_import::StorageLoadImport::execute (context=..., key_ptr=18352, result_ptr=37936) at src/domain/runner/import_functions/storage_load_import.rs:31
#19 0x00007ffd4f0f351a in core::ops::function::Fn::call () at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ops/function.rs:79
#20 0x00007ffd4f1d51a0 in wasmer::backend::sys::entities::function::gen_fn_callback_s2::func_wrapper::{{closure}}::{{closure}} () at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wasmer-6.0.1/src/backend/sys/entities/function/mod.rs:600
#21 0x00007ffd4f0f5e49 in core::ops::function::FnOnce::call_once () at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/ops/function.rs:250
#22 0x00007ffd4f141601 in <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/panic/unwind_safe.rs:272
#23 0x00007ffd4f1a3f84 in std::panicking::try::do_call (data=0x7ed4e49e0ed0 "\300\250\032L\375\177") at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panicking.rs:587
#24 0x00007ffd4f17bd7b in __rust_try () from /root/op-vm/op-vm.linux-x64-gnu.node
#25 0x00007ffd4f17b8ca in std::panicking::try (f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panicking.rs:550
#26 std::panic::catch_unwind (f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panic.rs:358
#27 0x00007ffd4f1d4663 in wasmer::backend::sys::entities::function::gen_fn_callback_s2::func_wrapper::{{closure}} () at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wasmer-6.0.1/src/backend/sys/entities/function/mod.rs:591
#28 0x00007ffd4f1f72b0 in wasmer_vm::trap::traphandlers::on_host_stack::{{closure}} () at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wasmer-vm-6.0.1/src/trap/traphandlers.rs:1015
#29 0x00007ffd4f140928 in <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/core/src/panic/unwind_safe.rs:272
#30 0x00007ffd4f1a4e1a in std::panicking::try::do_call (data=0x7ed4e49e10c8 "\300\250\032L\375\177") at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panicking.rs:587
#31 0x00007ffd4f17bd7b in __rust_try () from /root/op-vm/op-vm.linux-x64-gnu.node
#32 0x00007ffd4f17a88e in std::panicking::try (f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panicking.rs:550
#33 std::panic::catch_unwind (f=...) at /rustc/05f9846f893b09a1be1fc8560e33fc3c815cfecb/library/std/src/panic.rs:358
#34 0x00007ffd4f183f88 in corosensei::unwind::catch_unwind_at_root (f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/corosensei-0.2.1/src/unwind.rs:228
#35 0x00007ffd4f19a40b in corosensei::coroutine::on_stack::wrapper (ptr=0x7ffd4c1aa660 "\300\250\032L\375\177") at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/corosensei-0.2.1/src/coroutine.rs:568
#36 <signal handler called>
#37 0x00007ffd4f96327f in corosensei::arch::x86_64::on_stack (arg=0x7ffd4c1aa660 "\300\250\032L\375\177", stack=..., f=0x7ffd4f19a3b0 <corosensei::coroutine::on_stack::wrapper>)
    at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/corosensei-0.2.1/src/unwind.rs:137
#38 0x00007ffd4f199b83 in corosensei::coroutine::on_stack (stack=..., f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/corosensei-0.2.1/src/coroutine.rs:581
#39 0x00007ffd4f198bc0 in corosensei::coroutine::Yielder<Input,Yield>::on_parent_stack (self=0x7ffd4c1aaff0, f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/corosensei-0.2.1/src/coroutine.rs:535
#40 0x00007ffd4f1f58c2 in wasmer_vm::trap::traphandlers::on_host_stack (f=...) at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wasmer-vm-6.0.1/src/trap/traphandlers.rs:1013
#41 0x00007ffd4f1d3ea7 in wasmer::backend::sys::entities::function::gen_fn_callback_s2::func_wrapper (env=0x7ffcd8902ac0, A1=18352, A2=37936)
    at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wasmer-6.0.1/src/backend/sys/entities/function/mod.rs:590
#42 0x00007ffeac0ba093 in ?? ()
#43 0x0000000000000000 in ?? ()