Closed larskanis closed 3 months ago
I've seen similar failures in Nokogiri's test suite, though it appears to be sporadic for me and for an earlier version of TR. I'm happy to open a separate bug report, but the stack trace is so similar I thought I'd start here.
Example test output: https://github.com/sparklemotion/nokogiri/actions/runs/8194425716/job/22410492610#step:7:114
Version: truffleruby 23.1.2, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
dead handle 0xbad000000023028 (com.oracle.truffle.api.CompilerDirectives.ShouldNotReachHere)
from com.oracle.truffle.api.CompilerDirectives.shouldNotReachHere(CompilerDirectives.java:574)
from com.oracle.truffle.api.CompilerDirectives.shouldNotReachHere(CompilerDirectives.java:520)
from org.truffleruby.cext.UnwrapNode$UnwrapNativeNode.raiseError(UnwrapNode.java:107)
from org.truffleruby.cext.UnwrapNode$UnwrapNativeNode.unwrapTaggedObject(UnwrapNode.java:92)
from org.truffleruby.cext.UnwrapNodeGen$UnwrapNativeNodeGen$Inlined.executeAndSpecialize(UnwrapNodeGen.java:421)
from org.truffleruby.cext.UnwrapNodeGen$UnwrapNativeNodeGen$Inlined.execute(UnwrapNodeGen.java:387)
from org.truffleruby.cext.UnwrapNode.longToWrapper(UnwrapNode.java:270)
from org.truffleruby.cext.UnwrapNodeGen$Inlined.executeAndSpecialize(UnwrapNodeGen.java:183)
from org.truffleruby.cext.UnwrapNodeGen$Inlined.execute(UnwrapNodeGen.java:158)
from org.truffleruby.cext.CExtNodes$CallWithCExtLockAndFrameAndUnwrapNode.callWithCExtLockAndFrame(CExtNodes.java:258)
from org.truffleruby.cext.CExtNodesFactory$CallWithCExtLockAndFrameAndUnwrapNodeFactory$CallWithCExtLockAndFrameAndUnwrapNodeGen.executeAndSpecialize(CExtNodesFactory.java:577)
from org.truffleruby.cext.CExtNodesFactory$CallWithCExtLockAndFrameAndUnwrapNodeFactory$CallWithCExtLockAndFrameAndUnwrapNodeGen.execute(CExtNodesFactory.java:556)
from org.truffleruby.language.locals.WriteLocalVariableNode.execute(WriteLocalVariableNode.java:28)
from org.truffleruby.language.RubyNode.doExecuteVoid(RubyNode.java:64)
from org.truffleruby.language.control.SequenceNode.execute(SequenceNode.java:34)
from org.truffleruby.core.module.ModuleNodes$DefineMethodNode$CallMethodWithLambdaBody.execute(ModuleNodes.java:1373)
from org.truffleruby.language.RubyLambdaRootNode.execute(RubyLambdaRootNode.java:84)
/home/runner/.rubies/truffleruby-23.1.2/lib/truffle/truffle/cext_ruby.rb:23:in `parent'
from /home/runner/work/nokogiri/nokogiri/test/xml/test_node_set.rb:602:in `block (4 levels) in <class:TestNodeSet>'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:237:in `block in each'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `upto'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `each'
dead handle 0xbad... (com.oracle.truffle.api.CompilerDirectives.ShouldNotReachHere) usually means that some VALUE field was not marked properly by a C extension. TruffleRuby may call the marking functions more often or at different places than CRuby: CRuby calls them during GC, which is not possible on the JVM, so TruffleRuby calls them e.g. after returning from a method defined in C that uses DATA_PTR.
So it could be a field that the marking function fails to mark, or it could be a missing RB_GC_GUARD.
The Ruby part of the backtrace usually gives some hint about which field/local variable it is about.
Since 24.0, native extensions are executed natively, which means all VALUE variables are handles, whereas before only those that escaped to the native heap not managed by Sulong (e.g. stored in malloc'd memory or passed to some system library) were handles. That makes it more likely to discover such issues.
Interesting. The two most recent Nokogiri errors I have found have Ruby backtraces pointing to tests that deal with duplicating nodes:
/home/runner/.rubies/truffleruby-23.1.2/lib/truffle/truffle/cext_ruby.rb:23:in `parent'
from /home/runner/work/nokogiri/nokogiri/test/xml/test_node_set.rb:602:in `block (4 levels) in <class:TestNodeSet>'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:237:in `block in each'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `upto'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `each'
from /home/runner/work/nokogiri/nokogiri/test/xml/test_node_set.rb:601:in `test_0002_wraps each node within a dup of the Node argument'
and
/home/runner/.rubies/truffleruby-head/lib/truffle/truffle/cext_ruby.rb:24:in `[]'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:237:in `block in each'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `upto'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/node_set.rb:236:in `each'
from <internal:core> core/enumerable.rb:594:in `any?'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/document_fragment.rb:103:in `css'
from /home/runner/work/nokogiri/nokogiri/lib/nokogiri/xml/searchable.rb:144:in `at_css'
from /home/runner/work/nokogiri/nokogiri/test/xml/test_document_fragment.rb:304:in `test_dup_creates_mutable_tree'
I'll take a deeper look when I get a chance.
Observed a very similar error for google-protobuf on truffleruby+graalvm-24.0.0.
What's interesting is that it is thrown just from loading the cext with require 'google/protobuf_c': https://github.com/protocolbuffers/protobuf/blob/v26.0/ruby/lib/google/protobuf_native.rb#L15
dead handle 0xbad000000018070 (com.oracle.truffle.api.CompilerDirectives.ShouldNotReachHere)
from com.oracle.truffle.api.CompilerDirectives.shouldNotReachHere(CompilerDirectives.java:574)
from com.oracle.truffle.api.CompilerDirectives.shouldNotReachHere(CompilerDirectives.java:520)
from org.truffleruby.cext.UnwrapNode$UnwrapNativeNode.raiseError(UnwrapNode.java:107)
from org.truffleruby.cext.UnwrapNode$UnwrapNativeNode.unwrapTaggedObject(UnwrapNode.java:92)
from org.truffleruby.cext.UnwrapNodeGen$UnwrapNativeNodeGen$Inlined.execute(UnwrapNodeGen.java:377)
from org.truffleruby.cext.UnwrapNode.longToWrapper(UnwrapNode.java:270)
from org.truffleruby.cext.UnwrapNodeGen$Inlined.execute(UnwrapNodeGen.java:143)
from org.truffleruby.cext.ValueWrapperManager$UnwrapperFunction.execute(ValueWrapperManager.java:401)
from org.truffleruby.cext.UnwrapperFunctionGen$InteropLibraryExports$Cached.execute(UnwrapperFunctionGen.java:117)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMDispatchNode$LLVMLookupDispatchForeignNode.doGeneric(LLVMDispatchNode.java:459)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMDispatchNode$LLVMLookupDispatchForeignNode.doUnknownType(LLVMDispatchNode.java:487)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMDispatchNodeGen$LLVMLookupDispatchForeignNodeGen.execute(LLVMDispatchNodeGen.java:1471)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMDispatchNode.doForeignExecutable(LLVMDispatchNode.java:380)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMDispatchNodeGen.executeDispatch(LLVMDispatchNodeGen.java:272)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMCallNode.doCall(LLVMCallNode.java:82)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMCallNodeGen.executeGeneric(LLVMCallNodeGen.java:37)
from com.oracle.truffle.llvm.runtime.nodes.api.LLVMFrameNullerExpression.doGeneric(LLVMFrameNullerExpression.java:71)
from com.oracle.truffle.llvm.runtime.nodes.api.LLVMFrameNullerExpressionNodeGen.executeGeneric(LLVMFrameNullerExpressionNodeGen.java:29)
from com.oracle.truffle.llvm.runtime.nodes.vars.LLVMWriteNodeFactory$LLVMWritePointerNodeGen.execute_generic1(LLVMWriteNodeFactory.java:1370)
from com.oracle.truffle.llvm.runtime.nodes.vars.LLVMWriteNodeFactory$LLVMWritePointerNodeGen.execute(LLVMWriteNodeFactory.java:1344)
from com.oracle.truffle.llvm.runtime.nodes.base.LLVMBasicBlockNode$InitializedBlockNode.execute(LLVMBasicBlockNode.java:154)
from com.oracle.truffle.llvm.runtime.nodes.control.LLVMDispatchBasicBlockNode.dispatchFromBasicBlock(LLVMDispatchBasicBlockNode.java:116)
from com.oracle.truffle.llvm.runtime.nodes.control.LLVMDispatchBasicBlockNode.doDispatch(LLVMDispatchBasicBlockNode.java:87)
from com.oracle.truffle.llvm.runtime.nodes.control.LLVMDispatchBasicBlockNodeGen.executeGeneric(LLVMDispatchBasicBlockNodeGen.java:33)
from com.oracle.truffle.llvm.runtime.nodes.control.LLVMFunctionRootNode.doRun(LLVMFunctionRootNode.java:81)
from com.oracle.truffle.llvm.runtime.nodes.control.LLVMFunctionRootNodeGen.executeGeneric(LLVMFunctionRootNodeGen.java:34)
from com.oracle.truffle.llvm.runtime.nodes.func.LLVMFunctionStartNode.execute(LLVMFunctionStartNode.java:102)
/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/truffle/truffle/cext.rb:2248:in `block in resolve_registered_addresses'
from /home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/truffle/truffle/cext.rb:2247:in `each'
from /home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/truffle/truffle/cext.rb:2247:in `resolve_registered_addresses'
from /home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/truffle/truffle/cext.rb:220:in `init_extension'
from <internal:core> core/kernel.rb:229:in `gem_original_require'
from <internal:/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/mri/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/google-protobuf-4.26.0/lib/google/protobuf_native.rb:15:in `<top (required)>'
from <internal:core> core/kernel.rb:229:in `gem_original_require'
from <internal:/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/mri/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/google-protobuf-4.26.0/lib/google/protobuf.rb:57:in `<module:Protobuf>'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/google-protobuf-4.26.0/lib/google/protobuf.rb:15:in `<module:Google>'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/google-protobuf-4.26.0/lib/google/protobuf.rb:14:in `<top (required)>'
from <internal:core> core/kernel.rb:229:in `gem_original_require'
from <internal:/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/mri/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/ext/sass/embedded_sass_pb.rb:5:in `<top (required)>'
from <internal:core> core/kernel.rb:292:in `require_relative'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass/embedded_protocol.rb:6:in `<module:EmbeddedProtocol>'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass/embedded_protocol.rb:5:in `<module:Sass>'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass/embedded_protocol.rb:3:in `<top (required)>'
from <internal:core> core/kernel.rb:292:in `require_relative'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass/compiler.rb:11:in `<top (required)>'
from <internal:core> core/kernel.rb:292:in `require_relative'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass/embedded.rb:3:in `<top (required)>'
from <internal:core> core/kernel.rb:292:in `require_relative'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/lib/sass-embedded.rb:4:in `<top (required)>'
from <internal:core> core/kernel.rb:229:in `gem_original_require'
from <internal:/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/mri/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/spec/spec_helper.rb:3:in `<top (required)>'
from <internal:core> core/kernel.rb:229:in `gem_original_require'
from <internal:/home/runner/.rubies/truffleruby+graalvm-24.0.0/lib/mri/rubygems/core_ext/kernel_require.rb>:37:in `require'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/spec/sass/compile_error_spec.rb:3:in `<top (required)>'
from <internal:core> core/kernel.rb:378:in `load'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/configuration.rb:2138:in `load_file_handling_errors'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/configuration.rb:1638:in `block in load_spec_files'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/configuration.rb:1636:in `each'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/configuration.rb:1636:in `load_spec_files'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/runner.rb:102:in `setup'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/runner.rb:86:in `run'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/runner.rb:71:in `run'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/lib/rspec/core/runner.rb:45:in `invoke'
from /home/runner/work/sass-embedded-host-ruby/sass-embedded-host-ruby/vendor/bundle/truffleruby/3.2.2.24.0.0.2/gems/rspec-core-3.13.0/exe/rspec:4:in `<main>'
@ntkme Could you file a separate issue for that one?
Given it happens with resolve_registered_addresses, it seems quite different, and it might happen every time rather than being transient?
Going forward, it seems best to file separate issues (one per gem) for dead handle errors, because they are very likely to be specific to some code in the gem, and separate issues make it much easier to track and investigate each one and keep the information together.
@flavorjones Could you also file a separate issue for nokogiri, so that this one is only about ruby-pg?
@eregon It happens rarely. Most of the time it works just fine. I created a new issue here: https://github.com/oracle/truffleruby/issues/3500
I finally got truffleruby-24.x running locally, and it now looks like a bug in TruffleRuby 24.0. The code in ruby-pg is fairly simple. Three Float numbers are calculated in the init of the extension like so:
s_nan = rb_eval_string("0.0/0.0");
rb_global_variable(&s_nan);
s_pos_inf = rb_eval_string("1.0/0.0");
rb_global_variable(&s_pos_inf);
s_neg_inf = rb_eval_string("-1.0/0.0");
rb_global_variable(&s_neg_inf);
Then in a C-func the value is returned like so:
static VALUE
pg_text_dec_float(...){
return s_neg_inf;
}
But by that time the returned Float object is no longer valid, resulting in the dead handle error on truffleruby-24.x.
The same happens with any floating point number. But nothing crashes when I use some other Ruby object. For instance, if I use a String but do not register the global variable, like so:
s_nan = rb_eval_string("'xyz'");
// rb_global_variable(&s_nan);
then TruffleRuby crashes with very much the same error as with a Float object. But when I uncomment the rb_global_variable call, no crash happens, because the String is properly marked.
This is in contrast to Float objects. They fail with the dead handle error regardless of the rb_global_variable call. It also doesn't matter if I change rb_global_variable to rb_gc_register_mark_object or rb_define_const. None of them seem to mark a Float object.
@larskanis Thank you for the investigation and details, this makes it a lot easier to look into it.
I can reproduce the issue reliably in a C API spec, with both bignums and floats (v = LONG2NUM(INT64_MAX);, v = DBL2NUM(0.0/0.0);, v = rb_eval_string("0.0/0.0")).
The issue in the Float case seems to be that the ValueWrapper is not kept alive after the Init_ function has returned.
Once that wrapper is garbage collected, we lose the mapping from the native address to the Float instance (a java.lang.Double).
rb_global_variable() etc. add the relevant objects (e.g. the Float instance) to GC_REGISTERED_ADDRESSES. Regular Ruby objects hold onto their ValueWrapper too, so that works fine, but for primitives like Float there is no way to store the ValueWrapper in a java.lang.Double instance.
So I think GC_REGISTERED_ADDRESSES should hold onto the ValueWrappers instead of the actual objects they refer to, and that should fix it.
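To make the reasoning above concrete, here is a toy model in plain C. It is not TruffleRuby code: the table, the collector and every toy_ name are hypothetical and only assume the behaviour described in this comment, namely that a handle resolves through a wrapper and that a wrapper nothing references can be collected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WRAPPERS 8

/* Hypothetical stand-in for a wrapper mapping a native handle to a boxed double. */
typedef struct {
    bool in_use;
    bool marked;
    double payload;
} ToyWrapper;

static ToyWrapper toy_table[TOY_MAX_WRAPPERS];

/* Create a wrapper; the returned handle is its slot index plus one (0 = table full). */
static size_t toy_wrap(double d) {
    for (size_t i = 0; i < TOY_MAX_WRAPPERS; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i] = (ToyWrapper){ .in_use = true, .marked = false, .payload = d };
            return i + 1;
        }
    }
    return 0;
}

/* Toy mark-and-sweep: only wrappers whose handles appear in roots survive. */
static void toy_collect(const size_t *roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++) {
        assert(roots[i] >= 1 && roots[i] <= TOY_MAX_WRAPPERS);
        toy_table[roots[i] - 1].marked = true;
    }
    for (size_t i = 0; i < TOY_MAX_WRAPPERS; i++) {
        if (!toy_table[i].marked) {
            toy_table[i].in_use = false; /* wrapper collected */
        }
        toy_table[i].marked = false;     /* reset for the next cycle */
    }
}

/* Resolve a handle; returning false models the "dead handle" error. */
static bool toy_unwrap(size_t handle, double *out) {
    if (handle == 0 || handle > TOY_MAX_WRAPPERS || !toy_table[handle - 1].in_use) {
        return false;
    }
    *out = toy_table[handle - 1].payload;
    return true;
}

/* Register a float either by value only (the wrapper is not a root) or by
 * wrapper, run one collection, and report whether the handle still resolves. */
static bool toy_handle_survives(bool register_wrapper) {
    size_t handle = toy_wrap(1.5);
    assert(handle != 0);
    size_t roots[1];
    size_t nroots = 0;
    if (register_wrapper) {
        roots[nroots++] = handle;
    }
    toy_collect(roots, nroots);
    double out;
    return toy_unwrap(handle, &out);
}
```

In this model toy_handle_survives(false) corresponds to registering only the Float value, where the wrapper is lost at the next collection, and toy_handle_survives(true) corresponds to the proposed fix of keeping the wrapper itself registered.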
For bignums like INT64_MAX, the value actually fits in a Java long, so it is the same case as Float (a ValueWrapper cannot be stored in a java.lang.Long instance).
For "true" bignums that don't fit in a Java long it works fine already, and it's like e.g. Symbols.
Fixnums are not affected because those are tagged pointers like in CRuby, so VALUE->long is done only from the VALUE address and does not need any ValueWrapper/HandleBlock/etc.
true/false/nil/Qundef are fine, those wrappers are always held alive and have special addresses.
And there are no other kinds of "primitives", the rest is all RubyDynamicObject/ImmutableRubyObject.
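As a rough illustration of why tagged fixnums need no wrapper, the sketch below mirrors CRuby's scheme, where a fixnum n is encoded as (n << 1) | 1 and decoded from the VALUE bits alone. The toy_ names are hypothetical, and that TruffleRuby uses the same bit layout is an assumption based on the "like in CRuby" remark above.

```c
#include <stdint.h>

/* Hypothetical stand-in for VALUE: a pointer-sized unsigned integer. */
typedef uintptr_t toy_value;

#define TOY_FIXNUM_FLAG ((toy_value)1)

/* A value is a fixnum when its low bit is set (CRuby's convention). */
static int toy_is_fixnum(toy_value v) {
    return (v & TOY_FIXNUM_FLAG) != 0;
}

/* Encode: shift left by one and set the tag bit. n must fit in one bit
 * less than the pointer width. */
static toy_value toy_long_to_fixnum(intptr_t n) {
    return ((toy_value)n << 1) | TOY_FIXNUM_FLAG;
}

/* Decode: clear the tag bit and halve. No table lookup is involved, so
 * there is no wrapper that could be collected. */
static intptr_t toy_fixnum_to_long(toy_value v) {
    return (intptr_t)(v - TOY_FIXNUM_FLAG) / 2;
}
```

Because the integer is recovered purely from the bits of the VALUE, nothing has to be kept alive on the managed side, unlike the Float and small-bignum cases.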
A quick workaround until this is fixed is to run with TRUFFLERUBYOPT="--experimental-options --keep-handles-alive".
That will keep all handles (VALUE) alive by leaking them, so it is obviously just a workaround, but it could be useful in CI until this is fixed.
We should have a fix very soon hopefully.
I've opened a new issue for the nokogiri errors above at https://github.com/oracle/truffleruby/issues/3503
This fix should be included in the 24.0.1 release (Apr 16, 2024).
(And of course it's fixed on master and in truffleruby-dev/head.)
It happens at every run here in the pg specs when returning a float NaN value.
This error has been raised for several weeks in truffleruby-head "24.1.0-dev-3a920de7, like ruby 3.2.2, GraalVM CE Native [x86_64-linux]". It doesn't happen in truffleruby "23.1.2, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]".
Here is a failing CI run: https://github.com/ged/ruby-pg/actions/runs/8115487309/job/22183459895#step:12:458
The output: