Closed seanmakesgames closed 8 years ago
Is this still happening on latest, cause my gvl detection code was way off...
I'm doing a production push tomorrow with a handful of concurrency fixes in ruby and latest mini_racer. Some of the ruby invalid state will go away. Should know if the other stuff is happening by end of day pacific time :)
looks like the ruby weirdness is fixed now (I was not handling FailedV8Conversion correctly in my procs)
The segfaults remain.
Here's the line of code they are happening on:
DatabaseConnection.client.database[collection_name.to_sym]
collection_name is a string from JS
the client is a Mongo ruby driver Client which is shared by all the calls in the worker process.
spoke too soon. The other thing is still happening as well. Just got an instance
One of my players did a code review of the c code (no experience with ruby c api) and had the following comments. (Breaking these up into separate comment blocks)
-----BEGIN COMMENT 1------
result = rb_rescue((VALUE(*)(...))&protected_callback, (VALUE)(&callback_data),
(VALUE(*)(...))&rescue_callback, (VALUE)(&callback_data));
protectedcallback has type VALUE()(VALUE rdata) and rescuecallback has type VALUE()(VALUE rdata, VALUE exception)
Both are being cast to a function of type VALUE(*)(...)
These casts cause undefined behavior. Pointers to functions of different types are completely incompatible with each other, and casting function pointers like that is undefined behavior. varargs functions in particular are extremely finicky beasts that have completely different calling conventions and semantics to normal functions. Note also that pointers to functions cannot be cast to void* (function pointers might be a different size than data pointers and void* is only required to be compatible with char* if memory serves)
There are more casts to varargs pointers at the bottom of the file in Init_mini_racer_extension.
-----END COMMENT 1-----
My quick googling into these comments yields: http://stackoverflow.com/questions/11731159/in-c-is-it-ever-safe-to-cast-a-variadic-function-pointer-to-a-function-pointer
https://silverhammermba.github.io/emberb/c/#rescue
VALUE exception = rb_errinfo(); /* get last exception */
rb_set_errinfo(Qnil); /* clear last exception */
----BEGIN COMMENT 2----
I saw a couple instances of this:
isolate->SetData(0, (void*)false);
or
isolate->SetData(0, (void*)true);
where that is later retrieved as:
bool has_gvl = (bool)args.GetIsolate()->GetData(0);
This will PROBABLY work, but it also is technically undefined behavior. Although casting from a bool to a void* is specifically allowed (bools are integral types and any integral types are allowed to be cast to pointers), behavior is only defined when casting a pointer to an integral type of the same size or larger than the pointer. So, if the bool is 1 byte (it could also be 4) and the pointer is 4 bytes (it could also be 8), the cast from void* to bool is technically undefined behavior. I expect most compilers will do the right thing, but they could conceivably implement the cast from bool to pointer as setting a 1 bit in the HIGH order byte of the pointer, and the cast from pointer to bool by taking just the low byte (and thereby missing the bit they set). So, if this ever looks like it is doing the wrong thing, it could well be. There are zero guarantees that those will round trip.
----END COMMENT 2----
----BEGIN COMMENT 3---- THIS is also technically undefined behavior but almost certainly alright:
static void ruby_callback(const FunctionCallbackInfo<Value>& args) {
bool has_gvl = (bool)args.GetIsolate()->GetData(0);
if(has_gvl) {
gvl_ruby_callback((void*)&args);
} else {
rb_thread_call_with_gvl(gvl_ruby_callback, (void*)(&args));
}
}
the argument args is a reference, and then in the gvl_ruby_callback line, there is &args -- get the address of args. There is no guarantee that references have storage -- they could be handled as a pointer in a register or even completely optimized away -- so there is no guarantee that the address-of operator works on references. It almost certainly will, but again, it isn't guaranteed to as far as I can tell based on the spec. (even if it does have storage, I do not think, but can't find any help in the spec, that it is guaranteed that the &args is the same address as the object that was passed in as an argument. Thirdly, it is a const reference. I'm unsure exactly what is in a FunctionCallbackInfo, but if gvl_ruby_callback can modify it, this might be violating the method signature (the cast to void* suppresses the warning), so that might also cause issues.
----END COMMENT 3----
----BEGIN COMMENT 4----
I am a little concerned that it is POSSIBLE that the refs_count on IsolateInfo isn't initialized. There are a few places where it looks like it is allocated in ruby-land (??), and if that is garbage data, it could be anything. 0, negative, positive...
----END COMMENT 4----
ALL of this code works MOST of the time, so I don't think there's anything mostly wrong with any of it.
Unfortunately, my use-case puts mini_racer THROUGH THE RINGER. (100% usage for many-many hours every day; w/ some of the most arcane and awful JS you can imagine.)
Here are some of the exceptions in ruby, since I didn't post them before:
undefined method 'to_sym' for #<Array:0x000000071f92d8> (NoMethodError)
undefined method 'to_sym' for [2016-07-09 14:50:38 -0400, 1000, "rtd", "trust", "market.sell"]:Array (NoMethodError)
undefined method 'to_sym' for ["tier", 1]:Array (NoMethodError)
bundle/ruby/2.2.0/gems/activesupport-4.2.6/lib/active_support/core_ext/object/json.rb:34:in 'to_json': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer.rb:162:in 'eval_unsafe': String W�m���� �Ad@ is not valid UTF-8. (ArgumentError)
I'm calling #to_sym on collection_name which is a passed in parameter to my proc. All of the js being run in these cases is code that is run 1000s of times a day and has a low failure rate, but very uncommonly the same exception.
The segfault from the original comment is VERY consistent, but this is unsurprising because every proc that is called from JS hits this line of code.
Would love to hear your thoughts on all of this. (really appreciate your support!)
Probably unrelated, but the compilers and the clang static analyzer moan about this assignment: https://github.com/discourse/mini_racer/blob/master/ext/mini_racer_extension/mini_racer_extension.cc#L645
Assigned value is garbage or undefined
OMG that's totally relevant!
What happens if the for loop on line 635 doesn't happen or there's an issue converting params from js? We pass the uninit args to the ruby callback-
Or if there's a timeout exception while that loop is processing-- it can't finish the js half of the conversion?
I'm not sure, but this is definitely a spot!
length is properly checked on line 597
if (data->length > 0) {
result = rb_funcall2(data->callback, rb_intern("call"), data->length, data->args);
} else {
result = rb_funcall(data->callback, rb_intern("call"), 0);
}
and data->args are not referenced, but I'm concerned about the mid-v8-conversion case (termination or otherwise). What is in the ALLOC_N data if it's not set by the for loop? (my assumption is QNil or something)
I looked at ALLOC_N and followed it all the way down to this func: http://rxr.whitequark.org/mri/source/gc.c#4664
which does not seem to init the memory. Just seems to malloc it in the proper GC space & correct size.
Not sure this is super relevant, but found it when looking up ALLOC_N: http://clalance.blogspot.com/2011/01/writing-ruby-extensions-in-c-part-12.html re: mem leaks on exception.
Just a quick one, the weekend
I would love to address every one of the issues raised, ask the reviewer if they can suggest fixes.
segfaults are happening at the following locations:
Segmentation fault at 0x00000000000030
Segmentation fault at 0x00000000000018
c-callstack:
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_vm_bugreport+0x51f) [0x7f94ede4813f] vm_dump.c:693
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_bug_context+0xcb) [0x7f94edcdb54b] error.c:425
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(sigsegv+0x3e) [0x7f94eddbbdae] signal.c:879
/lib/x86_64-linux-gnu/libc.so.6 [0x7f94ed8cdd40]
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(vm_search_method+0x4c) [0x7f94ede30b7c] vm_insnhelper.c:836
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(vm_exec_core+0x2625) [0x7f94ede38085] insns.def:1069
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(vm_exec+0x82) [0x7f94ede3b752] vm.c:1435
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(invoke_block_from_c+0x2f6) [0x7f94ede3ed96] vm.c:813
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(vm_invoke_proc+0xe0) [0x7f94ede3f060] vm.c:878
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_vm_invoke_proc+0x1a) [0x7f94ede3f11a] vm.c:897
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(proc_call+0x52) [0x7f94edcea652] proc.c:731
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(vm_call0_body.constprop.79+0x28f) [0x7f94ede3d08f] vm_eval.c:127
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_call0+0x1ba) [0x7f94ede3dcea] vm_eval.c:59
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so(_ZL18protected_callbackm+0x70) [0x7f94e68ea0b1] mini_racer_extension.cc:598
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_rescue2+0xc9) [0x7f94edce3709] eval.c:808
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so(_Z17gvl_ruby_callbackPv+0x36e) [0x7f94e68ea49f] mini_racer_extension.cc:649
.rvm/rubies/ruby-2.2.4/bin/../lib/libruby.so.2.2(rb_thread_call_with_gvl+0x1da) [0x7f94ede569ea] thread.c:1475
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so(_ZL13ruby_callbackRKN2v820FunctionCallbackInfoINS_5ValueEEE+0x85) [0x7f94e68ea733] mini_racer_extension.cc:676
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so(_ZN2v88internal25FunctionCallbackArguments4CallEPFvRKNS_20FunctionCallbackInfoINS_5ValueEEEE+0x13f) [0x7f94e6dce01f]
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so [0x7f94e6959bcd]
bundle/ruby/2.2.0/bundler/gems/mini_racer-5f3641babe30/lib/mini_racer_extension.so [0x7f94e695a54f]
And in ruby, it's always in my proc on the line where I'm getting the collection from mongo
@SamSaffron The first bullet is much harder, as I have never seen this in my dev / test environment; only on production.
If I understand correctly you want something like:
var result = doWorkFunction();
(JSON.parse(JSON.stringify(result)))();
Reviewer is watching the thread, and will chime in soon, I'm sure. :D
Won't something like this work?
var __r = function(){...code to eval here...};
return JSON.parse(JSON.stringify(__r()));
Or perhaps
return JSON.parse(JSON.stringify( (function(){..code here..})() ));
also, can you ensure 100% that no context are being initialised prior to forking in unicorn. v8 is not fork safe.
wait, do you have a bit more depth on that backtrace I want to determine if it should be calling with or without the gvl
re: code -- do you want ALL of the code in the eval wrapped in an enclosure? (such that the procs are not in the same scope?
re: forking: all contexts are initialized inside of requests or in the delayed job worker process for the requests, I'm confident they are all post-fork for the worker, I'm reasonably confident that jobs are all run in the same long-lived process. ADDENDUM: all issues reference in this github-issue are happening in the delayed_job worker.
re: backtrace, you mean c-level backtrace that I provided before or a ruby one?
re:code, yeah just as a test
re:backtrace ... the c trace from the segfault is good, but does it have more frames/locations to it?
it has one more:
-- C level backtrace information ------------------------------------------- ...previous scrubbed stack [0x25511fb08ba7]
-- Other runtime information -----------------------------------------------
Just curious, what is the json-and-back-to-obj testing? Is it 'corrupt data before leaving js'?
I'm the one who reported these to @seanmakesgames.
So... first off, the comments @seanmakesgames called out as comment 2 and 3 are probably fine, they are pedantic, language-lawyer-y issues that I raised just on the off chance that misbehavior was traced to near those lines. I would be surprised if any sane compiler misbehaved in either case (then again, are any c compilers sane?), but they aren't strictly required to.
Comment 2 is relatively easy to fix, though: cast the void* to a pointer-sized integer type (like size_t) and then cast that to bool. (You could also go bool to size_t to void* on the input side for visual symmetry, but that isn't needed since bool to void* is specified). I'd be pretty amazed if any compilers don't do this implicitly, as I said. But that way you are doing 2 specified/defined casts rather than 1 undefined one.
Comment 3, I'm not sure how to fix that one, I'd need to really dig into the ruby internals. At a glance it looks like changing the method signature isn't necessarily straightforward since it is passed around as a func pointer in other places. I'm inlined to say, leave this one for now and just make a note of it on the off chance that some bug traces back to around this line. The only fixes I can think of are ugly and might invoke copy constructors (which then break reference semantics)....
Comment 4, likewise, I'd have to dig into the ruby to see how that is alloc'd. I just see a bunch of things like this:
ContextInfo* context_info;
Data_Get_Struct(self, ContextInfo, context_info);
[ this instance is from line 456 ]
and ContextInfo includes an IsolateInfo which has the refs_count, and since I can't see where it was initialized, I flagged it to make sure. It might be easier, rather than trying to track all the places this happens, to just convert ContextInfo into a lightweight class with a default ctor that sets default values, so that no matter HOW they are made, that is set.
Which just leaves Comment 1, which is the one I think most likely to cause problems. In general, casting function pointers is just asking for trouble. The first, most obvious, option is to make those two functions have the proper signature that rb_rescue expects. It might mean you have to deal with varargs functions here, which is annoying and verbose, but it would get rid of the need to cast. There are other options, but since you can't change the ruby side of the code, most of them are nonstarters (i.e. you can't change the entire library to use functors because you'd need ruby to cooperate).
EDIT: If those functions are part of the miniracer API and can't have their sigs changed without causing everyone some pain, you could make internal varargs funcs that you use when passing pointers, and have the public non-varargs versions just call into the varargs version. Or the reverse, either way.
Also, related to those casts, on the same line I see (VALUE)(&callback_data), were callback_data is already a VALUE (so &callbackdata is a VALUE). I don't know what the VALUE type is, but casting a VALUE_ to a VALUE looks suspicious to me. But, since this does seem to work at least some of the time, it can't be that bad.
As for the things discussed since.... I just am not familiar enough with how ruby and v8 work to comment on how to handle really poorly timed timeouts/etc that kill code in the worst possible places (other than, to quote @seanmakesgames, "don't do that" -- i.e. make sure that some code is protected from timeouts)
Thanks @dylanryan. these all look pretty straightforward. @SamSaffron I can provide a pr for some if it if you'd like.
sure any help would be more than welcome!
On Sun, Jul 31, 2016 at 9:33 AM, seanmakesgames notifications@github.com wrote:
Thanks @dylanryan https://github.com/dylanryan. these all look pretty straightforward. @SamSaffron https://github.com/SamSaffron I can provide a pr for some if it if you'd like.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/discourse/mini_racer/issues/33#issuecomment-236396133, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAUXe1TVpfr7cIUXJaL9y39Uiw_qSqQks5qa99EgaJpZM4JWl7b .
@SamSaffron putting together the PR now for the code review stuff, but I actually can't do the top-level to_json-and-back test. My sandbox relies too much on global variables. Can you tell me more about what you're trying to test with the top-level serialization?
In my googling for comment number 4, I'm getting pages saying that Data_Wrap_Struct does not automatically call '#initialize' in rubyland. Seems like this would have obvious symptoms?, but maybe not.
To do it, it looks like rb_obj_call_init
needs to be called on the Data_Wrap_Struct object after the wrap call.
I don't know enough about the situation to make any kind of exec decision here, but there's a lot of conflicting examples online, the ones that omit the call to 'call_init' do not make mention of its existence.
After review of comment number 1, it looks to be that RUBY_METHOD_FUNC() is the appropriate way to pass functions into the ruby methods. However, it just resolves to a cast to (VALUE (*)(ANYARGS))(func) where ANYARGS resolves to ... in c++ and empty in c.
I did, though, find all examples that are passing in objects in the 2nd and fourth params, to be of Ruby-allocated VALUE type, whereas we are passing in c-structs and not Data_Wrap_Structs. It looks like the arguments themselves are being properly marshalled into ruby VALUEs but the top level argument remains of struct typedef protected_callback_data
I'm not sure if this causes problems (lots of the struct members are properly typed as VALUE and marshalled), but it does seem wrong to cast memory that's not allocated with the ruby GC to a VALUE type.
Current overall status of comments:
https://github.com/discourse/mini_racer/compare/master...seanmakesgames:cpp_uninit?expand=1
outstanding questions:
a. Is there another way to get at the issue you're trying to test without top-level eval parse/stringify wrapping? b. Does the protected_callback_data struct need to be Ruby-GC-alloc'd for passing into rb_rescue? c. What happens if the ruby proc args allocation fails on line 635 during the loop? Do we get garbage for the rest of the arguments?
Empty args in c is at least sane, compared varargs in c++.
As best I can tell, in c, empty arguments simply means "A function exists, and it has this return value, and all the arguments are ints or doubles, and the number of args it has is whatever you tell me it is.". The compiler then looks at the callsite, and sees, ok, you are calling it with 3 arguments, so the function must have 3 arguments, and you are giving me a char and a short and a float, so that means int, int, double. Ok, cool, the function takes int,int,double. I'll plant all the usual code for that, and if the actual function is different, it'll explode at link time so not my problem." and then it proceeds as a completely normal function call using all the normal rules and calling conventions. (I am not sure how pointers are handled; I suspect that either void* is another type that is converted to, or pointers are converted to ints which will work on 32 bit and explode dramatically on 64 bit).
They key thing: The compiler knows exactly how many arguments and even what their type is in that kind of function call, whereas with varargs it has no idea (because it could be dynamic at runtime based on user input) and on most platforms just uses the stack and lets the called function deal with that. I'd still personally avoid those casts if at all possible (but probably not it seems), BUT assuming that the values are of convertible types, that is actually reasonably sane in c-land. C++? Not so much.
Varargs: "I have no idea but the called function will do it manually. If anyone is wrong, we'll get a runtime error. Well, best-case scenario, we'll get a runtime error, anyway. I really hope we'll get a runtime error.".
Empty args in c: "I know exactly what this is, it is whatever you tell me it is, and you wouldn't lie to me because you want your code to work, therefore this is correct. If it isn't, we'll get a link time error."
Of course, if those func pointers are casted BACK to the proper types in the C code on Ruby's side, then all is right with the world. Func pointers will round trip casts to any other func pointers safely.
Based on the code in that particular file and the extension... I'm guessing it is compiled as C++ despite being mostly straight C..... Well, if that is the officially sanctioned approach and it doesn't fail spectacularly on the majority of machines, maybe they got lucky and all i386/x86_64 systems just happen to generate code that just happens to work in this case.
@dylanryan it does look like they are not casting it before calling it. So I'm not sure. http://rxr.whitequark.org/mri/source/eval.c#708
Either way, it does seem that this is 'as expected' for the ruby c-api :\ - so lets resolve other adjustments.
Bump.
Also here's another c-level exception. Might be caused by uninit mem?
#
#
bundle/ruby/2.2.0/gems/activesupport-4.2.6/lib/active_support/core_ext/time/calculations.rb:37: [BUG] Illegal instruction at 0x007f8cc50c99c9
I have confirmed that these segfaults are related to during script-run GC issues (most likely moving ruby/c-memory around when it's still in use by ruby or c++) Please see #41
This is less of a 'fix something in mini_racer' and more of a 'am I doing this right?' issue.
There are two issues which are causing me problems:
The issues happen inconsistently at least once a day, occurring more often with higher user concurrency.
Context:
-> client sends request to server 'some_js_scriptname' -> unicorn handles request on a worker process -> worker process validates request then schedules job with run info
-> delayed_job worker daemon process picks up job (only one worker; 1 job at a time) -> in delayed_job worker: -> js source code is pulled from mongo -> context1 is lazy-init (stays for entire life of worker process) to validate source code -> context2 is created to run source code -> context2 attaches procs for the source code to access db operations (above issue procs) -> context2 evals -> context2 stop -> context2 result stored in mongo -> worker picks up next job
-> client sends request to server: get_script_results -> unicorn handles req -> gets results for user from mongo
multiple users, multiple clients, multiple script calls, multiple result queries, one worker (atm) runs contexts synchronously.
I'd love to get a better understanding of where I might need to synchronize operations, or where things could be falling apart. Any help is appreciated.