segfault during 'run-in-thread'

spk121 commented 3 years ago

In writing a test case for the <GTask>, I've caused a SIGSEGV in SCM_STACK_CHECK when calling class-slot-ref on the <GTask>. The segfault occurs in code being run in the thread spawned by g_run_in_thread.

Root cause TBD, but, I expect that we need a call to scm_init_guile() in any GLib-managed thread.

A reproducing case follows

spk121 commented 3 years ago

segv-task-thread.scm.txt

spk121 commented 3 years ago

But replacing GLib's run-in-thread with Guile's call-with-new-thread seems okay so far. Maybe the fix is don't do that.

LordYuuma commented 3 years ago

Having used Guile from GTK without Guile-GI/G-Golf before, I do faintly remember, that you have to scm_init_guile in that thread. I think in our case, we might want to do that when entering callbacks and closures. Would that solve the issue or are things also buried more deeply in gig_argument etc?

spk121 commented 3 years ago

That's probably enough. It appears that g_task_thread_pool_thread FFI goes to callback_binding() before trying to unpack any arguments, and that calling scm_init_guile in callback_binding() fixes.

That then reveals a further problem in this test. The creation of a <GTask> doesn't require a source object, but, the source object of a <GTask> ends up passed to the callback GTaskThreadFunc where the source object is not marked nullable. But that's a separate issue.

spk121 commented 3 years ago

OK, more info. OK, at some point I'd converted some of the type errors under gig_argument_c_to_scm to Guile errors scm_wrong_type etc.

After adding scm_init_guile to callback_binding we're always in a Guile-managed thread, but, if that thread was called by g_task_run_in_thread, it is not within a top-level scm_c_catch or scm_c_with_exception_handler as far as I can tell. Guile errors are not caught and cause the program to end.

One brute force solution would be to enclose the guts of callback_binding in a continuation barrier.

LordYuuma commented 3 years ago

Perhaps we should use scm_with_guile instead of scm_init_guile. Firstly, it neatly cleans up the Guile mode, but it also wraps the function in scm_with_continuation_barrier. FWIW my initial implementation of callbacks also had an exception handler, that did a stack trace.

spk121 commented 3 years ago

I like that idea.

spk121 commented 3 years ago

I think what I'd really want in callback_binding and c_callback_binding is to install a continuation barrier only if we know that one doesn't already exist.

    if (scm_thread_is_known_to_guile ())
        callback_binding_inner(&args);
    else {
       if (NULL == scm_with_guile (callback_binding_inner, &args))
           abort ();
    }

Guile-managed threads can throw and be caught from before the FFI call, but, non-Guile-originated threads will install a local continuation barrier and abort on error. At the moment, I don't know how to implement scm_thread_is_known_to_guile using Guile API. If you have access to Guile internals, it is just

int
scm_thread_is_known_to_guile()
{
  if (SCM_I_CURRENT_THREAD)
    return 1;
  return 0;
}

LordYuuma commented 3 years ago

Actually, you'd have to check SCM_I_CURRENT_THREAD->guile_mode. Perhaps we can instead keep a set of tids, that are currently in Guile mode?

spk121 commented 3 years ago

OK. I changed the logic to catch callback errors and immediately quit if we're not in the main thread, or to throw if we're in the main thread, which is a decent compromise. This saves me from trying to figure out if a thread is guileified and if I need to re-throw a caught error

LordYuuma commented 3 years ago

I don't think I particularly like the eval exit there. Would there be a better way to at least print the stack trace first and then call scm_exit or exit ourselves as a procedure?

Also I feel as though the callback thread fluid should be #t once the callback has been spawned, so that inner callbacks can still throw errors up the stack as it were, only the topmost one can't. Pardon me if that's already done and I'm just misreading the source code there.

spk121 commented 3 years ago

For the latter point, that's a great idea, and I hadn't thought of that. For the former, the scm_with_guile should dump the stack trace for the callback because it installs an exception handler. It won't do the stack trace of the main thread, though. Is that what you're asking?

Also, there is an scm_primitive_exit, but, the clean exit is only an API from scheme.

LordYuuma commented 3 years ago

Hmm, as long as the relevant code portion is accurately traced, that's fine. IIUC in the main thread, the error is thrown upwards anyway, so intermediate traces don't need printing. I still feel that we shouldn't have a raw "scm_eval" in there, though, perhaps call by module-ref?

spk121 commented 3 years ago

Interestingly, while the current tree apparently checks out fine in both Azure and Travis, the test/task.scm test in d719e466b0b2e87f3bb03e20ab8a1877c8176a40 always segfaults on a couple of my boxes. The segfault, as near as I can tell, is due to a foreign function being called after atexit has kicked off the freeing of the FFI-defined functions and callback in gig_function.c and gig_callback.c. So really this is more aligned with #60

LordYuuma commented 3 years ago

It runs fine in Travis/Azure, because the test is skipped. Perhaps you want primitive-_exit instead of hard exit? Downside is we'd have to unwind the stack on our own.

spk121 / guile-gi

segfault during 'run-in-thread' #105