Crash on building ECL on macOS 10.5 i386 with modern Clang

catap commented 1 year ago

I not sure that it is BoehmGC issues, but ECL's maintainer guess that it is. See: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/705

Long story short: an attempt to build ECL on macOS 10.5 i386 fails as:

;;; ECL C Backtrace
;;; 0   ecl_min                             0x00038c9b _ecl_dump_c_backtrace + 47
;;; 1   ecl_min                             0x0002cfc8 ecl_internal_error + 110
;;; 2   ecl_min                             0x0004fa0f deferred_signal_handler + 0
;;; 3   ecl_min                             0x0004f79d sigsegv_handler + 262
;;; 4   libSystem.B.dylib                   0x0025d05b _sigtramp + 43
;;; 5   ???                                 0xffffffff 0x0 + 4294967295
;;; 6   ecl_min                             0x00051db7 _ecl_sethash_pack + 66
;;; 7   ecl_min                             0x00007a97 cl_export2 + 579
;;; 8   ecl_min                             0x000042ef init_all_symbols + 431
;;; 9   ecl_min                             0x00004ba0 cl_boot + 1065
;;; 10  ecl_min                             0x00003a0e main + 33
;;; 11  ecl_min                             0x00003575 start + 53
/bin/sh: line 1: 76282 Abort trap              ECLDIR=`pwd`/ ./ecl_min compile

If I change compiler from Clang-7 (default at MacPorts) to GCC-7 => it works.

I've run some tests and discovers that Clang-5+ leads to this result. But clang-3.7 works fine.

ivmai commented 1 year ago

I've checked the link above: I don't have an immediate advice how to find the root case, it is strange to see a stack trace with tons of GC_inner_start_routine and GC_wait_marker, like it's an infinite recursion.

catap commented 1 year ago

@ivmai I can't reproduce Sergey's behaviour but one which I shown before is easy to reproduce by switching compilers.

And it seems like a strong lead to the GC because I can find compiler specified code only at GC.

ivmai commented 1 year ago

I can find compiler specified code only at GC.

I don't understand what does it mean

;;; ECL C Backtrace

I don't anything related to bdwgc in the stack trace

catap commented 1 year ago

Ok, I do have some solid evidence that GC might be the cause. With help of git bisect I was able to find the exact commit which introduced this crash: https://gitlab.com/embeddable-common-lisp/ecl/-/commit/6e5016dcb869bb02ab6ee089d4dc9f266b042bf1

We're interesting in chunk:

@@ -398,6 +399,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

which patching a structure:

struct ecl_hashtable {          /*  hash table header  */
        _ECL_HDR2(test,weak);
        struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
        cl_object sync_lock;              /*  synchronization lock  */
        cl_index entries;       /*  number of entries  */
        cl_index size;          /*  hash table size  */
        cl_index limit;         /*  hash table threshold (integer value)  */
        cl_object rehash_size;  /*  rehash size  */
        cl_object threshold;    /*  rehash threshold  */
        double factor;          /*  cached value of threshold  */
        cl_object (*get)(cl_object, cl_object, cl_object);
        cl_object (*set)(cl_object, cl_object, cl_object);
        bool (*rem)(cl_object, cl_object);
        /* Unsafe variants are used to store the real accessors when
           the synchronized variant is bound to get/set/rem. */
        cl_object (*get_unsafe)(cl_object, cl_object, cl_object);
        cl_object (*set_unsafe)(cl_object, cl_object, cl_object);
        bool (*rem_unsafe)(cl_object, cl_object);
};

Simple adds two more fields.

This structure is allocated as

    cl_object obj;
    ecl_disable_interrupts_env(the_env);
    obj = (cl_object)GC_MALLOC(type_info[t].size);
    ecl_enable_interrupts_env(the_env);
    obj->d.t = t;
    return obj;

And size is defined that that structure as sizeof(struct ecl_hashtable).

If I hacked code and put 2 * sizeof(struct ecl_hashtable) => it works.

So, let me summarizy. If I use https://gitlab.com/embeddable-common-lisp/ecl/-/commit/aa985f566fdedd45e2c74774d6e81f2442dd3802 as local root it works.

When I apply patch:

--- a/src/h/object.h
+++ b/src/h/object.h
@@ -398,6 +398,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

=> it crashes, but when I reserve twi time more space via hack:

--- a/src/c/alloc_2.d
+++ b/src/c/alloc_2.d
@@ -860,9 +860,9 @@ init_alloc(void)
   init_tm(t_symbol, "SYMBOL", sizeof(struct ecl_symbol), 5);
   init_tm(t_package, "PACKAGE", sizeof(struct ecl_package), -1); /* 36 */
 #ifdef ECL_THREADS
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 3);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 3);
 #else
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 4);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 4);
 #endif
   init_tm(t_array, "ARRAY", sizeof(struct ecl_array), 3);
   init_tm(t_vector, "VECTOR", sizeof(struct ecl_vector), 2);

well.. it works again.

Have I missed something? Thus, when I change compiler, it masks the issue.

catap commented 1 year ago

The issue is reproduced when I build GC from 8.2.4 release, 75e522cdbe009f103791e9cefe84f4f96f19c1b7 (release-8.2 branch) and 07a6d0ee8889bca5eaeadc13cabadc363725d216 (master branch).

catap commented 1 year ago

And libatomic-7.6.6 with boehmgc-7.6.8 reproduces the issue.

catap commented 1 year ago

And libatomic-7.4.4 with boehmgc-7.6.0 reproduces the issue.

catap commented 1 year ago

From another hand this issue can't be reproduced by GCC-7.5.0 that means that it might be clang-only things.

catap commented 1 year ago

From another hand sizeof(struct ecl_hashtable) is 72 on both compilers: GCC-7.5.0 and Clang-7.1.0.

@ivmai do you agree that it seems like GC issue?

Do you have any suggestion to dig?

catap commented 1 year ago

Probably another related issue on macOS: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/718

catap commented 1 year ago

I was able to reproduce it with some probability on my laptop with macOS 12.

I've applied a patch on 8.2.4:

--- darwin_stop_world.c
+++ darwin_stop_world.c
@@ -639,6 +639,11 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
+          if ((((p) -> flags & FINISHED) != FINISHED)
+              && kern_result == KERN_TERMINATED)
+            continue;
+          if (kern_result == KERN_TERMINATED)
+            ABORT("thread_suspend failed: it was already termindated");
           if (kern_result != KERN_SUCCESS)
             ABORT("thread_suspend failed");
           if (GC_on_thread_event)

and run tests of cl-njson multiple times via ECL with that boehmgc. Let say 1-out-of-15 attempts fails as:

thread_suspend failed: it was already termindated

catap commented 1 year ago

Well.. seems that it broke stack or some memory somehow.

A function thread_suspend is looks like:

kern_return_t
thread_suspend(thread_t thread)
{
    kern_return_t result = KERN_SUCCESS;

    if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
        return KERN_INVALID_ARGUMENT;
    }

    thread_mtx_lock(thread);

    if (thread->active) {
        if (thread->user_stop_count++ == 0) {
            thread_hold(thread);
        }
    } else {
        result = KERN_TERMINATED;
    }

    thread_mtx_unlock(thread);

    if (thread != current_thread() && result == KERN_SUCCESS) {
        thread_wait(thread, FALSE);
    }

    return result;
}

from here: https://github.com/apple-open-source/macos/blob/master/xnu/osfmk/kern/thread_act.c#L382-L408

and if I modify mine patch to:

@@ -639,8 +639,13 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
-          if (kern_result != KERN_SUCCESS)
+          if (kern_result != KERN_SUCCESS) {
+            fprintf(stderr, "KERN_SUCCESS: %d\n", KERN_SUCCESS);
+            fprintf(stderr, "KERN_TERMINATED: %d\n", KERN_TERMINATED);
+            fprintf(stderr, "KERN_INVALID_ARGUMENT: %d\n", KERN_INVALID_ARGUMENT);
+            fprintf(stderr, "kern_result: %d\n", kern_result);
             ABORT("thread_suspend failed");
+          }
           if (GC_on_thread_event)
             GC_on_thread_event(GC_EVENT_THREAD_SUSPENDED,
                                (void *)(word)(p -> mach_thread));

and run it a lot of times (more than hundred) in the loop, it might fails as:

:info:build KERN_SUCCESS: 0
:info:build KERN_TERMINATED: 37
:info:build KERN_INVALID_ARGUMENT: 4
:info:build kern_result: 268435459
:info:build thread_suspend failed

ivmai commented 1 year ago

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

catap commented 1 year ago

@ivmai i have no idea how.

I run tests via ECL and it fails. Rarely.

catap commented 1 year ago

@ivmai I've added one clean debug log line:

          if (kern_result != KERN_SUCCESS) {
# ifdef DEBUG_THREADS
            GC_log_printf("thread_suspend(%d) returns %d\n", p->stop_info.mach_thread, kern_result);
# endif
            ABORT("thread_suspend failed");
          }

and rebuild GC with -DDEBUG_THREADS which leads to log:

:info:build Stopping the world from thread 0x303
:info:build thread_suspend(259) returns 4
:info:build thread_suspend failed
:info:build An error occurred during initialization:
:info:build Error code 6 when executing
:info:build (EXT:RUN-PROGRAM "/usr/bin/clang" Stopping the world from thread 0x103
:info:build World stopped from 0x103
:info:build Darwin: Stack for thread 0x103 is [0x7ff7b5a81e80,0x7ff7b5a8c000)
:info:build thread_get_state returns 0
:info:build Darwin: Stack for thread 0x2803 is [0x700003188d08,0x700003188f98)

259 is 0x103 and 4 is KERN_INVALID_ARGUMENT which may occures only on condition:

    if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
        return KERN_INVALID_ARGUMENT;
    }

It defently not a kernel task, and can't be THREAD_NULL because it is used to stop the world after abort.

Something very wrong here.

catap commented 1 year ago

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

And I've figured out it.

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads.

GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

As result thread_suspend returns KERN_INVALID_ARGUMENT or 268435459 which is (ipc/send) invalid destination port by mach_error_string.

As a naive approach I've tried to move GC_remove_all_threads_but_me() inside fork_child_proc() to the first line, but it doesn't help.

catap commented 1 year ago

The proof. I've used one commit ahead of 8.2.4: https://github.com/catap/bdwgc/commit/a7e17f92ff5aac44d457ad2b7f28361232445c2a which adds getpid() to the log.

If may fail on test case with output like:

[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b6401280,0x7ff7b640c000)
Starting mark helper for mark number 566
Starting mark helper for mark number 566
World starting
Resuming thread 0xb03 with state 3
World started
Total stacks size: 56464
;;; Note:
;;;   Invoking external command:
;;;   clang -I. -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/ -DECL_API -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/c -DDEBUG_THREADS -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/src/c -c eclinit3Cg6rN.c -o eclinit3Cg6rN.o 
;;; Note:
;;;   Invoking external command:
[14430] Stopping the world from thread 0x303
[14430] thread_suspend(0xb03) returns (ipc/send) invalid destination port
thread_suspend failed
[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b63f0a00,0x7ff7b640c000)
Starting mark helper for mark number 567
Starting mark helper for mark number 567
World starting
Resuming thread 0xb03 with state 3
World started

and 0x303 it is the first and the only appears of 0x303; thus no update thread id which means GC_remove_all_threads_but_me() hasn't been called or hasn't updated the reference (yet?)

ivmai commented 1 year ago

Was GC_remove_all_threads_but_me called?

catap commented 1 year ago

@ivmai I've added a log line into GC_remove_all_threads_but_me and can't find it in output => seems that it hadn't been called.

Which is very strange.

catap commented 1 year ago

Yes, it is happened after fork. I was able to recover a stack trace:

0   libsystem_kernel.dylib              0x7ff81e0a7ffe __pthread_kill + 10
1   libsystem_pthread.dylib             0x7ff81e0de1ff pthread_kill + 263
2   libsystem_c.dylib                   0x7ff81e029d24 abort + 123
3   libgc.1.dylib                          0x10a5c4bc3 GC_stop_world.cold.1 + 43
4   libgc.1.dylib                          0x10a5c340f GC_stop_world + 276
5   libgc.1.dylib                          0x10a5b3010 GC_stopped_mark + 70
6   libgc.1.dylib                          0x10a5b2ecd GC_try_to_collect_inner + 311
7   libgc.1.dylib                          0x10a5b40fe GC_collect_or_expand + 189
8   libgc.1.dylib                          0x10a5b4380 GC_allocobj + 265
9   libgc.1.dylib                          0x10a5b8baa GC_generic_malloc_inner + 253
10  libgc.1.dylib                          0x10a5b8e3a GC_generic_malloc + 83
11  libecl.23.9.9.dylib                    0x10a9eb821 ecl_alloc_atomic + 32
12  libecl.23.9.9.dylib                    0x10a99db75 si_make_sequence_output_stream + 333
13  libecl.23.9.9.dylib                    0x10a9aeb23 si_string_to_octets + 499
14  libecl.23.9.9.dylib                    0x10a9e35dd si_spawn_subprocess + 577
15  libecl.23.9.9.dylib                    0x10a96928e si_run_program + 1744

inside ECL the code which leads to an issue looks like:

    child_pid = fork();
    if (child_pid == 0) {
      /* Child */
      int j;
      cl_object p;
      char **argv_ptr = ecl_alloc((ecl_length(argv) + 1) * sizeof(char*));
      for (p = argv, j = 0; p != ECL_NIL; p = ECL_CONS_CDR(p)) {
        cl_object arg = si_string_to_octets(3, ECL_CONS_CAR(p), @':null-terminate', ECL_T);
        argv_ptr[j++] = (char*)arg->base_string.self;
      }
      argv_ptr[j] = NULL;

catap commented 1 year ago

Status update:

this isn't related to original issue which can be covered by GCC;
the root cause of issue was... nobody setup handle_fork, see: https://gitlab.com/embeddable-common-lisp/ecl/-/merge_requests/303

ivmai commented 1 year ago

Okay, more investigation is needed for the original issue.

catap commented 1 year ago

@ivmai yes, but I'm out of idea how to proceed

catap commented 1 year ago

@ivmai what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

For example scala-native uses that code and instead of replicate ECL's changes in scala-native and many other software I feel that it simple should be enabled by default.

ivmai commented 1 year ago

what do you think to enable "handle fork" by default for macOS? Base on behavior of mach_thread_self() I feel that it is common and very difficult to track issue.

Yes, agree, but after implementing #103

As of now: # The incremental mode conflicts with fork handling on Darwin.

ivmai commented 11 months ago

what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

Implemented by commit https://github.com/ivmai/bdwgc/commit/ba2861e70ccbc8ecd4d7611cf5433d30e76f68a1 @catap, please test it

ivmai commented 11 months ago

As I understand the original issue is not resolve, so I'm leaving it open.

cooljeanius commented 7 months ago

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

catap commented 7 months ago

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

It still fails as it was described here: https://github.com/ivmai/bdwgc/issues/569#issuecomment-1732285221

ivmai commented 7 months ago

Is there any fix proposed?

ivmai commented 7 months ago

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads. GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

This is not fixed, right?

catap commented 7 months ago

@ivmai let me summarize everything.

Here I've mixed up two issues.

thread related which was fixed as far as I recall.
Something brokes stack -- not fixed.

(2) can be reproduced only by clang on i386 on macOS. I can't reproduce it by gcc.

ivmai / bdwgc

Crash on building ECL on macOS 10.5 i386 with modern Clang #569