Open catap opened 1 year ago
I've checked the link above: I don't have an immediate advice how to find the root case, it is strange to see a stack trace with tons of GC_inner_start_routine and GC_wait_marker, like it's an infinite recursion.
@ivmai I can't reproduce Sergey's behaviour but one which I shown before is easy to reproduce by switching compilers.
And it seems like a strong lead to the GC because I can find compiler specified code only at GC.
I can find compiler specified code only at GC.
I don't understand what does it mean
;;; ECL C Backtrace
I don't anything related to bdwgc in the stack trace
Ok, I do have some solid evidence that GC might be the cause. With help of git bisect
I was able to find the exact commit which introduced this crash: https://gitlab.com/embeddable-common-lisp/ecl/-/commit/6e5016dcb869bb02ab6ee089d4dc9f266b042bf1
We're interesting in chunk:
@@ -398,6 +399,8 @@ struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
+ cl_object generic_test; /* generic test function */
+ cl_object generic_hash; /* generic hashing function */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
which patching a structure:
struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
cl_object rehash_size; /* rehash size */
cl_object threshold; /* rehash threshold */
double factor; /* cached value of threshold */
cl_object (*get)(cl_object, cl_object, cl_object);
cl_object (*set)(cl_object, cl_object, cl_object);
bool (*rem)(cl_object, cl_object);
/* Unsafe variants are used to store the real accessors when
the synchronized variant is bound to get/set/rem. */
cl_object (*get_unsafe)(cl_object, cl_object, cl_object);
cl_object (*set_unsafe)(cl_object, cl_object, cl_object);
bool (*rem_unsafe)(cl_object, cl_object);
};
Simple adds two more fields.
This structure is allocated as
cl_object obj;
ecl_disable_interrupts_env(the_env);
obj = (cl_object)GC_MALLOC(type_info[t].size);
ecl_enable_interrupts_env(the_env);
obj->d.t = t;
return obj;
And size
is defined that that structure as sizeof(struct ecl_hashtable)
.
If I hacked code and put 2 * sizeof(struct ecl_hashtable)
=> it works.
So, let me summarizy. If I use https://gitlab.com/embeddable-common-lisp/ecl/-/commit/aa985f566fdedd45e2c74774d6e81f2442dd3802 as local root it works.
When I apply patch:
--- a/src/h/object.h
+++ b/src/h/object.h
@@ -398,6 +398,8 @@ struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
+ cl_object generic_test; /* generic test function */
+ cl_object generic_hash; /* generic hashing function */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
=> it crashes, but when I reserve twi time more space via hack:
--- a/src/c/alloc_2.d
+++ b/src/c/alloc_2.d
@@ -860,9 +860,9 @@ init_alloc(void)
init_tm(t_symbol, "SYMBOL", sizeof(struct ecl_symbol), 5);
init_tm(t_package, "PACKAGE", sizeof(struct ecl_package), -1); /* 36 */
#ifdef ECL_THREADS
- init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 3);
+ init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 3);
#else
- init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 4);
+ init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 4);
#endif
init_tm(t_array, "ARRAY", sizeof(struct ecl_array), 3);
init_tm(t_vector, "VECTOR", sizeof(struct ecl_vector), 2);
well.. it works again.
Have I missed something? Thus, when I change compiler, it masks the issue.
The issue is reproduced when I build GC from 8.2.4 release, 75e522cdbe009f103791e9cefe84f4f96f19c1b7 (release-8.2 branch) and 07a6d0ee8889bca5eaeadc13cabadc363725d216 (master branch).
And libatomic-7.6.6 with boehmgc-7.6.8 reproduces the issue.
And libatomic-7.4.4 with boehmgc-7.6.0 reproduces the issue.
From another hand this issue can't be reproduced by GCC-7.5.0 that means that it might be clang-only things.
From another hand sizeof(struct ecl_hashtable)
is 72
on both compilers: GCC-7.5.0 and Clang-7.1.0.
@ivmai do you agree that it seems like GC issue?
Do you have any suggestion to dig?
Probably another related issue on macOS: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/718
I was able to reproduce it with some probability on my laptop with macOS 12.
I've applied a patch on 8.2.4
:
--- darwin_stop_world.c
+++ darwin_stop_world.c
@@ -639,6 +639,11 @@ GC_INNER void GC_stop_world(void)
kern_result = thread_suspend(p -> mach_thread);
} while (kern_result == KERN_ABORTED);
GC_release_dirty_lock();
+ if ((((p) -> flags & FINISHED) != FINISHED)
+ && kern_result == KERN_TERMINATED)
+ continue;
+ if (kern_result == KERN_TERMINATED)
+ ABORT("thread_suspend failed: it was already termindated");
if (kern_result != KERN_SUCCESS)
ABORT("thread_suspend failed");
if (GC_on_thread_event)
and run tests of cl-njson
multiple times via ECL with that boehmgc. Let say 1-out-of-15 attempts fails as:
thread_suspend failed: it was already termindated
Well.. seems that it broke stack or some memory somehow.
A function thread_suspend
is looks like:
kern_return_t
thread_suspend(thread_t thread)
{
kern_return_t result = KERN_SUCCESS;
if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
return KERN_INVALID_ARGUMENT;
}
thread_mtx_lock(thread);
if (thread->active) {
if (thread->user_stop_count++ == 0) {
thread_hold(thread);
}
} else {
result = KERN_TERMINATED;
}
thread_mtx_unlock(thread);
if (thread != current_thread() && result == KERN_SUCCESS) {
thread_wait(thread, FALSE);
}
return result;
}
from here: https://github.com/apple-open-source/macos/blob/master/xnu/osfmk/kern/thread_act.c#L382-L408
and if I modify mine patch to:
@@ -639,8 +639,13 @@ GC_INNER void GC_stop_world(void)
kern_result = thread_suspend(p -> mach_thread);
} while (kern_result == KERN_ABORTED);
GC_release_dirty_lock();
- if (kern_result != KERN_SUCCESS)
+ if (kern_result != KERN_SUCCESS) {
+ fprintf(stderr, "KERN_SUCCESS: %d\n", KERN_SUCCESS);
+ fprintf(stderr, "KERN_TERMINATED: %d\n", KERN_TERMINATED);
+ fprintf(stderr, "KERN_INVALID_ARGUMENT: %d\n", KERN_INVALID_ARGUMENT);
+ fprintf(stderr, "kern_result: %d\n", kern_result);
ABORT("thread_suspend failed");
+ }
if (GC_on_thread_event)
GC_on_thread_event(GC_EVENT_THREAD_SUSPENDED,
(void *)(word)(p -> mach_thread));
and run it a lot of times (more than hundred) in the loop, it might fails as:
:info:build KERN_SUCCESS: 0
:info:build KERN_TERMINATED: 37
:info:build KERN_INVALID_ARGUMENT: 4
:info:build kern_result: 268435459
:info:build thread_suspend failed
It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.
Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.
@ivmai i have no idea how.
I run tests via ECL and it fails. Rarely.
@ivmai I've added one clean debug log line:
if (kern_result != KERN_SUCCESS) {
# ifdef DEBUG_THREADS
GC_log_printf("thread_suspend(%d) returns %d\n", p->stop_info.mach_thread, kern_result);
# endif
ABORT("thread_suspend failed");
}
and rebuild GC with -DDEBUG_THREADS
which leads to log:
:info:build Stopping the world from thread 0x303
:info:build thread_suspend(259) returns 4
:info:build thread_suspend failed
:info:build An error occurred during initialization:
:info:build Error code 6 when executing
:info:build (EXT:RUN-PROGRAM "/usr/bin/clang" Stopping the world from thread 0x103
:info:build World stopped from 0x103
:info:build Darwin: Stack for thread 0x103 is [0x7ff7b5a81e80,0x7ff7b5a8c000)
:info:build thread_get_state returns 0
:info:build Darwin: Stack for thread 0x2803 is [0x700003188d08,0x700003188f98)
259
is 0x103
and 4
is KERN_INVALID_ARGUMENT
which may occures only on condition:
if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
return KERN_INVALID_ARGUMENT;
}
It defently not a kernel task, and can't be THREAD_NULL
because it is used to stop the world after abort.
Something very wrong here.
It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.
Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.
And I've figured out it.
The root cause of issue that mach_thread_self()
may return different value after fork()
. Different means different that is stored inside GC_threads
.
GC_remove_all_threads_but_me()
updated that value but seems that it's doing so too late.
As result thread_suspend
returns KERN_INVALID_ARGUMENT
or 268435459
which is (ipc/send) invalid destination port
by mach_error_string
.
As a naive approach I've tried to move GC_remove_all_threads_but_me()
inside fork_child_proc()
to the first line, but it doesn't help.
The proof. I've used one commit ahead of 8.2.4
: https://github.com/catap/bdwgc/commit/a7e17f92ff5aac44d457ad2b7f28361232445c2a which adds getpid()
to the log.
If may fail on test case with output like:
[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b6401280,0x7ff7b640c000)
Starting mark helper for mark number 566
Starting mark helper for mark number 566
World starting
Resuming thread 0xb03 with state 3
World started
Total stacks size: 56464
;;; Note:
;;; Invoking external command:
;;; clang -I. -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/ -DECL_API -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/c -DDEBUG_THREADS -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/src/c -c eclinit3Cg6rN.c -o eclinit3Cg6rN.o
;;; Note:
;;; Invoking external command:
[14430] Stopping the world from thread 0x303
[14430] thread_suspend(0xb03) returns (ipc/send) invalid destination port
thread_suspend failed
[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b63f0a00,0x7ff7b640c000)
Starting mark helper for mark number 567
Starting mark helper for mark number 567
World starting
Resuming thread 0xb03 with state 3
World started
and 0x303
it is the first and the only appears of 0x303
; thus no update thread id
which means GC_remove_all_threads_but_me()
hasn't been called or hasn't updated the reference (yet?)
Was GC_remove_all_threads_but_me called?
@ivmai I've added a log line into GC_remove_all_threads_but_me
and can't find it in output => seems that it hadn't been called.
Which is very strange.
Yes, it is happened after fork. I was able to recover a stack trace:
0 libsystem_kernel.dylib 0x7ff81e0a7ffe __pthread_kill + 10
1 libsystem_pthread.dylib 0x7ff81e0de1ff pthread_kill + 263
2 libsystem_c.dylib 0x7ff81e029d24 abort + 123
3 libgc.1.dylib 0x10a5c4bc3 GC_stop_world.cold.1 + 43
4 libgc.1.dylib 0x10a5c340f GC_stop_world + 276
5 libgc.1.dylib 0x10a5b3010 GC_stopped_mark + 70
6 libgc.1.dylib 0x10a5b2ecd GC_try_to_collect_inner + 311
7 libgc.1.dylib 0x10a5b40fe GC_collect_or_expand + 189
8 libgc.1.dylib 0x10a5b4380 GC_allocobj + 265
9 libgc.1.dylib 0x10a5b8baa GC_generic_malloc_inner + 253
10 libgc.1.dylib 0x10a5b8e3a GC_generic_malloc + 83
11 libecl.23.9.9.dylib 0x10a9eb821 ecl_alloc_atomic + 32
12 libecl.23.9.9.dylib 0x10a99db75 si_make_sequence_output_stream + 333
13 libecl.23.9.9.dylib 0x10a9aeb23 si_string_to_octets + 499
14 libecl.23.9.9.dylib 0x10a9e35dd si_spawn_subprocess + 577
15 libecl.23.9.9.dylib 0x10a96928e si_run_program + 1744
inside ECL the code which leads to an issue looks like:
child_pid = fork();
if (child_pid == 0) {
/* Child */
int j;
cl_object p;
char **argv_ptr = ecl_alloc((ecl_length(argv) + 1) * sizeof(char*));
for (p = argv, j = 0; p != ECL_NIL; p = ECL_CONS_CDR(p)) {
cl_object arg = si_string_to_octets(3, ECL_CONS_CAR(p), @':null-terminate', ECL_T);
argv_ptr[j++] = (char*)arg->base_string.self;
}
argv_ptr[j] = NULL;
Status update:
Okay, more investigation is needed for the original issue.
@ivmai yes, but I'm out of idea how to proceed
@ivmai what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self()
I feel that it is common and very difficult to track issue.
For example scala-native
uses that code and instead of replicate ECL's changes in scala-native and many other software I feel that it simple should be enabled by default.
what do you think to enable "handle fork" by default for macOS? Base on behavior of mach_thread_self() I feel that it is common and very difficult to track issue.
Yes, agree, but after implementing #103
As of now: # The incremental mode conflicts with fork handling on Darwin.
what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.
Implemented by commit https://github.com/ivmai/bdwgc/commit/ba2861e70ccbc8ecd4d7611cf5433d30e76f68a1 @catap, please test it
As I understand the original issue is not resolve, so I'm leaving it open.
As I understand the original issue is not resolve, so I'm leaving it open.
What remains to be done to fix it?
As I understand the original issue is not resolve, so I'm leaving it open.
What remains to be done to fix it?
It still fails as it was described here: https://github.com/ivmai/bdwgc/issues/569#issuecomment-1732285221
Is there any fix proposed?
The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads. GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.
This is not fixed, right?
@ivmai let me summarize everything.
Here I've mixed up two issues.
(2) can be reproduced only by clang on i386 on macOS. I can't reproduce it by gcc.
I not sure that it is BoehmGC issues, but ECL's maintainer guess that it is. See: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/705
Long story short: an attempt to build ECL on macOS 10.5 i386 fails as:
If I change compiler from Clang-7 (default at MacPorts) to GCC-7 => it works.
I've run some tests and discovers that Clang-5+ leads to this result. But clang-3.7 works fine.