Open ethanc8 opened 7 months ago
@gcasa Is this the right way to file this, or should I move this to a different repo or to discuss-gnustep?
Do you have a reduced test case (ideally linking nothing but the runtime)?
It looks as if it may be trying to send a message in +load to a class that is not yet loaded.
I think I might have forgotten to link something. I don't have a minimal test case -- as I said, I have no idea where this came from.
I looked and it seems like I have linked everything.
> Do you have a reduced test case (ideally linking nothing but the runtime)?
> It looks as if it may be trying to send a message in +load to a class that is not yet loaded.
Is the order of the +load messages defined? It doesn't occur in the libs-opal testcases, so if this is the reason for the segfault, then the load message order is different.
@davidchisnall I believe this is the problem: +[CGImageDestinationTIFF load] calls [CGImageDestination registerDestinationClass: self]. I believe this might be UB, so should this call be moved to +initialize?
+ (void)load
{
    [CGImageDestination registerDestinationClass: self];
}
I changed those to +initialize and it now segfaults in objc_msgSend_fpret in a call to +[NSString load].
Sending a message to a superclass in +load should be fine. Sending one to any other class that’s defined in the same library should also be fine.
I wonder if somehow the selector for the message being sent is not resolved. That could result in an out-of-bounds access in the uninitialised dtable.
I have changed +load to +initialize and disabled UTILoad(), which results in:
Unknown protocol version
Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737260606656) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737260606656) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737260606656) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737260606656, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff6c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff6c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007ffff79ccc77 in init_protocols () from /usr/GNUstep/Local/Library/Libraries/libobjc.so.4.6
#6 0x00007ffff79ccafa in objc_init_protocols () from /usr/GNUstep/Local/Library/Libraries/libobjc.so.4.6
#7 0x00007ffff79c648d in objc_load_class () from /usr/GNUstep/Local/Library/Libraries/libobjc.so.4.6
#8 0x00007ffff79cc201 in __objc_load () from /usr/GNUstep/Local/Library/Libraries/libobjc.so.4.6
#9 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd398, env=env@entry=0x7fffffffd3a8)
at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffd3a8, argv=0x7fffffffd398, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffd398, env=0x7fffffffd3a8) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffd88b in ?? ()
#15 0x0000000000000000 in ?? ()
@davidchisnall Might this be related? The abort appears in pthreads in objc_init_protocols().
Does Apple objc4 work on Linux? If it does, it might be possible to test on objc4 to see if the issue is in libobjc2.
@davidchisnall How can I compile libobjc2 with debug symbols included?
I passed -DCMAKE_BUILD_TYPE=Debug to CMake.
I'm kinda busy, but here's my last GDB session if anyone wants to take a look:
Program stopped.
0x00007ffff7fe3290 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) b protocol.c:224
No source file named protocol.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (protocol.c:224) pending.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/ethan/Projects/GNUstep/Porting/GitUp/Examples/GitY/GitY.app/GitY
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, init_protocols (protocols=0x7ffff5ecb020 <objc_protocol_list>) at /home/ethan/Projects/GNUstep/plaurent2/GNUstep-build/libobjc2/protocol.c:224
224 fprintf(stderr, "Unknown protocol version");
(gdb) p version
$1 = 0
(gdb) p aProto
$2 = (struct objc_protocol *) 0x7ffff5ed4a50 <._OBJC_PROTOCOL_CAAction>
(gdb) p *aProto
$3 = {isa = 0x0, name = 0x7ffff5ebdfa8 "CAAction", protocol_list = 0x7ffff5ed4b50 <objc_protocol_list>,
instance_methods = 0x7ffff5ecad40 <objc_protocol_method_list>, class_methods = 0x7ffff5ecad60 <objc_protocol_method_list>,
optional_instance_methods = 0x7ffff5ecad58 <objc_protocol_method_list>, optional_class_methods = 0x7ffff5ecad68 <objc_protocol_method_list>,
properties = 0x0, optional_properties = 0x0, class_properties = 0x0, optional_class_properties = 0x0}
(gdb)
@davidchisnall The issue appears to be that the isa of CAAction is 0x0, which is not a valid protocol version. I don't know why it ended up this way.
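To make the failure mode concrete, here is a minimal C model of the check that the session above seems to be tripping. It is a sketch, not the libobjc2 source: the struct, the function name protocol_version_is_known, and the set of accepted values are assumptions, based only on what this thread shows (the version is read from isa, 0x4 is processed, 0x0 aborts).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the runtime's protocol structure.
 * Only the first two fields from the gdb dump are modelled. */
struct fake_protocol {
    void       *isa;   /* as emitted by the compiler: a small integer tag */
    const char *name;
};

/* Assumption: 0x4 is the only tag treated as known in this sketch, because
 * it is the only value seen to work in this thread. The real runtime may
 * accept other values as well. */
enum { FAKE_PROTOCOL_VERSION_OK = 4 };

/* Returns true if the tag stored in isa is one this sketch recognises. */
bool protocol_version_is_known(const struct fake_protocol *p)
{
    uintptr_t version = (uintptr_t)p->isa;
    return version == FAKE_PROTOCOL_VERSION_OK;
}
```

In this model, a protocol record whose first field is zero is rejected in the same way as the CAAction record in the dump, while one tagged 0x4 passes.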
What is the -fobjc-runtime= flag that you’re passing to clang?
@davidchisnall gnustep-config says that I'm using -fobjc-runtime=gnustep-2.1:
$ gnustep-config --gui-libs
-fuse-ld=/usr/bin/ld.gold -L/usr/local/lib -pthread -fexceptions -rdynamic -fobjc-runtime=gnustep-2.1 -fblocks -L/home/ethan/GNUstep/Library/Libraries -L/usr/GNUstep/Local/Library/Libraries -L/usr/GNUstep/System/Library/Libraries -lgnustep-gui -lgnustep-base -lpthread -lobjc -lm
I can try switching to the newly released v2.2.1
I'm building like:
export CC=clang-14
export CXX=clang++-14
export CXXFLAGS="-std=c++11"
export RUNTIME_VERSION=gnustep-2.1
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
export LD=/usr/bin/ld.gold
export LDFLAGS="-fuse-ld=/usr/bin/ld.gold -L/usr/local/lib"
rm -Rf build
mkdir build && cd build
cmake ../ \
    -DCMAKE_C_COMPILER=${CC} \
    -DCMAKE_CXX_COMPILER=${CXX} \
    -DCMAKE_ASM_COMPILER=${CC} \
    -DCMAKE_LINKER=${LD} \
    -DUSE_GOLD_LINKER=YES \
    -DCMAKE_MODULE_LINKER_FLAGS="${LDFLAGS}" \
    -DTESTS=OFF \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=1
cmake --build .
sudo -E make install
sudo ldconfig
The issue still appears in v2.2.1.
clang 14 is pretty old, but should be fine. There are a few fixes for the v2 ABI after that, but I think they matter only on Windows.
I wonder if the +load is somehow being called before we've loaded the Protocol class. The Protocol class should be linked into the runtime and the runtime's own constructors should be called before anything that links to it, but it's possible that this isn't happening on your platform. That shouldn't be the case, because if init_protocol_classes returns NO then we don't reach the line that's failing for you.
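For readers unfamiliar with the mechanism being discussed: code marked as a constructor runs before main, and on ELF platforms the dynamic loader normally runs a library's constructors before those of the objects that depend on it. The single-file C sketch below only illustrates the first half of that (a constructor running before ordinary code). The names are hypothetical and it does not reproduce libobjc2's loader.

```c
#include <stdbool.h>

/* Hypothetical flag standing in for "the runtime has set up its own classes". */
static bool runtime_ready = false;

/* GCC and Clang run functions with this attribute before main(). */
__attribute__((constructor))
static void fake_runtime_init(void)
{
    runtime_ready = true;
}

/* Code that runs later can check that initialisation has already happened. */
bool fake_runtime_is_ready(void)
{
    return runtime_ready;
}
```

If a dependent library's initialiser somehow ran first, a check like fake_runtime_is_ready() would return false there, which is the kind of ordering problem being speculated about above.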
@davidchisnall Is there anything that I should run in gdb to look into your suspicions? Also, is there anyone else we should get to look at this? Should I post this to discuss-gnustep?
> I wonder if the +load is somehow being called before we've loaded the Protocol class.
I think this issue is unrelated to the +load issue. Here's the backtrace from protocol.c:224:
#0 init_protocols (protocols=0x7ffff5ecb020 <objc_protocol_list>) at /home/ethan/Projects/GNUstep/plaurent2/GNUstep-build/libobjc2/protocol.c:224
#1 0x00007ffff79ccafa in objc_init_protocols (protocols=0x7ffff726c238 <objc_protocol_list>)
at /home/ethan/Projects/GNUstep/plaurent2/GNUstep-build/libobjc2/protocol.c:271
#2 0x00007ffff79c648d in objc_load_class (class=0x7ffff726c300 <._OBJC_CLASS_GCGitObject>)
at /home/ethan/Projects/GNUstep/plaurent2/GNUstep-build/libobjc2/class_table.c:465
#3 0x00007ffff79cc201 in __objc_load (init=0x7ffff7266c58 <objc_init>) at /home/ethan/Projects/GNUstep/plaurent2/GNUstep-build/libobjc2/loader.c:268
#4 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd398, env=env@entry=0x7fffffffd3a8)
at ./elf/dl-init.c:70
#5 0x00007ffff7fc9568 in call_init (env=0x7fffffffd3a8, argv=0x7fffffffd398, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#6 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffd398, env=0x7fffffffd3a8) at ./elf/dl-init.c:117
#7 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#8 0x0000000000000001 in ?? ()
#9 0x00007fffffffd88b in ?? ()
#10 0x0000000000000000 in ?? ()
It runs into NSCopying (isa 0x4) and NSCoding (isa 0x4) with no issues before reaching CAAction (isa 0x0).
@davidchisnall Are you available to look at this more or not? Should I post this to discuss-gnustep?
I’m completely confused by this. It looks as if your binary contains things that shouldn’t be possible.
I have tested again with a new installation of GNUstep on the same computer with clang-18, which results in the same segfault in SparseArrayLookup. I'll go apply my patches to Opal and Boron and see what happens.
I should also try to do this on a clean Debian system, but I currently don't have the disk space to do that.
I get "Unknown protocol versionaborted" error again, with my Opal and Boron patches applied on clang-18.
By the way, for most repositories that I haven't patched I'm on the latest commit to master as of May 31. I built it with tools-scripts this time around.
Is there something unusual about CAAction's protocol definition? (Is it missing a definition?)
I can't usefully help debug this without a test case that doesn't depend on a big pile of other libraries.
I hadn't noticed previously that you were using gold. Does the same thing happen if you use lld?
> Is there something unusual about CAAction's protocol definition? (Is it missing a definition?)
No, it looks perfectly normal.
@protocol CAAction
@required
- (void)runActionForKey:(NSString *)key object:(id)anObject arguments:(NSDictionary *)dict;
@end
> I can't usefully help debug this without a test case that doesn't depend on a big pile of other libraries.
I'll try to make a test case that only depends on QuartzCore (which contains CAAction) and its dependencies.
> I hadn't noticed previously that you were using gold. Does the same thing happen if you use lld?
I'll try to use lld.
Using ld.lld-18, I get the following during make of the example binary:
/usr/bin/ld: /usr/GNUstep/Local/Library/Libraries/libGitUpKit.so: undefined reference to `CGContextStrokeEllipseInRect'
/usr/bin/ld: /usr/GNUstep/Local/Library/Libraries/libGitUpKit.so: undefined reference to `CGContextFillEllipseInRect'
This might be related to the issue, but I'm a bit concerned that these error messages come from ld instead of ld.lld-18. Is that normal?
Fixing that issue with
void CGContextFillEllipseInRect(CGContextRef ctx, CGRect rect)
{
    NSWarnLog(@"CGContextFillEllipseInRect is not implemented");
}
void CGContextStrokeEllipseInRect(CGContextRef ctx, CGRect rect)
{
    NSWarnLog(@"CGContextStrokeEllipseInRect is not implemented");
}
gives me Unknown protocol versionaborted again.
The issue may be that QuartzCore doesn't seem to link to the libraries it uses, including Opal and OpenGL. Here's a smaller example I tried to make, but it doesn't run into the CAAction 0x0 issue: https://github.com/ethanc8/CAActionPlayground
(for my reference)
export CC=clang-18
export CXX=clang++-18
export CXXFLAGS="-std=c++11"
export RUNTIME_VERSION=gnustep-2.1
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
export LD=/usr/bin/ld.lld-18
export LDFLAGS="-fuse-ld=/usr/bin/ld.lld-18 -L/usr/local/lib"
rm -Rf build
mkdir build && cd build
cmake ../ \
    -DCMAKE_C_COMPILER=${CC} \
    -DCMAKE_CXX_COMPILER=${CXX} \
    -DCMAKE_ASM_COMPILER=${CC} \
    -DCMAKE_LINKER=${LD} \
    -DUSE_GOLD_LINKER=YES \
    -DCMAKE_MODULE_LINKER_FLAGS="${LDFLAGS}" \
    -DTESTS=OFF \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=1
cmake --build .
sudo -E make install
sudo ldconfig
Huh, init_protocols never finds CAAction.
Ok, by adding a category that implements CAAction, I got init_protocols to process it:
(gdb)
$3 = {isa = 0x4, name = 0x55555555705a "CAAction", protocol_list = 0x555555559500 <objc_protocol_list>,
instance_methods = 0x5555555592c0 <objc_protocol_method_list>, class_methods = 0x5555555592e0 <objc_protocol_method_list>,
optional_instance_methods = 0x5555555592d8 <objc_protocol_method_list>, optional_class_methods = 0x5555555592e8 <objc_protocol_method_list>, properties = 0x0,
optional_properties = 0x0, class_properties = 0x0, optional_class_properties = 0x0}
Now the isa is 0x4.
With only a class implementing CAAction, however, init_protocols never processes CAAction.
Compare the GitY example:
(gdb)
$3 = {isa = 0x0, name = 0x7ffff6a9e466 "CAAction", protocol_list = 0x7ffff6ab3b60 <objc_protocol_list>,
instance_methods = 0x7ffff6aa9d50 <objc_protocol_method_list>, class_methods = 0x7ffff6aa9d70 <objc_protocol_method_list>,
optional_instance_methods = 0x7ffff6aa9d68 <objc_protocol_method_list>, optional_class_methods = 0x7ffff6aa9d78 <objc_protocol_method_list>, properties = 0x0,
optional_properties = 0x0, class_properties = 0x0, optional_class_properties = 0x0}
As I recall, protocols must be adopted by at least one class or category to exist. I’m a bit surprised CAAction isn’t used by anything in Opal though.
There’s probably a compiler bug that is causing a non-adopted protocol to be emitted, but I’m not sure how it’s generated but not in the correct section. Can you send me the preprocessed source for the compilation unit that generates the .o file that contains the symbol for this protocol?
> There’s probably a compiler bug that is causing a non-adopted protocol to be emitted, but I’m not sure how it’s generated but not in the correct section. Can you send me the preprocessed source for the compilation unit that generates the .o file that contains the symbol for this protocol?
Do you know how to find the associated compilation unit?
nm will dump the symbols for each .o file; you should be able to find the CAAction protocol in that list for one file. Once you know that, find the .m file that generated it and add -E and -o {some temp file} to the compile command.
Ok, thanks.
I found 0000000000000000 D ._OBJC_PROTOCOL_CAAction in CAAnimation.m.o.
I'll attach the output of -E at https://gist.github.com/ethanc8/46be627ad36326f9b32843fe6409d77d. I wanted to also try -rewrite-objc -fno-ms-extensions -fpermissive, but it looks like libs-quartzcore has a lot of obviously incorrect code (like assigning BOOLs to variables of type id) that only works in C, and not in C++. -rewrite-legacy-objc -fno-ms-extensions -fpermissive caused a compiler crash (which I reported at https://github.com/llvm/llvm-project/issues/94380).
(for myself)
for file in *.o; do echo $file; nm $file; done | grep CAAction
This is especially perplexing because nothing in libs-quartzcore implements the protocol CAAction; there are just methods that take and return CAAction values (returned values are ones that the user previously provided to libs-quartzcore, so libs-quartzcore is not producing any id<CAAction> itself). Maybe CAAnimation.m.o gets ._OBJC_PROTOCOL_CAAction because it's the first file to be compiled that imports CAAction.h?
Does something in that file do @protocol(CAAction)? Can you attach the preprocessed source for CAAnimation.m and I'll take a look.
Here is the preprocessed source (I posted it above, but might not have been clear): https://gist.github.com/ethanc8/46be627ad36326f9b32843fe6409d77d
The word CAAction appears nowhere in the un-preprocessed source.
Wait, I didn't see this:
@interface CAAnimation : NSObject <NSCoding, NSCopying, CAAction, CAMediaTiming>
Yup, looks like the protocol is adopted, so it is correct for it to be emitted in this file. But for some reason it's not initialised correctly. Looks like a compiler bug. If you can create a reduced test case, that would be helpful (delete bits of the file, compile with -S -emit-llvm, and see if the resulting file has 0 for the first field of _OBJC_PROTOCOL_CAAction); otherwise I'll try to look at it later in the week.
Ok, thanks. I'm not sure whether it's a compiler bug, since it seems to be initialised correctly for my test case which creates classes and categories implementing CAAction and instantiates CAAnimation.
Hello 👋, I have encountered a segfault in SparseArrayLookup during the initialization of my port of the application GitY. Here is the backtrace from GDB:
Interestingly, I have not encountered this when running any of the libs-opal tests, such as images. I have confirmed with GDB that images calls +[CGImageDestinationTIFF load] without segfaulting. Therefore, I don't know which component this came from.
The Objective-C components that are linked to by GitY are:
It's possible that one of those has messed something up in their +load methods, but I'm not exactly sure. If needed, I can send you binaries of my GNUstep installation, compile libobjc2 with different flags, peek around in GDB, or whatever else would be needed.