Closed braydonk closed 1 year ago
I tried building the binary in my reproduction repo with go1.20.8 and go1.21.1, and then ran nm to check the symbols in each binary.
Go 1.20:
braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
0000000000483740 T _cgo_59b4640d347f_Cfunc_get42
U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051a1c8 d main._cgo_59b4640d347f_Cfunc_get42
Go 1.21:
braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047ce70 T _cgo_59b4640d347f_Cfunc_get42
000000000047ccc0 t main._Cfunc_get42.abi0
00000000005191a8 d main._cgo_59b4640d347f_Cfunc_get42
So in this case it didn't even show up as an undefined symbol in Go 1.21.1. I think this would explain why it panics in Go 1.21; in Go 1.20 the symbol is there as an undefined symbol, which is why we get a symbol lookup error in Go 1.20 instead of a panic, and (I think; this is all new to me) why we can find the symbol after dlopen.
Tried it in an Ubuntu 20.04 VM (ld version 2.34) with Go 1.21, and got the same result as compiling with Go 1.20 on ld version 2.41:
braydonk@focal-test:~/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000048a820 T _cgo_b06122c1f854_Cfunc_get42
U get42
0000000000489fc0 t main._Cfunc_get42.abi0
00000000005321e8 d main._cgo_b06122c1f854_Cfunc_get42
I'm trying to rule out ld by trying out a different linker. Using Go 1.21 with the following build command:
go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,--unresolved-symbols=ignore-in-object-files"' .
Checking with nm, the unresolved symbol is there:
braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047bff0 T _cgo_59b4640d347f_Cfunc_get42
U get42
000000000047be40 t main._Cfunc_get42.abi0
00000000005181a8 d main._cgo_59b4640d347f_Cfunc_get42
However, running the binary still results in a panic. So I guess the difference in symbols in the executable isn't the root cause here.
Tried another trick borrowed from the earlier linked go-nvml issue discussion, using the --weak-unresolved-symbols flag for gold. This resulted in a panic as well.
(This was kind of silly and a red herring, because in this case, with a weak symbol where I don't load the library, the panic probably makes sense. It also panics in Go 1.20.)
Actually, this works. In the last comment, I was testing by just trying to call the symbol and not loading the library. However, with gold and --weak-unresolved-symbols and running dlopen, it works with Go 1.21.
In my reproduction repo, I ran the cgo command directly on main.go (with any references to the other file commented out) and all the generated output between go1.21.1 and go1.20.8 was identical (at least according to my attempts to diff the two generated _obj folders with meld).
I debugged two binaries built with go1.20.8 and go1.21.1 respectively, using the reproduction repo but commenting out the part where the dynamic library is loaded. This is a run where the expected output would be a symbol lookup error.
In the generated CGO output, the generated C function _cgo_1dc841591e27_Cfunc_get42:
CGO_NO_SANITIZE_THREAD
void
_cgo_1dc841591e27_Cfunc_get42(void *v)
{
	struct {
		int r;
		char __pad4[4];
	} __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
	char *_cgo_stktop = _cgo_topofstack();
	__typeof__(_cgo_a->r) _cgo_r;
	_cgo_tsan_acquire();
	_cgo_r = get42();
	_cgo_tsan_release();
	_cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
	_cgo_a->r = _cgo_r;
	_cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}
At the line _cgo_r = get42(), in go1.21.1, the program segfaults. Here's the GDB output with a few steps of context:
_cgo_topofstack () at /usr/local/go/src/runtime/asm_amd64.s:1645
1645 RET
(gdb) info registers
rax 0xc000038800 824633952256
rbx 0xc0000386f8 824633951992
rcx 0xc0000386f8 824633951992
rdx 0xc000038688 824633951880
rsi 0x533100 5452032
rdi 0xc0000386f8 824633951992
rbp 0xc000038688 0xc000038688
rsp 0x7fffffffe298 0x7fffffffe298
r8 0x5334e0 5453024
r9 0x0 0
r10 0x410 1040
r11 0xffffffffffffffff -1
r12 0x100 256
r13 0x6a 106
r14 0xc0000061a0 824633745824
r15 0x4 4
rip 0x45ed98 0x45ed98 <_cgo_topofstack+24>
eflags 0x216 [ PF AF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc0000386f8) at /tmp/go-build/cgo-gcc-prolog:52
52 /tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) info registers
rax 0xc000038800 824633952256
rbx 0xc0000386f8 824633951992
rcx 0xc0000386f8 824633951992
rdx 0xc000038688 824633951880
rsi 0x533100 5452032
rdi 0xc0000386f8 824633951992
rbp 0xc000038688 0xc000038688
rsp 0x7fffffffe2a0 0x7fffffffe2a0
r8 0x5334e0 5453024
r9 0x0 0
r10 0x410 1040
r11 0xffffffffffffffff -1
r12 0xc000038800 824633952256
r13 0x6a 106
r14 0xc0000061a0 824633745824
r15 0x4 4
rip 0x485a03 0x485a03 <_cgo_effbaea66e62_Cfunc_get42+19>
eflags 0x216 [ PF AF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) step
Thread 1 "cgo_dl_repro" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) info registers
rax 0x0 0
rbx 0xc0000386f8 824633951992
rcx 0xc0000386f8 824633951992
rdx 0xc000038688 824633951880
rsi 0x533100 5452032
rdi 0xc0000386f8 824633951992
rbp 0xc000038688 0xc000038688
rsp 0x7fffffffe298 0x7fffffffe298
r8 0x5334e0 5453024
r9 0x0 0
r10 0x410 1040
r11 0xffffffffffffffff -1
r12 0xc000038800 824633952256
r13 0x6a 106
r14 0xc0000061a0 824633745824
r15 0x4 4
rip 0x0 0x0
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb)
In go1.20.8, when that same line in the generated CGO is reached, it moves on to dl_signal_exception. GDB output starting from the same spot as the above example:
_cgo_topofstack () at /home/braydonk/go_versions/go1.20.8/go/src/runtime/asm_amd64.s:1593
1593 RET
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc00004bf38) at /tmp/go-build/cgo-gcc-prolog:52
52 /tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) step
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3287
3287 ./malloc/malloc.c: No such file or directory.
(gdb) step
3294 in ./malloc/malloc.c
(gdb) step
3299 in ./malloc/malloc.c
(gdb) step
checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
1343 in ./malloc/malloc.c
(gdb) finish
Run till exit from #0 checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
3299 in ./malloc/malloc.c
(gdb) finish
Run till exit from #0 __GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
0x00007ffff7fc7cca in malloc (size=69) at ../include/rtld-malloc.h:56
56 ../include/rtld-malloc.h: No such file or directory.
Value returned is $2 = (void *) 0x566820
(gdb) step
__GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50, occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:91
91 ./elf/dl-error-skeleton.c: No such file or directory.
(gdb) step
92 in ./elf/dl-error-skeleton.c
(gdb) step
93 in ./elf/dl-error-skeleton.c
(gdb) step
102 in ./elf/dl-error-skeleton.c
(gdb) finish
warning: Function __GI__dl_signal_exception does not return normally.
Try to finish anyway? (y or n) y
Run till exit from #0 __GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50,
occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:102
/home/braydonk/cgo_dl_repro_120/cgo_dl_repro: symbol lookup error: /home/braydonk/cgo_dl_repro_120/cgo_dl_repro: undefined symbol: get42
[Thread 0x7fffcf9e8640 (LWP 25416) exited]
[Thread 0x7fffd01e9640 (LWP 25415) exited]
[Thread 0x7fffd09ea640 (LWP 25414) exited]
[Thread 0x7fffcf1a7640 (LWP 25417) exited]
[Inferior 1 (process 25413) exited with code 0177]
(gdb)
Added a new experiment in the reproduction repo where I wrote a small C program that attempts to get symbol resolution the same way that worked in go1.20.8: ignore unresolved symbols in object files, call dlopen, and expect a function call to work.
When compiled with gcc 9.4.0 (Focal) and gcc 11.4.0 (Jammy), the program segfaulted at the function call.
When compiled with gcc version 13.2.0 (Rolling Debian Testing), the program produced the error ./main: error while loading shared libraries: unexpected PLT reloc type 0x00.
I'm sure CGO's version of "call this function from the header" is different from C's version of "call this function from the header", although when I look at the cgo generation it does look like it just calls the function in much the same way.
Admittedly it does seem off to me that a dlopen earlier in the program would cause a previously unresolved symbol to just work; usually with dlopen you call into stuff from the dynamic library through a dlsym lookup. I figured there must be some magic dlopen does at runtime that I wasn't aware of.
Either way, it is very strange that go1.20.8 does not crash in this scenario on any system I've tested, that go1.21.1 only worked on my Focal test system even though the cgo generation is the same, and that when I do the same thing manually in C it segfaults on all systems.
@golang/compiler
On my Rolling Debian Testing machine, I ran go tool objdump on a binary built with go1.20.8 and a binary built with go1.21.1 (the go1.20.8 dump is first, the go1.21.1 dump second). This is without the dlopen, just trying to call the unknown symbol.
TEXT _cgo_49665a31f432_Cfunc_get42(SB)
:0 0x483740 4154 PUSHQ R12
:0 0x483742 55 PUSHQ BP
:0 0x483743 53 PUSHQ BX
:0 0x483744 4889fb MOVQ DI, BX
:0 0x483747 e894c1fdff CALL _cgo_topofstack(SB)
:0 0x48374c 4989c4 MOVQ AX, R12
:0 0x48374f 31c0 XORL AX, AX
:0 0x483751 e85ae9f7ff CALL 0x4020b0
:0 0x483756 89c5 MOVL AX, BP
:0 0x483758 e883c1fdff CALL _cgo_topofstack(SB)
:0 0x48375d 4c29e0 SUBQ R12, AX
:0 0x483760 892c03 MOVL BP, 0(BX)(AX*1)
:0 0x483763 5b POPQ BX
:0 0x483764 5d POPQ BP
:0 0x483765 415c POPQ R12
:0 0x483767 c3 RET
TEXT _cgo_49665a31f432_Cfunc_get42(SB)
:0 0x47ce70 4154 PUSHQ R12
:0 0x47ce72 55 PUSHQ BP
:0 0x47ce73 53 PUSHQ BX
:0 0x47ce74 4889fb MOVQ DI, BX
:0 0x47ce77 e88416feff CALL _cgo_topofstack(SB)
:0 0x47ce7c 4989c4 MOVQ AX, R12
:0 0x47ce7f 31c0 XORL AX, AX
:0 0x47ce81 e87a31b8ff CALL 0x0
:0 0x47ce86 89c5 MOVL AX, BP
:0 0x47ce88 e87316feff CALL _cgo_topofstack(SB)
:0 0x47ce8d 4c29e0 SUBQ R12, AX
:0 0x47ce90 892c03 MOVL BP, 0(BX)(AX*1)
:0 0x47ce93 5b POPQ BX
:0 0x47ce94 5d POPQ BP
:0 0x47ce95 415c POPQ R12
:0 0x47ce97 c3 RET
In the Go 1.21.1 dump, the generated cgo binding contains a CALL 0x0 at address 0x47ce81, which is the instruction that calls get42 from the generated cgo function I showed in https://github.com/golang/go/issues/63264#issuecomment-1738418035. In the Go 1.20.8 compilation this is not CALL 0x0, but CALL 0x4020b0 instead (at address 0x483751). Not sure what that might be referring to.
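The two call targets can be re-derived from the raw bytes in the dumps. This is a small sketch, assuming the standard x86-64 encoding of a near call (opcode E8 followed by a little-endian signed 32-bit displacement relative to the next instruction); callTarget is a name made up for this example, and the addresses and bytes are copied from the objdump output above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// callTarget decodes a 5-byte near call located at addr and returns the
// absolute address it transfers control to: next instruction + displacement.
func callTarget(addr uint64, insn []byte) (uint64, error) {
	if len(insn) != 5 || insn[0] != 0xe8 {
		return 0, fmt.Errorf("not a 5-byte E8 call: % x", insn)
	}
	rel := int32(binary.LittleEndian.Uint32(insn[1:]))
	next := addr + 5
	return uint64(int64(next) + int64(rel)), nil
}

func main() {
	cases := []struct {
		build string
		addr  uint64
		insn  []byte
	}{
		{"go1.20.8", 0x483751, []byte{0xe8, 0x5a, 0xe9, 0xf7, 0xff}},
		{"go1.21.1", 0x47ce81, []byte{0xe8, 0x7a, 0x31, 0xb8, 0xff}},
	}
	for _, c := range cases {
		target, err := callTarget(c.addr, c.insn)
		if err != nil {
			fmt.Println(c.build, "error:", err)
			continue
		}
		fmt.Printf("%s: CALL %#x\n", c.build, target)
	}
}
```

In the go1.21.1 bytes the displacement is exactly the negated address of the following instruction (0x47ce86), so the call lands on absolute address 0. That is consistent with get42 having been resolved to the value 0 at link time, rather than the disassembler failing to name the target.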
Could 0x402000 be the PLT from the program header?
The 0x402000 is a PT_LOAD in the built exe header, using a build of go from master that I just downloaded:
braydonk@braydonk:~/Git/cgo_dl_repro$ readelf --segments with_dev_go
Elf file type is EXEC (Executable file)
Entry point 0x402330
There are 14 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x0000000000000310 0x0000000000000310 R 0x8
INTERP 0x0000000000000350 0x0000000000400350 0x0000000000400350
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x0000000000001158 0x0000000000001158 R 0x1000
LOAD 0x0000000000002000 0x0000000000402000 0x0000000000402000
0x000000000007b74d 0x000000000007b74d R E 0x1000
LOAD 0x000000000007e000 0x000000000047e000 0x000000000047e000
0x000000000009bfc0 0x000000000009bfc0 R 0x1000
LOAD 0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
0x0000000000009ab0 0x000000000003b940 RW 0x1000
DYNAMIC 0x000000000011ae00 0x000000000051ae00 0x000000000051ae00
0x00000000000001d0 0x00000000000001d0 RW 0x8
NOTE 0x0000000000000370 0x0000000000400370 0x0000000000400370
0x0000000000000020 0x0000000000000020 R 0x8
NOTE 0x0000000000000390 0x0000000000400390 0x0000000000400390
0x00000000000000a8 0x00000000000000a8 R 0x4
TLS 0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
0x0000000000000000 0x0000000000000008 R 0x8
GNU_PROPERTY 0x0000000000000370 0x0000000000400370 0x0000000000400370
0x0000000000000020 0x0000000000000020 R 0x8
GNU_EH_FRAME 0x0000000000119880 0x0000000000519880 0x0000000000519880
0x0000000000000154 0x0000000000000154 R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
0x0000000000000210 0x0000000000000210 R 0x1
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .note.go.buildid .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .text .fini
04 .rodata .typelink .itablink .gopclntab .eh_frame_hdr .eh_frame
05 .init_array .fini_array .dynamic .got .got.plt .data .go.buildinfo .noptrdata .bss .noptrbss
06 .dynamic
07 .note.gnu.property
08 .note.gnu.build-id .note.ABI-tag .note.go.buildid
09 .tbss
10 .note.gnu.property
11 .eh_frame_hdr
12
13 .init_array .fini_array .dynamic .got
I do think it's the PLT. So I guess perhaps in the PLT itself this unresolved symbol doesn't have an entry like I expect it would, and perhaps that explains why CALL 0x0 is being generated?
I have now confirmed that get42 is not added to the PLT when building with go1.21.1. As a result, when the generated cgo bindings refer to this symbol, the call comes out as CALL 0x0. In the go1.20.8 build, you can see that this address is in the PLT:
(gdb) x/3i 0x4020b0
0x4020b0 <get42@plt>: jmp *0x117f8a(%rip) # 0x51a040 <get42@got.plt>
0x4020b6 <get42@plt+6>: push $0x8
0x4020bb <get42@plt+11>: jmp 0x402020
get42@plt is not present in the go1.21.1 build.
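The numbers in the x/3i output for the go1.20.8 build are internally consistent, which a few lines of arithmetic can show. This sketch assumes the conventional x86-64 lazy-binding PLT layout (a header at PLT0 followed by 16-byte stubs, each pushing its relocation index); the function names are invented for the example and the addresses are taken from the GDB output above.

```go
package main

import "fmt"

// ripRelative returns the address referenced by a RIP-relative operand:
// the address of the next instruction plus the signed displacement.
func ripRelative(addr, size uint64, disp int32) uint64 {
	return uint64(int64(addr+size) + int64(disp))
}

// pltRelocIndex returns the relocation index of the stub at entry, assuming
// (not confirmed by the thread) a PLT0 header followed by 16-byte stubs.
func pltRelocIndex(plt0, entry uint64) (uint64, error) {
	const entrySize = 16
	if entry <= plt0 || (entry-plt0)%entrySize != 0 {
		return 0, fmt.Errorf("%#x is not a stub after PLT0 at %#x", entry, plt0)
	}
	return (entry-plt0)/entrySize - 1, nil
}

func main() {
	// jmp *0x117f8a(%rip) at 0x4020b0; the next instruction is at 0x4020b6,
	// so the jmp is 6 bytes long.
	got := ripRelative(0x4020b0, 6, 0x117f8a)
	fmt.Printf("GOT slot for get42: %#x\n", got)

	// The stub ends with jmp 0x402020, which is taken to be PLT0.
	idx, err := pltRelocIndex(0x402020, 0x4020b0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("relocation index: %#x\n", idx)
}
```

The computed GOT slot matches the get42@got.plt address GDB annotates, and the index matches the push $0x8 in the stub, so 0x4020b0 behaves like an ordinary lazily bound PLT entry.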
This code has me suspicious; however, I tried it with the if target.IsExternal() check commented out and that didn't seem to fix it.
https://github.com/golang/go/blob/5351bcf8225747f0ef39afc44c0499822992ed11/src/cmd/link/internal/amd64/asm.go#L248-L263
I'm new to actually working with the Go codebase. I tried to add some log.Printf calls to the asm.go file, but I must be missing a trick to actually see where those logs end up (or it's just not hitting the adddynrel function at all).
It seems that the code from go tool link is never hit in a build of this application. I guess I don't really understand how it fits together. :thinking:
When I tried adding some prints to cmd/cgo, specifically to look at the opened ELF objects to get the symbols, it seems at that point the get42 symbol isn't in those yet (in both go1.20.8 and go1.21.1), so it depends on when cgo actually compiles the cgo-gcc-prolog stuff, because the assembly from that is what generates the 0x0 instead of a PLT offset for the get42 symbol.
Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?
FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option. Example:
$ rm -rf /tmp/xxx
$ mkdir /tmp/xxx
$ go build -ldflags=-tmpdir=/tmp/xxx mycgoprogram.go
$ ls /tmp/xxx
000000.o  000005.o  000010.o  000015.o  go.dwarf
000001.o  000006.o  000011.o  000016.o  go.o
000002.o  000007.o  000012.o  000017.o  trivial.c
000003.o  000008.o  000013.o  000018.o
000004.o  000009.o  000014.o  a.out
$
The object files in /tmp/xxx are going to be the ones passed to the external linker in the final step, so it is a good spot where you can inspect them (both Go and C objects to see what's going on).
Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?
No, it is a good question. This is generally the best way to do this and what I would do if I wrote it myself.
The reason I am interested in the pattern I'm messing with here is what I mentioned at the end of the original comment on this issue; we discovered the bug when we tried to use https://github.com/NVIDIA/go-nvml with go1.21.1. It does this exact same pattern as in my reproduction; it has an nvml.h with all the functions from the shared object and includes that in the build with the --unresolved-symbols=ignore-in-object-files ld flag, calls dlopen when initializing, and then calls the direct CGO bindings instead of looking up each symbol (it does look up symbols, but only to verify their presence, not to actually call into them).
FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option.
Great idea, thank you! I didn't notice this flag when looking through options. I'll give that a try.
A git bisect produced this commit as the origin of the behaviour change: https://github.com/golang/go/commit/1f29f39795e736238200840c368c4e0c6edbfbae
The cause of my issue seems to be here: https://github.com/golang/go/blob/122b35e838af8ab9c0d5027741d6f73cef09f966/src/cmd/link/internal/ld/lib.go#L1682-L1691
When I forced this into the old behaviour (always adding -rdynamic to argv) my reproduction worked as expected. So in my reproduction, I tried adding -Wl,--export-dynamic and, building with go1.21.1, it worked.
So I'm tempted to say this isn't really a bug. This is just a strange behaviour in this particular case when -rdynamic isn't added unilaterally like it was before.
I suppose I'll ping @ianlancetaylor in case he's interested since it was his change, but looking at the original issue from the change I think it makes sense to stay the way it is now (at least based on what I understand). So I'm going to suggest to the go-nvml maintainers that this flag be added to their LDFLAGS.
I will now close this issue. Thanks Than for the suggestions!
Good detective work @braydonk . Yeah in retrospect the export dynamic change would seem to make sense given what you described.
Thanks for digging into this.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I created a minimal reproduction setup at https://github.com/braydonk/cgo_dl_repro
In this scenario, I have a header file that references a single function get42 that I will get from a shared object, which I will load at runtime with dlopen. The ld flags -Wl,--unresolved-symbols=ignore-in-object-files are used.
First, I run make liblib, which will compile the C file in this repo that implements the get42 function and then turn it into a shared object.
Then I run go run .
What did you expect to see?
In go1.20.8, and in go1.21.1 with ld version 2.34, I get the expected result:
What did you see instead?
In go1.21 with an ld version >2.38 I get a panic:
Additional Info
This seems to be a result of how CGO handles --unresolved-symbols=ignore-in-object-files. The unresolved symbol results in SIGSEGV because the address of the symbol is 0x0. In go1.20.8, when I completely eschew the dlopen step and just try to call C.get42() without loading anything, I get an unresolved symbol lookup error:
However in go1.21.1, I get a panic identical to calling it after loading the library.
Different ld versions
My setup for testing the different ld versions was actually by changing distros entirely. I have my personal machine which is on a Rolling Debian Testing distro, and VMs on Debian Bullseye (11), Ubuntu Jammy (22.04), and Ubuntu Focal (20.04). The panic in go1.21.1 occurs on every OS except Ubuntu Focal, and the only difference I could think of was the lower ld version, which is why I have called that out, BUT technically there could be some other secret difference that is causing this which I missed.
Why the strange setup?
This setup case may seem very oddly specific. I am mirroring the setup used by NVIDIA's Go NVML bindings; we discovered this error through our usage of that library. See https://github.com/NVIDIA/go-nvml/issues/36; in particular, you'll want to scroll down to the newest comments, which talk about how this specific breakage happened after upgrading to go1.21.1.