golang / go

runtime: cgo call to symbol from library loaded dynamically will panic with go 1.21.1 and ld >2.38 #63264

Closed braydonk closed 1 year ago

braydonk commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.21.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/braydonk/.cache/go-build'
GOENV='/home/braydonk/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/braydonk/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/braydonk/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/braydonk/Git/cgo_dl_repro/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1951551231=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I created a minimal reproduction setup at https://github.com/braydonk/cgo_dl_repro

In this scenario, I have a header file that references a single function get42 that I will get from a shared object, which I will load at runtime with dlopen. The ld flags -Wl,--unresolved-symbols=ignore-in-object-files are used.

First, I run make liblib, which will compile the C file in this repo that implements the get42 function and then turn it into a shared object.
Then I run go run .

What did you expect to see?

In go1.20.8, and in go1.21.1 with ld version 2.34, I get the expected result:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
get42 address:  0x7fb06a1fe0f9
42

What did you see instead?

In go1.21 with an ld version > 2.38 I get a panic:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
get42 address:  0x7f2c601c00f9
SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x48a800, 0xc000065eb8)
        /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000065e90 sp=0xc000065e58 pc=0x40590b
main._Cfunc_get42()
        _cgo_gotypes.go:139 +0x47 fp=0xc000065eb8 sp=0xc000065e90 pc=0x48a007
main.main()
        /home/braydonk/Git/cgo_dl_repro/main.go:24 +0xf9 fp=0xc000065f40 sp=0xc000065eb8 pc=0x48a6b9
runtime.main()
        /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc000065fe0 sp=0xc000065f40 pc=0x435e9b
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000065fe8 sp=0xc000065fe0 pc=0x45f901

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000050fa8 sp=0xc000050f88 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc000050fe0 sp=0xc000050fa8 pc=0x436173
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000050fe8 sp=0xc000050fe0 pc=0x45f901
created by runtime.init.6 in goroutine 1
        /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000051778 sp=0xc000051758 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
        /usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000517c8 sp=0xc000051778 pc=0x422c14
runtime.gcenable.func1()
        /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000517e0 sp=0xc0000517c8 pc=0x417fa5
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000517e8 sp=0xc0000517e0 pc=0x45f901
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00007a000?, 0x4c5128?, 0x1?, 0x0?, 0xc0000071e0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000051f70 sp=0xc000051f50 pc=0x4362ee
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x53bfe0)
        /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000051fa0 sp=0xc000051f70 pc=0x4204a9
runtime.bgscavenge(0x0?)
        /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc000051fc8 sp=0xc000051fa0 pc=0x420a3c
runtime.gcenable.func2()
        /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc000051fe0 sp=0xc000051fc8 pc=0x417f45
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000051fe8 sp=0xc000051fe0 pc=0x45f901
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000052628 sp=0xc000052608 pc=0x4362ee
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000527e0 sp=0xc000052628 pc=0x417027
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000527e8 sp=0xc0000527e0 pc=0x45f901
created by runtime.createfing in goroutine 1
        /usr/local/go/src/runtime/mfinal.go:163 +0x3d

rax    0x0
rbx    0xc000065eb8
rcx    0xc000065eb8
rdx    0xc000065e48
rdi    0xc000065eb8
rsi    0x53c080
rbp    0xc000065e48
rsp    0x7ffd04bb7088
r8     0x53c460
r9     0x0
r10    0x1
r11    0x206
r12    0xc000066000
r13    0x53c460
r14    0xc0000061a0
r15    0x8
rip    0x0
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0
exit status 2

Additional Info

This seems to be a result of how CGO handles --unresolved-symbols=ignore-in-object-files. The unresolved symbol results in SIGSEGV because the address of the symbols is 0x0. In go1.20.8 when I completely eschew the dlopen step and just try to call C.get42() without loading anything, I get an unresolved symbol lookup error:

braydonk@braydonk:~/Git/cgo_dl_repro$ go run .
/tmp/go-build1699640599/b001/exe/cgo_dl_repro: symbol lookup error: /tmp/go-build1699640599/b001/exe/cgo_dl_repro: undefined symbol: get42
exit status 127

However in go1.21.1, I get a panic identical to calling it after loading the library.

Different ld versions

My setup for testing the different ld versions was actually by changing distros entirely. I have my personal machine, which is on a Rolling Debian Testing distro, and VMs on Debian Bullseye (11), Ubuntu Jammy (22.04), and Ubuntu Focal (20.04). The panic in go1.21.1 occurs on every OS except Ubuntu Focal, and the only difference I could think of was the lower ld version, which is why I have called that out, but technically there could be some other difference I missed that is causing this.

Why the strange setup?

This setup may seem oddly specific. I am mirroring the setup used by NVIDIA's Go NVML bindings; we discovered this error through our usage of that library. See https://github.com/NVIDIA/go-nvml/issues/36; in particular, scroll down to the newest comments, which talk about how this specific breakage happened after upgrading to go1.21.1.

braydonk commented 1 year ago

I tried building the binary in my reproduction repo with go1.20.8 and go1.21.1 and then ran nm to check the symbols in each binary.

Go 1.20:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
0000000000483740 T _cgo_59b4640d347f_Cfunc_get42
                 U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051a1c8 d main._cgo_59b4640d347f_Cfunc_get42

Go 1.21:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047ce70 T _cgo_59b4640d347f_Cfunc_get42
000000000047ccc0 t main._Cfunc_get42.abi0
00000000005191a8 d main._cgo_59b4640d347f_Cfunc_get42

So in this case it didn't even show up as an undefined symbol in Go 1.21.1. I think this would explain why it panics in Go 1.21; in Go 1.20 the symbol is there as an undefined symbol, which is why it works with a symbol lookup error in Go 1.20, and (I think; this is all new to me) why we can find the symbol after dlopen.
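The nm comparison above can also be scripted. What follows is a minimal sketch using Go's standard debug/elf package; the function name symbolState and the command-line handling are hypothetical and not part of the reproduction repo. It reads the static symbol table (.symtab), which is what nm reads by default, so it returns an error on a stripped binary.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// symbolState reports whether name is "defined", "undefined" (what nm shows
// as U), or "absent" in the static symbol table of the ELF file at path.
// Hypothetical helper, for illustration only.
func symbolState(path, name string) (string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	syms, err := f.Symbols() // reads .symtab; errors if the section is missing
	if err != nil {
		return "", err
	}
	for _, s := range syms {
		if s.Name != name {
			continue
		}
		if s.Section == elf.SHN_UNDEF {
			return "undefined", nil
		}
		return "defined", nil
	}
	return "absent", nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: symstate <binary> <symbol>")
		return
	}
	state, err := symbolState(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Println(os.Args[2], "is", state)
}
```

Going by the nm output above, get42 should come back as undefined for the Go 1.20 build and absent for the Go 1.21 build.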

braydonk commented 1 year ago

Tried it in an Ubuntu 20.04 VM (ld version 2.34) with Go 1.21, and got the same result as compiling with Go 1.20 on ld version 2.41:

braydonk@focal-test:~/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000048a820 T _cgo_b06122c1f854_Cfunc_get42
                 U get42
0000000000489fc0 t main._Cfunc_get42.abi0
00000000005321e8 d main._cgo_b06122c1f854_Cfunc_get42

braydonk commented 1 year ago

I'm trying to rule out ld by trying out a different linker. Using Go 1.21 with the following build command:

go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,--unresolved-symbols=ignore-in-object-files"' .

By checking with nm the unresolved symbol is there:

braydonk@braydonk:~/Git/cgo_dl_repro$ nm cgo_dl_repro | grep get42
000000000047bff0 T _cgo_59b4640d347f_Cfunc_get42
                 U get42
000000000047be40 t main._Cfunc_get42.abi0
00000000005181a8 d main._cgo_59b4640d347f_Cfunc_get42

However, running the binary still results in a panic. So I guess the difference in symbols in the executable isn't the root cause here.

braydonk commented 1 year ago

Tried another trick borrowed from the earlier linked go-nvml issue discussion, using the --weak-unresolved-symbols flag for gold. This resulted in a panic as well. (This was kind of silly and a red herring, because in this case with a weak symbol where I don't load the library the panic probably makes sense. It also panics in Go 1.20)

braydonk commented 1 year ago

Actually, this works. In the last comment, I was testing by just trying to call the symbol and not loading the library. However, with gold and --weak-unresolved-symbols and running dlopen it works with Go 1.21.

braydonk commented 1 year ago

In my reproduction repo, I ran the cgo command directly on main.go (with any references to the other file commented out) and all the generated output was identical between go1.21.1 and go1.20.8 (at least according to my attempts to diff the two generated _obj folders with meld).

braydonk commented 1 year ago

I debugged two binaries built with go1.20.8 and go1.21.1 respectively, using the reproduction repo but commenting out the part where the dynamic library is loaded. This is a run where the expected output would be a symbol lookup error.

CGO output

The generated CGO output contains this C function, _cgo_1dc841591e27_Cfunc_get42:

CGO_NO_SANITIZE_THREAD
void
_cgo_1dc841591e27_Cfunc_get42(void *v)
{
    struct {
        int r;
        char __pad4[4];
    } __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
    char *_cgo_stktop = _cgo_topofstack();
    __typeof__(_cgo_a->r) _cgo_r;
    _cgo_tsan_acquire();
    _cgo_r = get42();
    _cgo_tsan_release();
    _cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
    _cgo_a->r = _cgo_r;
    _cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}

Go 1.21

At the line _cgo_r = get42(), in go1.21.1, the program segfaults. Here's the GDB output with a few steps of context:

_cgo_topofstack () at /usr/local/go/src/runtime/asm_amd64.s:1645
1645        RET
(gdb) info registers
rax            0xc000038800        824633952256
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe298      0x7fffffffe298
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0x100               256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x45ed98            0x45ed98 <_cgo_topofstack+24>
eflags         0x216               [ PF AF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc0000386f8) at /tmp/go-build/cgo-gcc-prolog:52
52  /tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) info registers
rax            0xc000038800        824633952256
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe2a0      0x7fffffffe2a0
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0xc000038800        824633952256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x485a03            0x485a03 <_cgo_effbaea66e62_Cfunc_get42+19>
eflags         0x216               [ PF AF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) step

Thread 1 "cgo_dl_repro" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) info registers
rax            0x0                 0
rbx            0xc0000386f8        824633951992
rcx            0xc0000386f8        824633951992
rdx            0xc000038688        824633951880
rsi            0x533100            5452032
rdi            0xc0000386f8        824633951992
rbp            0xc000038688        0xc000038688
rsp            0x7fffffffe298      0x7fffffffe298
r8             0x5334e0            5453024
r9             0x0                 0
r10            0x410               1040
r11            0xffffffffffffffff  -1
r12            0xc000038800        824633952256
r13            0x6a                106
r14            0xc0000061a0        824633745824
r15            0x4                 4
rip            0x0                 0x0
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) 

Go 1.20

In go1.20.8, when that same line in the generated CGO is reached, it moves on to dl_signal_exception. GDB output starting from the same spot as the above example:

_cgo_topofstack () at /home/braydonk/go_versions/go1.20.8/go/src/runtime/asm_amd64.s:1593
1593        RET
(gdb) step
_cgo_effbaea66e62_Cfunc_get42 (v=0xc00004bf38) at /tmp/go-build/cgo-gcc-prolog:52
52  /tmp/go-build/cgo-gcc-prolog: No such file or directory.
(gdb) step
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3287
3287    ./malloc/malloc.c: No such file or directory.
(gdb) step
3294    in ./malloc/malloc.c
(gdb) step
3299    in ./malloc/malloc.c
(gdb) step
checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
1343    in ./malloc/malloc.c
(gdb) finish
Run till exit from #0  checked_request2size (sz=<synthetic pointer>, req=69) at ./malloc/malloc.c:1343
__GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
3299    in ./malloc/malloc.c
(gdb) finish
Run till exit from #0  __GI___libc_malloc (bytes=69) at ./malloc/malloc.c:3299
0x00007ffff7fc7cca in malloc (size=69) at ../include/rtld-malloc.h:56
56  ../include/rtld-malloc.h: No such file or directory.
Value returned is $2 = (void *) 0x566820
(gdb) step
__GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50, occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:91
91  ./elf/dl-error-skeleton.c: No such file or directory.
(gdb) step
92  in ./elf/dl-error-skeleton.c
(gdb) step
93  in ./elf/dl-error-skeleton.c
(gdb) step
102 in ./elf/dl-error-skeleton.c
(gdb) finish
warning: Function __GI__dl_signal_exception does not return normally.
Try to finish anyway? (y or n) y
Run till exit from #0  __GI__dl_signal_exception (errcode=0, exception=0x7fffffffde50, 
    occasion=0x7ffff7ff0ecd "symbol lookup error") at ./elf/dl-error-skeleton.c:102
/home/braydonk/cgo_dl_repro_120/cgo_dl_repro: symbol lookup error: /home/braydonk/cgo_dl_repro_120/cgo_dl_repro: undefined symbol: get42
[Thread 0x7fffcf9e8640 (LWP 25416) exited]
[Thread 0x7fffd01e9640 (LWP 25415) exited]
[Thread 0x7fffd09ea640 (LWP 25414) exited]
[Thread 0x7fffcf1a7640 (LWP 25417) exited]
[Inferior 1 (process 25413) exited with code 0177]
(gdb)

braydonk commented 1 year ago

Added a new experiment in the reproduction repo where I wrote a small C program that attempts to get symbol resolution the same way that worked in go1.20.8: ignore unresolved symbols in object files, call dlopen, and expect a function call to work.

When compiled with gcc 9.4.0 (Focal) and gcc 11.4.0 (Jammy), the program segfaulted at the function call.

When compiled with gcc version 13.2.0 (Rolling Debian Testing), the program produced the error ./main: error while loading shared libraries: unexpected PLT reloc type 0x00.

I'm sure CGO's version of "call this function from the header" is different than C's version of "call this function from the header", although when I look at the cgo generation it does look like it just calls the function kind of the same way.

Admittedly it does seem off to me that a dlopen earlier in the program would cause a previously unresolved symbol to just work; usually with dlopen you call into stuff from the dynamic library through a dlsym lookup. I figured there must be some magic dlopen does at runtime that I wasn't aware of.

Either way, it is very strange that go1.20.8 does not crash in this scenario on any system I've tested, and that go1.21.1 only worked on my Focal test system when the cgo generation is the same, and that when I do the same thing manually in C it segfaults on all systems.

thanm commented 1 year ago

@golang/compiler

braydonk commented 1 year ago

On my Rolling Debian Testing machine, I did a go tool objdump on a binary built with go1.20.8 and a binary built with go1.21.1. This is without the dlopen, just trying to call the unknown symbol.

Go 1.20.8

TEXT _cgo_49665a31f432_Cfunc_get42(SB) 
  :0            0x483740        4154            PUSHQ R12           
  :0            0x483742        55          PUSHQ BP            
  :0            0x483743        53          PUSHQ BX            
  :0            0x483744        4889fb          MOVQ DI, BX         
  :0            0x483747        e894c1fdff      CALL _cgo_topofstack(SB)    
  :0            0x48374c        4989c4          MOVQ AX, R12            
  :0            0x48374f        31c0            XORL AX, AX         
  :0            0x483751        e85ae9f7ff      CALL 0x4020b0           
  :0            0x483756        89c5            MOVL AX, BP         
  :0            0x483758        e883c1fdff      CALL _cgo_topofstack(SB)    
  :0            0x48375d        4c29e0          SUBQ R12, AX            
  :0            0x483760        892c03          MOVL BP, 0(BX)(AX*1)        
  :0            0x483763        5b          POPQ BX             
  :0            0x483764        5d          POPQ BP             
  :0            0x483765        415c            POPQ R12            
  :0            0x483767        c3          RET             

Go 1.21.1

TEXT _cgo_49665a31f432_Cfunc_get42(SB) 
  :0            0x47ce70        4154            PUSHQ R12           
  :0            0x47ce72        55          PUSHQ BP            
  :0            0x47ce73        53          PUSHQ BX            
  :0            0x47ce74        4889fb          MOVQ DI, BX         
  :0            0x47ce77        e88416feff      CALL _cgo_topofstack(SB)    
  :0            0x47ce7c        4989c4          MOVQ AX, R12            
  :0            0x47ce7f        31c0            XORL AX, AX         
  :0            0x47ce81        e87a31b8ff      CALL 0x0            
  :0            0x47ce86        89c5            MOVL AX, BP         
  :0            0x47ce88        e87316feff      CALL _cgo_topofstack(SB)    
  :0            0x47ce8d        4c29e0          SUBQ R12, AX            
  :0            0x47ce90        892c03          MOVL BP, 0(BX)(AX*1)        
  :0            0x47ce93        5b          POPQ BX             
  :0            0x47ce94        5d          POPQ BP             
  :0            0x47ce95        415c            POPQ R12            
  :0            0x47ce97        c3          RET             

In the Go 1.21.1 dump, the generated cgo binding contains a CALL 0x0 at instruction 0x47ce81, which is the call to get42 from the generated cgo function I showed in https://github.com/golang/go/issues/63264#issuecomment-1738418035. In the Go 1.20.8 build this is not CALL 0x0 but CALL 0x4020b0 (instruction 0x483751). Not sure what that might be referring to.
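The two call targets can be recomputed from the raw instruction bytes in the listings. This is a small sketch, with the hypothetical helper name callTarget, that decodes an x86-64 near call: opcode 0xE8 followed by a little-endian signed 32-bit displacement, taken relative to the address of the next instruction.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// callTarget decodes a 5-byte x86-64 near call (opcode 0xE8 followed by a
// little-endian signed 32-bit displacement) located at addr. The displacement
// is relative to the address of the next instruction, addr+5.
// Hypothetical helper, for illustration only.
func callTarget(addr uint64, insn [5]byte) (uint64, error) {
	if insn[0] != 0xE8 {
		return 0, fmt.Errorf("not a CALL rel32: opcode %#x", insn[0])
	}
	rel := int32(binary.LittleEndian.Uint32(insn[1:]))
	// Converting the sign-extended displacement to uint64 wraps, so the
	// addition is effectively a subtraction for negative displacements.
	return addr + 5 + uint64(int64(rel)), nil
}

func main() {
	// Addresses and bytes copied from the two objdump listings above.
	go120, _ := callTarget(0x483751, [5]byte{0xe8, 0x5a, 0xe9, 0xf7, 0xff})
	go121, _ := callTarget(0x47ce81, [5]byte{0xe8, 0x7a, 0x31, 0xb8, 0xff})
	fmt.Printf("go1.20.8: CALL %#x\n", go120) // go1.20.8: CALL 0x4020b0
	fmt.Printf("go1.21.1: CALL %#x\n", go121) // go1.21.1: CALL 0x0
}
```

In the Go 1.21.1 bytes the displacement is exactly the negative of the next instruction's address, which suggests the external linker resolved get42 to absolute address 0 instead of leaving a reference to be filled in at load time.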

braydonk commented 1 year ago

Could 0x402000 be the PLT from the program header?

braydonk commented 1 year ago

The 0x402000 is a PT_LOAD in the built exe header, using a build of go from master that I just downloaded:

braydonk@braydonk:~/Git/cgo_dl_repro$ readelf --segments with_dev_go 

Elf file type is EXEC (Executable file)
Entry point 0x402330
There are 14 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000310 0x0000000000000310  R      0x8
  INTERP         0x0000000000000350 0x0000000000400350 0x0000000000400350
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000001158 0x0000000000001158  R      0x1000
  LOAD           0x0000000000002000 0x0000000000402000 0x0000000000402000
                 0x000000000007b74d 0x000000000007b74d  R E    0x1000
  LOAD           0x000000000007e000 0x000000000047e000 0x000000000047e000
                 0x000000000009bfc0 0x000000000009bfc0  R      0x1000
  LOAD           0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000009ab0 0x000000000003b940  RW     0x1000
  DYNAMIC        0x000000000011ae00 0x000000000051ae00 0x000000000051ae00
                 0x00000000000001d0 0x00000000000001d0  RW     0x8
  NOTE           0x0000000000000370 0x0000000000400370 0x0000000000400370
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000390 0x0000000000400390 0x0000000000400390
                 0x00000000000000a8 0x00000000000000a8  R      0x4
  TLS            0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000000000 0x0000000000000008  R      0x8
  GNU_PROPERTY   0x0000000000000370 0x0000000000400370 0x0000000000400370
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x0000000000119880 0x0000000000519880 0x0000000000519880
                 0x0000000000000154 0x0000000000000154  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x000000000011adf0 0x000000000051adf0 0x000000000051adf0
                 0x0000000000000210 0x0000000000000210  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .note.go.buildid .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .text .fini 
   04     .rodata .typelink .itablink .gopclntab .eh_frame_hdr .eh_frame 
   05     .init_array .fini_array .dynamic .got .got.plt .data .go.buildinfo .noptrdata .bss .noptrbss 
   06     .dynamic 
   07     .note.gnu.property 
   08     .note.gnu.build-id .note.ABI-tag .note.go.buildid 
   09     .tbss 
   10     .note.gnu.property 
   11     .eh_frame_hdr 
   12     
   13     .init_array .fini_array .dynamic .got 

I do think it's the PLT. So I guess the unresolved symbol doesn't have an entry in the PLT like I expected it would, and perhaps that explains why CALL 0x0 is being generated?
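As a quick check of that reading, the sketch below tests which PT_LOAD range an address falls into. The segment values are copied by hand from the readelf output above; the type and function names are hypothetical. One caveat: 0x4020b0 came from the go1.20.8 build while this header is from a development build of Go, so the check assumes the two layouts are close enough to compare.

```go
package main

import "fmt"

// segment is one PT_LOAD entry, copied by hand from the readelf output above.
type segment struct {
	vaddr, memsz uint64
	flags        string
}

var loads = []segment{
	{0x400000, 0x1158, "R"},
	{0x402000, 0x7b74d, "R E"}, // segment 03: .init .plt .text .fini
	{0x47e000, 0x9bfc0, "R"},
	{0x51adf0, 0x3b940, "RW"},
}

// containing returns the PT_LOAD segment whose [vaddr, vaddr+memsz) range
// holds addr, or false if no segment does.
func containing(addr uint64) (segment, bool) {
	for _, s := range loads {
		if addr >= s.vaddr && addr < s.vaddr+s.memsz {
			return s, true
		}
	}
	return segment{}, false
}

func main() {
	for _, addr := range []uint64{0x4020b0, 0x0} {
		if s, ok := containing(addr); ok {
			fmt.Printf("%#x is in LOAD %#x (%s)\n", addr, s.vaddr, s.flags)
		} else {
			fmt.Printf("%#x is not mapped by any LOAD segment\n", addr)
		}
	}
}
```

The Go 1.20.8 target lands in the executable segment that the section mapping says holds .plt, while address 0 is not covered by any segment, which matches the SIGSEGV with PC=0x0.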

braydonk commented 1 year ago

I have now confirmed that get42 is not added to the PLT when building with go1.21.1. As a result, when referring to this symbol in the generated cgo bindings, it's just generating CALL 0x0. In the go1.20.8 build, you can see that this address is in the PLT:

(gdb) x/3i 0x4020b0
   0x4020b0 <get42@plt>:    jmp    *0x117f8a(%rip)        # 0x51a040 <get42@got.plt>
   0x4020b6 <get42@plt+6>:  push   $0x8
   0x4020bb <get42@plt+11>: jmp    0x402020

get42@plt is not present in the go1.21.1 build.

braydonk commented 1 year ago

This code has me suspicious; however, I tried it with the if target.IsExternal() check commented out and that didn't seem to fix it: https://github.com/golang/go/blob/5351bcf8225747f0ef39afc44c0499822992ed11/src/cmd/link/internal/amd64/asm.go#L248-L263

braydonk commented 1 year ago

I'm new to actually working with the Go codebase. I tried to add some log.Printfs to the asm.go file, but I must be missing a trick to actually see wherever those logs are coming from (or it's just not hitting the adddynrel function at all)

braydonk commented 1 year ago

It seems that the code from go tool link is never hit in a build of this application. I guess I don't really understand how it fits together. :thinking: When I tried adding some prints to cmd/cgo, specifically to look at the opened elf objects to get the symbols, it seems at that point the get42 symbol isn't in those yet (in both go1.20.8 and go1.21.1), so it depends on when cgo actually compiles the cgo-gcc-prolog stuff, cause the assembly from that is what generates the 0x0 instead of a PLT offset for the get42 symbol.

thanm commented 1 year ago

Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?

FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option. Example:

$ rm -rf /tmp/xxx
$ mkdir /tmp/xxx
$ go build -ldflags=-tmpdir=/tmp/xxx mycgoprogram.go
$ ls /tmp/xxx
000000.o  000005.o  000010.o  000015.o  go.dwarf
000001.o  000006.o  000011.o  000016.o  go.o
000002.o  000007.o  000012.o  000017.o  trivial.c
000003.o  000008.o  000013.o  000018.o
000004.o  000009.o  000014.o  a.out
$

The object files in /tmp/xxx are going to be the ones passed to the external linker in the final step, so it is a good spot where you can inspect them (both Go and C objects to see what's going on).

braydonk commented 1 year ago

Stupid question: if you are loading up a library using dlopen() already, why not just use "dlsym" to find the address of the function you are interested in and call it that way?

No, it is a good question. This is generally the best way to do this and what I would do if I wrote it myself.
The reason I am interested in the pattern I'm messing with here is what I mentioned at the end of the original comment on this issue; we discovered the bug when we tried to use https://github.com/NVIDIA/go-nvml with go1.21.1. It does this exact same pattern as in my reproduction; it has an nvml.h with all the functions from the shared object and includes that in the build with --unresolved-symbols=ignore-in-object-files ld flag, calls dlopen when initializing, and then calls the direct CGO bindings instead of looking up each symbol (it does look up symbols, but only to verify their presence not to actually call into them).

FYI one thing that I think can help when working on these sorts of problems is to use the Go linker's "-tmpdir" option.

Great idea, thank you! I didn't notice this flag when looking through options. I'll give that a try.

braydonk commented 1 year ago

A git bisect produced this commit as the origin of the behaviour change: https://github.com/golang/go/commit/1f29f39795e736238200840c368c4e0c6edbfbae

The cause of my issue seems to be here: https://github.com/golang/go/blob/122b35e838af8ab9c0d5027741d6f73cef09f966/src/cmd/link/internal/ld/lib.go#L1682-L1691
When I forced this back to the old behaviour (always adding -rdynamic to argv), my reproduction worked as expected. So in my reproduction I tried adding -Wl,--export-dynamic to the ld flags, and when built with go1.21.1 it worked.

So I'm tempted to say this isn't really a bug. This is just a strange behaviour in this particular case when -rdynamic isn't added unilaterally like it was before.

I suppose I'll ping @ianlancetaylor in case he's interested since it was his change, but looking at the original issue from the change I think it makes sense to stay the way it is now (at least based on what I understand). So I'm going to suggest to the go-nvml maintainers that this flag be added to their LDFLAGS.

I will now close this issue. Thanks Than for the suggestions!

thanm commented 1 year ago

Good detective work @braydonk. Yeah, in retrospect the export dynamic change would seem to make sense given what you described.

ianlancetaylor commented 1 year ago

Thanks for digging into this.