Closed turekt closed 1 year ago
Thanks so much for your contribution! We will review this as soon as possible.
I do have one question, however: could you provide an example of a case where the function VAs differ between Ghidra and GoReSym? I ask because GoReSym has historically had one or two bugs where VAs were not recovered correctly and were off by some fixed offset. This sounds somewhat similar to what you may be seeing, so I would like to first make sure you haven't found a GoReSym bug. A binary to review and the JSON output of GoReSym, or even just an image, would be more than enough for me to understand what you may be seeing.
Hi @stevemk14ebr, the wrong VA is still there; it has not been fixed. I tested the current main branch of GoReSym using the GoReSym exe on Windows.
@HongThatCong I will investigate this starting next week, thank you very much.
Hi,
you can basically use any stripped binary, since I have observed that every binary needed an offset of some sort. For instance, take a simple hello world binary stripped in two ways, with strip and with ldflags:
$ cat main.go
package main
import "fmt"
func main() {
fmt.Println("Hello world!")
}
$ go build -o main_strip main.go
$ strip main_strip
$ go build -o main_ldstrip -ldflags "-s -w" main.go
$ GoReSym -d -p -t main_strip > main_strip.json
$ GoReSym -d -p -t main_ldstrip > main_ldstrip.json
$ jq .TabMeta.VA main_strip.json main_ldstrip.json
4951232
4951008
After loading these into Ghidra, the script correctly estimated:
main_strip
main_ldstrip
What I am seeing from GoReSym output file is a start address of 5475264:
$ jq .UserFunctions main_strip.json
[
{
"Start": 5475264,
"End": 5475367,
"PackageName": "main",
"FullName": "main.main"
}
]
If I go to that address in Ghidra, I end up in the .bss section:
00538bc0 ?? ??
I am not sure, but since I was getting different offsets for different binaries, I assumed that this is just a side effect of how Ghidra loads things during disassembly.
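GoReSym prints addresses as decimal integers while Ghidra's listing shows them in hex, so a quick conversion makes the comparison easier. A minimal sketch; to_ghidra_addr is a hypothetical helper name and is not part of GoReSym or the script:

```python
def to_ghidra_addr(va):
    """Format a decimal VA from GoReSym JSON the way Ghidra's listing prints it."""
    return "{:08x}".format(va)


# Start of main.main taken from the main_strip.json output above.
print(to_ghidra_addr(5475264))  # 00538bc0
```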
Thanks for the review! I have made changes based on your comments, except for the Enum usage for the list of choices.
In addition to your recommended changes, I have tested the script with Python3 in Ghidrathon and made a few simple changes in order to support both Jython/Python2.7 in Ghidra and Python3 in Ghidrathon.
This script should now work in both Ghidra and Ghidrathon.
great work @turekt!
I have been able to reproduce the issue with the VAs reported by GoReSym being offset by some value. It appears this affects ELF binaries only.
Hi @stevemk14ebr,
let me just note that I saw the same behaviour for PE files as well.
I have managed to reproduce this for PE as follows:
$ cat main.go
package main
import "fmt"
func main() {
fmt.Println("Hello world!")
}
$ GOOS=windows GOARCH=amd64 go build -ldflags "-s -w" -o main_windows_ldstrip main.go
$ GoReSym -d -t -p main_windows_ldstrip > mainwin_ldstrip.json
$ jq .TabMeta.VA mainwin_ldstrip.json
5043712
When I open this address in Ghidra, the TabMeta.VA maps correctly (I see the correct signature):
DAT_004cf600 XREF[1]: 00530960(*)
004cf600 f0 ?? F0h
004cf601 ff ?? FFh
004cf602 ff ?? FFh
004cf603 ff ?? FFh
...
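As a quick check of the signature, the four bytes in the listing can be read as a little-endian 32-bit value. A small sketch, assuming (from my reading of Go's debug/gosym package, so treat it as an assumption) that 0xfffffff0 is the pclntab magic introduced with Go 1.18; the constant name below is my own:

```python
import struct

# Assumed pclntab magic for Go 1.18+ binaries (label is hypothetical).
ASSUMED_GO118_MAGIC = 0xFFFFFFF0

# First four bytes at 004cf600 as shown in the Ghidra listing.
header = bytes.fromhex("f0ffffff")
magic = struct.unpack("<I", header)[0]

print(hex(magic))  # 0xfffffff0
print(magic == ASSUMED_GO118_MAGIC)
```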
So I tried to map the Start of main.main:
$ jq .UserFunctions[0] mainwin_ldstrip.json
{
"Start": 5358720,
"End": 5358823,
"PackageName": "main",
"FullName": "main.main"
}
And when I go to that address in Ghidra, I end up in .rdata:
0051c480 ff ?? FFh
0051c481 ff ?? FFh
0051c482 ff ?? FFh
0051c483 ff ?? FFh
0051c484 88 ?? 88h
0051c485 b9 ?? B9h
0051c486 00 ?? 00h
0051c487 00 ?? 00h
Running the script shows the following results with available strategies:
entry to _rt0_ function mapping renamed 1076 and created 371 functions
pclntab to TabMeta VA mapping renamed 0 and created 1447 functions
known function names mapping renamed 0 and created 1447 functions
no estimation renamed 0 and created 1447 functions
After choosing the recommended strategy and setting the estimated offset of -581632, main.main ends up at 0x48e480 with correct disassembly shown:
**************************************************************
* FUNCTION *
**************************************************************
undefined main.main()
...
0048e480 49 3b 66 10 CMP RSP, qword ptr [R14 + 0x10]
0048e484 76 56 JBE LAB_0048e4dc
0048e486 48 83 ec 40 SUB RSP, 0x40
0048e48a 48 89 6c MOV qword ptr [RSP + local_8], RBP
0048e48f 48 8d 6c LEA RBP=>local_8, [RSP + 0x38]
0048e494 44 0f 11 MOVUPS xmmword ptr [RSP + local_18[0]], XMM15
0048e49a 48 8d 15 LEA RDX, [string]
0048e4a1 48 89 54 MOV qword ptr [RSP + local_18[0]], RDX=>string
0048e4a6 48 8d 15 LEA RDX, [PTR_DAT_004cb840]
0048e4ad 48 89 54 MOV qword ptr [RSP + local_18[8]], RDX=>PTR_DAT_00
0048e4b2 48 8b 1d MOV RBX, qword ptr [DAT_00542a70]
0048e4b9 48 8d 05 LEA RAX, [interface_io.Writer_impl_*os.File]
0048e4c0 48 8d 4c LEA RCX=>local_18, [RSP + 0x28]
0048e4c5 bf 01 00 MOV EDI, 0x1
0048e4ca 48 89 fe MOV RSI, RDI
0048e4cd e8 2e aa CALL fmt.Fprintln
0048e4d2 48 8b 6c MOV RBP=>local_8, qword ptr [RSP + 0x38]
0048e4d7 48 83 c4 40 ADD RSP, 0x40
0048e4db c3 RET
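The arithmetic behind that result can be sketched as follows. This is only an illustration of applying a fixed offset to the Start values; apply_offset is a hypothetical helper and not the code in the script:

```python
def apply_offset(funcs, offset):
    """Return (name, corrected start) pairs by shifting each GoReSym Start by offset."""
    return [(f["FullName"], f["Start"] + offset) for f in funcs]


# UserFunctions entry from mainwin_ldstrip.json and the estimated offset from above.
user_funcs = [
    {"Start": 5358720, "End": 5358823, "PackageName": "main", "FullName": "main.main"}
]

for name, addr in apply_offset(user_funcs, -581632):
    print(name, hex(addr))  # main.main 0x48e480
```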
Oh interesting, I'll investigate further!
I have located the bug! In this routine (https://github.com/golang/go/blob/027ff3f47d5d6557067324c342c8e14d7da1cf7a/src/debug/gosym/pclntab.go#L411), Go adds the section base to resolve the final VA. I use the base of the current "candidate" section rather than the actual .text section base address. If you subtract the base of the section that contains the pclntab from the .text section base address, you get the 'correction' offset that your script is finding.
Not sure about a good fix yet, so stay tuned, and thanks for reporting. I'll merge this after the bug is addressed, as we may not need the logic to locate the offset anymore.
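To make the relationship concrete, here is a small sketch of the subtraction described above. The two section bases are assumptions: they are illustrative values chosen so that the result matches the -581632 offset from the PE example, and they were not read from the binary in this thread.

```python
def correction_offset(text_base, pclntab_section_base):
    """Offset to add to a VA that was resolved against the wrong section base."""
    return text_base - pclntab_section_base


# ASSUMED, illustrative section bases (not taken from the thread).
ASSUMED_TEXT_BASE = 0x401000
ASSUMED_PCLNTAB_SECTION_BASE = 0x48F000

offset = correction_offset(ASSUMED_TEXT_BASE, ASSUMED_PCLNTAB_SECTION_BASE)
print(offset)                 # -581632
print(hex(5358720 + offset))  # 0x48e480
```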
Hi @stevemk14ebr,
nice catch! I have checked your conclusion against test binaries and can confirm that it is correct for most Go binaries, except for one case when using cgo. I have put reproduction steps for this case in the section below the comment.
Because of this cgo case, I feel it might be better to keep the offset calculation logic in the final version of this script so that it covers more use cases. We could change the script so that it performs the initial estimation and checks whether no offset gives the most function mappings; if it does, it can skip the dialog that asks for the offset estimation strategy and offset value.
Let's see how the final fix is going to look. We can decide whether or not to leave the offset calculation logic after that since the fix may cover the cgo case as well. If I can help you with the fix in any way, just let me know.
P.S. There was a bug in the script when GoReSym was not executed with the -d flag: it raised an exception about concatenating NoneType and list, or something along those lines. This is now fixed.
EDIT: I have reproduced your CGO issue. Thanks for reporting this one too!
I am looking into resolving the .text base bug you previously reported. I agree with your reasoning: we should keep the logic but adjust it to first check whether an offset of 0 is acceptable. Would you be able to commit that? Once you've made that modification I will merge this PR, and then we can handle these bugs in separate issues going forward.
Ok I have a solution I think will work to resolve both of these. Let me explain the issue you reported with CGO first though.
The large offset you've seen is mostly due to the previous bug of me using the incorrect section base. Assuming I use the .text section base, the mismatch becomes much smaller: an offset of only 0x100 is required.
.text base (via file header): 0x4022a0
Correct: 0x45AB20
Incorrect: 0x45AA20
That small delta is apparently due to something the go linker does (for cgo only?). If we look at the .text section in a disassembler and then compare it with the .text address stored in the moduledata structure we see a mismatch:
.text (via file header) base: 0x4022a0
.text (via moduledata) base: 0x4023A0
They are off by exactly 0x100. I do not know why cgo does this, but it seems it shoves some shimming code into the start of the real .text section. It then adjusts the .text base pointer in the moduledata structure by that size, so that the other places where Go internally resolves these symbols end up being correct. I tried looking at the code of the compiler to find out why they would do it this way, but I can't see a reason.
Regardless, the correct solution appears to be rather simple. First, we continue to resolve the symbols as we do now, using the base of the section that the pclntab is in. This will result in incorrect VAs, BUT we should still be able to find symbols, and this should also be enough to locate the moduledata. As soon as we find the moduledata, we read out the .text field it stores to get the 'real' (faked) .text section address that Go actually uses. We then redo symbol resolution using this new, correct .text base. We already bother locating and parsing the moduledata, so it shouldn't be much extra work to parse that field out and retry.
The reason I am bothering to parse the pclntab twice in this way is because packed executables totally screw with section names and sizes, so it is not safe for us to just locate the .text via the file headers.
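The two-pass idea can be sketched roughly as below. This is not GoReSym's code (GoReSym is written in Go); every name here is hypothetical, the entry offset 0x58780 is derived from the addresses quoted above (0x45AB20 - 0x4023A0), and for illustration the first pass uses the file-header .text base rather than the pclntab section base.

```python
def resolve(entry_offsets, base):
    """Resolve function VAs as entry offset plus a base address."""
    return {name: base + off for name, off in entry_offsets.items()}


def two_pass(entry_offsets, first_pass_base, read_moduledata_text):
    # Pass 1: VAs may be wrong, but names are enough to go and find the moduledata.
    first = resolve(entry_offsets, first_pass_base)
    # Read the .text address that the moduledata stores.
    text_base = read_moduledata_text(first)
    # Pass 2: redo resolution with the base Go itself uses.
    return first, resolve(entry_offsets, text_base)


entry_offsets = {"example.func": 0x58780}  # hypothetical symbol name, derived offset
first, second = two_pass(entry_offsets, 0x4022A0, lambda _symbols: 0x4023A0)

print(hex(first["example.func"]))   # 0x45aa20 (the incorrect VA)
print(hex(second["example.func"]))  # 0x45ab20 (the correct VA)
```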
I will be implementing this in the coming days and testing this, expect a resolution soon!
EDIT: Resolved! Please try out https://github.com/mandiant/GoReSym/releases/tag/v2.0 when you can :)
Hi @stevemk14ebr,
great work! I have tested the 2.0 version against test binaries and the offset issue seems to be patched for these cases.
I still wanted to check whether there could be another edge case, so I tested GoReSym and its Ghidra script against a Go c-shared object binary:
$ cat main.go
package main
//#include <stdio.h>
//
//void hello(void) {
// printf("Hello from C!");
//}
import "C"
func main() {
C.hello()
}
$ go install -buildmode=shared -linkshared std
$ go build -buildmode c-shared -ldflags="-s -w" -o main.so main.go
$ GoReSym -d -p -t main.so > main.so.json
Results coming from the Ghidra script:
no estimation renamed 0 and created 1070 functions
entry to _rt0_ function mapping renamed 12 and created 1058 functions
pclntab to TabMeta VA mapping renamed 0 and created 1070 functions
known function names mapping renamed 799 and created 271
If I use GoReSym given offsets:
$ jq .UserFunctions main.so.json
[
{
"Start": 438080,
"End": 438176,
"PackageName": "main",
"FullName": "main._Cfunc_hello"
},
{
"Start": 438176,
"End": 438230,
"PackageName": "main",
"FullName": "main.main"
}
]
I end up nowhere (expected for shared object):
No results for 0x6afa0
Executing the known function names strategy determines an offset of 1048576:
**************************************************************
* FUNCTION *
**************************************************************
undefined main.main()
...
0016afa0 49 3b 66 10 CMP RSP, qword ptr [R14 + 0x10]
0016afa4 76 29 JBE LAB_0016afcf
0016afa6 48 83 ec 08 SUB RSP, 0x8
0016afaa 48 89 2c 24 MOV qword ptr [RSP]=>local_8, RBP
0016afae 48 8d 2c 24 LEA RBP=>local_8, [RSP]
0016afb2 e8 89 ff CALL main._Cfunc_hello
0016afb7 45 0f 57 ff XORPS XMM15, XMM15
0016afbb 4c 8b 35 MOV R14, qword ptr [PTR_001d4fc8]
0016afc2 64 4d 8b 36 MOV R14, qword ptr FS:[R14]
0016afc6 48 8b 2c 24 MOV RBP=>local_8, qword ptr [RSP]
0016afca 48 83 c4 08 ADD RSP, 0x8
0016afce c3 RET
Not sure if I can somehow force Ghidra to map the shared object to specific addresses found by GoReSym.
In any case, I have updated goresym_rename.py so that it runs all strategies and, if it detects that no offset gives the most renames, does not prompt for strategy/offset input.
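The decision described here could look something like the sketch below. It is an illustration under my own assumptions, not the logic in goresym_rename.py; pick_strategy is a hypothetical name, and the counts are the renamed-function numbers reported for the PE example earlier in this thread.

```python
NO_ESTIMATION = "no estimation"


def pick_strategy(renamed_by_strategy):
    """Return (strategy, needs_prompt) given strategy name -> renamed function count."""
    best = max(renamed_by_strategy, key=renamed_by_strategy.get)
    # If applying no offset is at least as good as anything else, skip the dialog.
    if renamed_by_strategy[NO_ESTIMATION] >= renamed_by_strategy[best]:
        return NO_ESTIMATION, False
    return best, True


pe_results = {
    "entry to _rt0_ function mapping": 1076,
    "pclntab to TabMeta VA mapping": 0,
    "known function names mapping": 0,
    NO_ESTIMATION: 0,
}
print(pick_strategy(pe_results))  # ('entry to _rt0_ function mapping', True)
```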
Additionally, I have found one case where UserFunctions is set to null in the GoReSym output, so I have made another change in the extract_funcs function to check both UserFunctions and StdFunctions before adding them to the list of functions. This prevents the error about concatenating NoneType to list.
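A null-safe way to combine the two lists might look like this. It is a sketch only and not the actual extract_funcs from the script; the function name and the sample JSON are made up for illustration.

```python
import json


def extract_funcs_sketch(goresym):
    """Collect user and standard functions, treating null or missing keys as empty."""
    funcs = []
    for key in ("UserFunctions", "StdFunctions"):
        funcs.extend(goresym.get(key) or [])
    return funcs


# Made-up sample where UserFunctions is null, as in the case described above.
sample = json.loads(
    '{"UserFunctions": null, "StdFunctions": [{"Start": 1, "FullName": "fmt.Println"}]}'
)
print(len(extract_funcs_sketch(sample)))  # 1
```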
Awesome! I will merge this then, thanks so much for your contribution and bug report(s). Do you know how Ghidra determines the base address to load the shared object at? I can just mirror that logic in GoReSym if it's a standard thing.
Hi @stevemk14ebr,
good question. I checked, and here are my observations: Ghidra always loads ELF files starting from 0x100000 (ELF header). The addresses where other functions get mapped are determined via offsets written in the .dynsym section.
Using our last shared object file as an example, we can check the .dynsym entry of the hello function:
00100950 93 00 00 00 12 Elf64_Sym [57]
00100950 93 00 00 00 ddw 93h st_name ; hello
00100954 12 db 12h st_info
00100955 00 db 0h st_other
00100956 0d 00 dw Dh st_shndx
00100958 e0 af 06 00 00 dq 6AFE0h st_value
00100960 17 00 00 00 00 dq 17h st_size
The st_value is our offset added to the base where the ELF header is mapped, so the hello function should be located at 0x100000+0x6afe0=0x16afe0.
Checking that address reveals the disassembly of the hello function:
**************************************************************
* FUNCTION *
**************************************************************
undefined hello()
...
0016afe0 f3 0f 1e fa ENDBR64
0016afe4 48 8d 35 LEA RSI, [s_Hello_from_C!_001849f1]
0016afeb bf 01 00 MOV EDI, 0x1
0016aff0 31 c0 XOR EAX, EAX
0016aff2 e9 d9 71 JMP LAB_001121d0
What I don't see here is how stripped functions are laid out, but if I apply a fixed base offset of 0x100000 to the Start values of all functions found by GoReSym, I end up getting correct disassemblies.
Please note that I have only checked this for ELF amd64 shared objects, results might be different for other formats (e.g. DLL) or architectures.
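For reference, the same calculation can be reproduced outside Ghidra. The sketch below rebuilds the 24-byte Elf64_Sym entry from the field values shown in the listing (the listing truncates the raw bytes, so the entry is packed from the fields rather than copied) and adds the 0x100000 base observed above. The constant and variable names are my own.

```python
import struct

# Little-endian Elf64_Sym: st_name, st_info, st_other, st_shndx, st_value, st_size
ELF64_SYM = struct.Struct("<IBBHQQ")
GHIDRA_ELF_BASE = 0x100000  # base observed in Ghidra for this ELF shared object

# Field values of the hello entry as shown in the .dynsym listing.
raw = ELF64_SYM.pack(0x93, 0x12, 0x00, 0x0D, 0x6AFE0, 0x17)
st_name, st_info, st_other, st_shndx, st_value, st_size = ELF64_SYM.unpack(raw)

print(hex(GHIDRA_ELF_BASE + st_value))  # 0x16afe0, the hello function
print(hex(GHIDRA_ELF_BASE + 438176))    # 0x16afa0, main.main from GoReSym's Start
```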
Ok I took a look at this and compared it to how IDA loads the .so. I am seeing differing behavior here between what IDA chooses as the base and what Ghidra does, so I'm not going to be integrating this into GoReSym itself. This ends up being great that we kept the offset finding logic you've created!
Hi,
I have noticed that GoReSym information can be imported to IDA, but there is no script for information import into Ghidra. This PR adds a Ghidra script which imports information from GoReSym json output file.
Since I have noticed that GoReSym start addresses do not map exactly 1-to-1 with Ghidra function addresses, the script supports several ways of performing an estimation of the offset when renaming functions. The offset can be specified manually as well.
I have tested this against several stripped binaries and the script seems to be working well.
Let me know what you think.