mandiant / GoReSym

Go symbol recovery tool
MIT License

Import support for Ghidra #11

Closed turekt closed 1 year ago

turekt commented 1 year ago

Hi,

I noticed that GoReSym information can be imported into IDA, but there is no script for importing it into Ghidra. This PR adds a Ghidra script which imports information from the GoReSym JSON output file.

Since I have noticed that GoReSym start addresses do not map exactly 1-to-1 with Ghidra function addresses, the script supports several ways of performing an estimation of the offset when renaming functions. The offset can be specified manually as well.
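
One plausible way to estimate such an offset is sketched below, purely as an illustration. The function name `estimate_offset`, the input dictionaries, and the addresses are hypothetical and are not taken from the script in this PR. The idea is to compare the addresses of functions whose names are known on both sides and take the most common difference:

```python
from collections import Counter


def estimate_offset(goresym_starts, ghidra_entries):
    """Return the most common (Ghidra - GoReSym) address delta, or None.

    Both arguments map a function name to an integer address. Only names
    present in both mappings contribute to the vote.
    """
    deltas = Counter(
        ghidra_entries[name] - start
        for name, start in goresym_starts.items()
        if name in ghidra_entries
    )
    if not deltas:
        return None
    return deltas.most_common(1)[0][0]


# Made-up addresses: two of three functions agree on a delta of -0x100.
goresym = {"a": 0x1000, "b": 0x2000, "c": 0x3000}
ghidra = {"a": 0x0F00, "b": 0x1F00, "c": 0x3500}
print(estimate_offset(goresym, ghidra))  # -256
```

Taking the most common delta rather than the first match makes the estimate tolerant of a few mislabeled functions.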

I have tested this against several stripped binaries and the script seems to be working well.

Let me know what you think.

stevemk14ebr commented 1 year ago

Thanks so much for your contribution! We will review this as soon as possible.

I have one question for you, however: could you provide an example of a case where the function VAs differ between Ghidra and GoReSym? I ask because GoReSym has historically had one or two bugs where VAs were not recovered correctly and were off by some fixed offset. This sounds similar to what you may be seeing, so I would like to first make sure you haven't found a GoReSym bug. A binary to review and the JSON output of GoReSym, or even just an image, would be more than enough for me to understand what you may be seeing.

HongThatCong commented 1 year ago

Hi @stevemk14ebr, the wrong VA addresses still exist; this has not been fixed. I tested the current main GoReSym with the GoReSym exe on Windows. (Three screenshots attached.)

stevemk14ebr commented 1 year ago

@HongThatCong I will investigate this starting next week, thank you very much.

turekt commented 1 year ago

Hi,

you can basically use any stripped binary since I have observed that all binaries needed an offset of some sort. For instance, with a simple hello world binary stripped in two ways with strip and ldflags:

$ cat main.go
package main

import "fmt"

func main() {
    fmt.Println("Hello world!")
}
$ go build -o main_strip main.go
$ strip main_strip
$ go build -o main_ldstrip -ldflags "-s -w" main.go
$ GoReSym -d -p -t main_strip > main_strip.json
$ GoReSym -d -p -t main_ldstrip > main_ldstrip.json
$ jq .TabMeta.VA main_strip.json main_ldstrip.json 
4951232
4951008

After loading these into Ghidra, the script correctly estimated:

What I am seeing from GoReSym output file is a start address of 5475264:

$ jq .UserFunctions main_strip.json 
[
  {
    "Start": 5475264,
    "End": 5475367,
    "PackageName": "main",
    "FullName": "main.main"
  }
]

If I go to that address in Ghidra, I end up in the .bss section:

        00538bc0                 ??         ??

I am not sure, but since I was getting different offsets for different binaries, I assumed that this is just a side effect of how Ghidra loads things during disassembly.
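
GoReSym reports addresses in decimal while the Ghidra listing shows them in hex, so a small helper makes the comparison above easier to repeat. This is a minimal sketch: the function name `starts_as_hex` is hypothetical, and the embedded JSON is just the excerpt of `main_strip.json` quoted above.

```python
import json

# Excerpt of the GoReSym output shown above (main_strip.json).
sample = """
{"UserFunctions": [{"Start": 5475264, "End": 5475367,
                    "PackageName": "main", "FullName": "main.main"}]}
"""


def starts_as_hex(goresym_json):
    """Map each user function name to its Start address as a hex string."""
    data = json.loads(goresym_json)
    # UserFunctions may be missing or null, so fall back to an empty list.
    return {f["FullName"]: hex(f["Start"]) for f in data.get("UserFunctions") or []}


print(starts_as_hex(sample))  # {'main.main': '0x538bc0'}
```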

turekt commented 1 year ago

Thanks for the review! I have made changes based on your comments except for the Enum usage for list of choices.

In addition to your recommended changes, I have tested the script with Python3 in Ghidrathon and made a few simple changes in order to support both Jython/Python2.7 in Ghidra and Python3 in Ghidrathon.

This script should now work in both Ghidra and Ghidrathon.

williballenthin commented 1 year ago

great work @turekt!

stevemk14ebr commented 1 year ago

I have been able to reproduce the issue with the VAs reported by GoReSym being offset by some value. It appears this affects ELF binaries only.

turekt commented 1 year ago

Hi @stevemk14ebr,

let me just note that I saw the same behaviour for PE files as well.

I have managed to reproduce this for PE as follows:

$ cat main.go
package main

import "fmt"

func main() {
    fmt.Println("Hello world!")
}
$ GOOS=windows GOARCH=amd64 go build -ldflags "-s -w" -o main_windows_ldstrip main.go
$ GoReSym -d -t -p main_windows_ldstrip > mainwin_ldstrip.json
$ jq .TabMeta.VA mainwin_ldstrip.json 
5043712

When I open this address in Ghidra, the TabMeta.VA maps correctly (I see the correct signature):

                     DAT_004cf600                                    XREF[1]:     00530960(*)  
004cf600 f0              ??         F0h
004cf601 ff              ??         FFh
004cf602 ff              ??         FFh
004cf603 ff              ??         FFh
...

So I tried to map the Start of main.main:

$ jq .UserFunctions[0] mainwin_ldstrip.json 
{
  "Start": 5358720,
  "End": 5358823,
  "PackageName": "main",
  "FullName": "main.main"
}

And when I go to that address in Ghidra, I end up in .rdata:

0051c480 ff              ??         FFh
0051c481 ff              ??         FFh
0051c482 ff              ??         FFh
0051c483 ff              ??         FFh
0051c484 88              ??         88h
0051c485 b9              ??         B9h
0051c486 00              ??         00h
0051c487 00              ??         00h

Running the script shows the following results with available strategies:

entry to _rt0_ function mapping renamed 1076 and created 371 functions
pclntab to TabMeta VA mapping renamed 0 and created 1447 functions
known function names mapping renamed 0 and created 1447 functions
no estimation renamed 0 and created 1447 functions

After choosing the recommended strategy and setting the estimated offset of -581632, main.main ends up at 0x48e480 with correct disassembly shown:

                     **************************************************************
                     *                          FUNCTION                          *
                     **************************************************************
                     undefined main.main()
...
0048e480 49 3b 66 10     CMP        RSP, qword ptr [R14 + 0x10]
0048e484 76 56           JBE        LAB_0048e4dc
0048e486 48 83 ec 40     SUB        RSP, 0x40
0048e48a 48 89 6c        MOV        qword ptr [RSP + local_8], RBP
0048e48f 48 8d 6c        LEA        RBP=>local_8, [RSP + 0x38]
0048e494 44 0f 11        MOVUPS     xmmword ptr [RSP + local_18[0]], XMM15
0048e49a 48 8d 15        LEA        RDX, [string]
0048e4a1 48 89 54        MOV        qword ptr [RSP + local_18[0]], RDX=>string
0048e4a6 48 8d 15        LEA        RDX, [PTR_DAT_004cb840]
0048e4ad 48 89 54        MOV        qword ptr [RSP + local_18[8]], RDX=>PTR_DAT_00
0048e4b2 48 8b 1d        MOV        RBX, qword ptr [DAT_00542a70]
0048e4b9 48 8d 05        LEA        RAX, [interface_io.Writer_impl_*os.File]
0048e4c0 48 8d 4c        LEA        RCX=>local_18, [RSP + 0x28]
0048e4c5 bf 01 00        MOV        EDI, 0x1
0048e4ca 48 89 fe        MOV        RSI, RDI
0048e4cd e8 2e aa        CALL       fmt.Fprintln
0048e4d2 48 8b 6c        MOV        RBP=>local_8, qword ptr [RSP + 0x38]
0048e4d7 48 83 c4 40     ADD        RSP, 0x40
0048e4db c3              RET
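
The address arithmetic in this PE example can be checked with a few lines of Python. This is only a sketch of the idea; `rebase` is a hypothetical helper name, not a function from the script.

```python
def rebase(func, offset):
    """Return a copy of a GoReSym function record with Start/End shifted by offset."""
    shifted = dict(func)
    shifted["Start"] = func["Start"] + offset
    shifted["End"] = func["End"] + offset
    return shifted


# Values taken from the GoReSym output and the estimated offset above.
main_main = {"Start": 5358720, "End": 5358823, "FullName": "main.main"}
fixed = rebase(main_main, -581632)
print(hex(main_main["Start"]))  # 0x51c480 (the .rdata address seen without an offset)
print(hex(fixed["Start"]))      # 0x48e480 (where main.main is actually disassembled)
```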

stevemk14ebr commented 1 year ago

Oh interesting, I'll investigate further!

stevemk14ebr commented 1 year ago

I have located the bug! In this routine (https://github.com/golang/go/blob/027ff3f47d5d6557067324c342c8e14d7da1cf7a/src/debug/gosym/pclntab.go#L411), Go adds the section base to resolve the final VA, but I pass the base of the current "candidate" rather than the actual .text section base address. If you subtract the base of the section that contains the pclntab from the .text section base address, the result is the 'correction' offset that your script is finding.

Not sure about a good fix yet, stay tuned, and thanks for reporting. I'll merge this after this bug is addressed as we may not need the logic to locate the offset anymore.

turekt commented 1 year ago

Hi @stevemk14ebr,

nice catch! I have checked your conclusion against test binaries and I can confirm that your conclusion is correct for most go binaries except for one case when using cgo. I have put reproduction steps for this case in the section below the comment.

Because of this cgo case, I feel that it might be better to leave the offset calculation logic in the final version of this script so that it covers more use cases. We could discuss changing the script so that it performs the initial estimation and checks whether no offset gives the most function mappings. If it does, the script can skip the dialog for choosing the offset estimation strategy and offset value.

Let's see how the final fix is going to look. We can decide whether or not to leave the offset calculation logic after that since the fix may cover the cgo case as well. If I can help you with the fix in any way, just let me know.

P.S. There was a bug in the script: when GoReSym was not executed with the -d flag, the script raised an exception about concatenating NoneType and list, or something along those lines. This is now fixed.

cgo case reproduction steps

I have created the following .go file that runs C code:

```
$ cat main.go
package main

//#include <stdio.h>
//
//void hello(void) {
//    printf("Hello world!");
//}
import "C"

func main() {
    C.hello()
}
$ go build -o main main.go && strip main
```

The `TabMeta.VA` maps correctly to the `.gopclntab` section:

```
$ GoReSym -d -p -t main > main.json
$ jq .TabMeta.VA main.json
4729632
```

But when checking user functions, they map to addresses that end up in the `.bss` section:

```
$ jq .UserFunctions main.json
[
  {
    "Start": 5091936,
    "End": 5092032,
    "PackageName": "main",
    "FullName": "main._Cfunc_hello"
  },
  {
    "Start": 5092032,
    "End": 5092084,
    "PackageName": "main",
    "FullName": "main.main"
  }
]
```

In Ghidra:

```
004db260                 ??         ??
...
004db2c0                 ??         ??
```

Afterwards, running the `goresym_rename.py` script gives the following output:

```
entry to _rt0_ function mapping renamed 0 and created 1064 functions
pclntab to TabMeta VA mapping renamed 239 and created 1887 functions
known function names mapping renamed 3178 and created 1072 functions
no estimation renamed 0 and created 5312 functions
```

By choosing the `known function names mapping` strategy, we get an offset of -526208.

The `.gopclntab` section is located at 0x482b20 and the `.text` section is located at 0x4022a0. If I try to calculate the offset by using the prior conclusion, I get 0x4022a0-0x482b20=-526464.

After using the suggested offset of -526208, `main.main` ends up properly disassembled at address 0x45ab40:

```
                     **************************************************************
                     *                          FUNCTION                          *
                     **************************************************************
                     undefined main.main()
...
0045ab40 49 3b 66 10     CMP        RSP, qword ptr [R14 + 0x10]
0045ab44 76 27           JBE        LAB_0045ab6d
0045ab46 48 83 ec 08     SUB        RSP, 0x8
0045ab4a 48 89 2c 24     MOV        qword ptr [RSP]=>local_8, RBP
0045ab4e 48 8d 2c 24     LEA        RBP=>local_8, [RSP]
0045ab52 e8 89 ff        CALL       main._Cfunc_hello
0045ab57 45 0f 57 ff     XORPS      XMM15, XMM15
0045ab5b 64 4c 8b        MOV        R14, qword ptr FS:[0xfffffff8]
0045ab64 48 8b 2c 24     MOV        RBP=>local_8, qword ptr [RSP]
0045ab68 48 83 c4 08     ADD        RSP, 0x8
0045ab6c c3              RET
```

stevemk14ebr commented 1 year ago

EDIT: I have reproduced your CGO issue. Thanks for reporting this one too!

I am looking into resolving the .text base bug you previously reported. I agree with your reasoning that we should keep the logic but adjust it to first check whether an offset of 0 is acceptable. Would you be able to commit that? Once you've made that modification I will merge this PR, and then we can handle these bugs in separate issues going forward.

stevemk14ebr commented 1 year ago

Ok I have a solution I think will work to resolve both of these. Let me explain the issue you reported with CGO first though.

The large offset you've seen is mostly due to the previous bug of me using the incorrect section base. Assuming I use the .text section base, the mismatch becomes much smaller: an offset of only 0x100 is required.

.text base (via file header): 0x4022a0
Correct: 0x45AB20
Incorrect: 0x45AA20

That small delta is apparently due to something the go linker does (for cgo only?). If we look at the .text section in a disassembler and then compare it with the .text address stored in the moduledata structure we see a mismatch:

.text (via file header) base: 0x4022a0
.text (via moduledata) base: 0x4023A0

It is off by exactly 0x100. I do not know why CGO does this, but it seems to shove some shim code into the start of the real .text section. It then adjusts the moduledata structure's .text base pointer by that size, so that the other places where Go internally resolves these symbols end up being correct. I tried looking at the compiler code to find out why they would do it this way, but I can't see a reason.
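
The numbers reported for the cgo binary in this thread can be cross-checked with a short worked calculation. The constant names below are illustrative; the values are the ones quoted in the comments above.

```python
# Addresses and offset quoted in this thread for the stripped cgo binary.
PCLNTAB_SECTION_BASE = 0x482B20  # section containing the pclntab (.gopclntab)
TEXT_BASE_HEADER = 0x4022A0      # .text base according to the file header
TEXT_BASE_MODULEDATA = 0x4023A0  # .text base stored in moduledata
SCRIPT_OFFSET = -526208          # offset found by "known function names mapping"

# Correction computed from each candidate .text base.
header_correction = TEXT_BASE_HEADER - PCLNTAB_SECTION_BASE
moduledata_correction = TEXT_BASE_MODULEDATA - PCLNTAB_SECTION_BASE

print(header_correction)      # -526464
print(moduledata_correction)  # -526208
print(hex(moduledata_correction - header_correction))  # 0x100
print(moduledata_correction == SCRIPT_OFFSET)          # True
```

With the file-header base the correction is 0x100 short, while the moduledata base reproduces the script's offset exactly.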

Regardless, the correct solution appears to be rather simple. First, we continue to resolve the symbols as we do now, using the base of the section that the pclntab is in. This will produce incorrect VAs, but we should still be able to find symbols, and that should be enough to locate the moduledata. Once we find the moduledata, we read out the .text field it stores to get the 'real' (faked) .text section address that Go actually uses. We then redo symbol resolution using this new, correct .text base. We already locate and parse the moduledata, so it shouldn't be much extra work to read that field and retry.

The reason I am bothering to parse the pclntab twice in this way is because packed executables totally screw with section names and sizes, so it is not safe for us to just locate the .text via the file headers.

I will be implementing this in the coming days and testing this, expect a resolution soon!

EDIT: Resolved! Please try out https://github.com/mandiant/GoReSym/releases/tag/v2.0 when you can :)

turekt commented 1 year ago

Hi @stevemk14ebr,

great work! I have tested the 2.0 version against test binaries and the offset issue seems to be patched for these cases.

I still wanted to check whether there could be another edge case, so I tested GoReSym and its Ghidra script against a Go c-shared object binary:

$ cat main.go
package main

//#include <stdio.h>
//
//void hello(void) {
//  printf("Hello from C!");
//}
import "C"

func main() {
    C.hello()
}
$ go install -buildmode=shared -linkshared std
$ go build -buildmode c-shared -ldflags="-s -w" -o main.so main.go
$ GoReSym -d -p -t main.so > main.so.json

Results coming from the Ghidra script:

no estimation renamed 0 and created 1070 functions
entry to _rt0_ function mapping renamed 12 and created 1058 functions
pclntab to TabMeta VA mapping renamed 0 and created 1070 functions
known function names mapping renamed 799 and created 271

If I use GoReSym given offsets:

$ jq .UserFunctions main.so.json
[
  {
    "Start": 438080,
    "End": 438176,
    "PackageName": "main",
    "FullName": "main._Cfunc_hello"
  },
  {
    "Start": 438176,
    "End": 438230,
    "PackageName": "main",
    "FullName": "main.main"
  }
]

I end up nowhere (expected for shared object):

No results for 0x6afa0

Executing known function names strategy determines an offset of 1048576:

                     **************************************************************
                     *                          FUNCTION                          *
                     **************************************************************
                     undefined main.main()
...
0016afa0 49 3b 66 10     CMP        RSP, qword ptr [R14 + 0x10]
0016afa4 76 29           JBE        LAB_0016afcf
0016afa6 48 83 ec 08     SUB        RSP, 0x8
0016afaa 48 89 2c 24     MOV        qword ptr [RSP]=>local_8, RBP
0016afae 48 8d 2c 24     LEA        RBP=>local_8, [RSP]
0016afb2 e8 89 ff        CALL       main._Cfunc_hello
0016afb7 45 0f 57 ff     XORPS      XMM15, XMM15
0016afbb 4c 8b 35        MOV        R14, qword ptr [PTR_001d4fc8]
0016afc2 64 4d 8b 36     MOV        R14, qword ptr FS:[R14]
0016afc6 48 8b 2c 24     MOV        RBP=>local_8, qword ptr [RSP]
0016afca 48 83 c4 08     ADD        RSP, 0x8
0016afce c3              RET

Not sure if I can somehow force Ghidra to map the shared object to specific addresses found by GoReSym.

In any case, I have updated the goresym_rename.py so that it runs all strategies and if it detects that no offset gives the most renames, it will not prompt for strategy/offset input.
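
The selection rule described above could look roughly like the following. This is a hedged sketch rather than the actual code in `goresym_rename.py`: the function name `pick_strategy` and the tie-breaking rule (no offset wins ties) are assumptions, and the counts are the rename results from the c-shared run above.

```python
NO_ESTIMATION = "no estimation"


def pick_strategy(renamed_by_strategy):
    """Return (strategy, needs_prompt) given rename counts per strategy.

    Assumption: if applying no offset renames at least as many functions as
    any other strategy, it is used directly and the dialog is skipped.
    """
    best = max(renamed_by_strategy, key=renamed_by_strategy.get)
    if renamed_by_strategy.get(NO_ESTIMATION, 0) >= renamed_by_strategy[best]:
        return NO_ESTIMATION, False
    return best, True


# Rename counts from the c-shared example above.
results = {
    "no estimation": 0,
    "entry to _rt0_ function mapping": 12,
    "pclntab to TabMeta VA mapping": 0,
    "known function names mapping": 799,
}
print(pick_strategy(results))  # ('known function names mapping', True)
```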

Additionally, I have found one case where UserFunctions are set to null in GoReSym output so I have made another change in the extract_funcs function to check for both UserFunctions and StdFunctions before adding them to the list of functions in order to prevent getting the concatenating NoneType to list error.
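
A null-safe way to combine the two lists might look like this. It is a minimal sketch of the idea and not the PR's actual `extract_funcs` implementation; the sample record is made up for illustration.

```python
import json


def extract_funcs(data):
    """Collect UserFunctions and StdFunctions, treating null or missing as empty."""
    funcs = []
    for key in ("UserFunctions", "StdFunctions"):
        # `or []` covers both a missing key and an explicit JSON null.
        funcs.extend(data.get(key) or [])
    return funcs


# Made-up GoReSym-like output where UserFunctions is null.
data = json.loads(
    '{"UserFunctions": null,'
    ' "StdFunctions": [{"Start": 4198400, "FullName": "fmt.Println"}]}'
)
print(len(extract_funcs(data)))  # 1
```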

stevemk14ebr commented 1 year ago

Awesome! I will merge this then, thanks so much for your contribution and bug report(s). Do you know how Ghidra determines the base address to load the shared object at? I can just mirror that logic in GoReSym if it's a standard thing.

turekt commented 1 year ago

Hi @stevemk14ebr,

good question. I checked, and here are my observations: Ghidra loads ELF shared objects starting at 0x100000 (where the ELF header is mapped). The addresses at which other functions get mapped are determined via offsets written in the .dynsym section.

By using our last shared object file as an example, we can check the hello function .dynsym entry:

00100950 93 00 00 00 12  Elf64_Sym                         [57]
   00100950 93 00 00 00     ddw       93h                     st_name       ; hello
   00100954 12              db        12h                     st_info
   00100955 00              db        0h                      st_other
   00100956 0d 00           dw        Dh                      st_shndx
   00100958 e0 af 06 00 00  dq        6AFE0h                  st_value
   00100960 17 00 00 00 00  dq        17h                     st_size

The st_value is our offset, which is added to the base where the ELF header is mapped, so the hello function should be located at 0x100000+0x6afe0=0x16afe0.

Checking that address reveals the disassembly of the hello function:

                     **************************************************************
                     *                          FUNCTION                          *
                     **************************************************************
                     undefined hello()
...
0016afe0 f3 0f 1e fa     ENDBR64
0016afe4 48 8d 35        LEA        RSI, [s_Hello_from_C!_001849f1]
0016afeb bf 01 00        MOV        EDI, 0x1
0016aff0 31 c0           XOR        EAX, EAX
0016aff2 e9 d9 71        JMP        LAB_001121d0

What I don't see here is how stripped functions are laid out, but if I use fixed base offset of 0x100000 on all Start values of functions found by GoReSym, I end up getting correct disassemblies.

Please note that I have only checked this for ELF amd64 shared objects, results might be different for other formats (e.g. DLL) or architectures.
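
The .dynsym arithmetic above can be reproduced with Python's `struct` module. This is a sketch under stated assumptions: the symbol bytes are rebuilt from the field values in the listing (the listing truncates the raw bytes), and `GHIDRA_BASE` is the base observed for this particular shared object, not a general rule.

```python
import struct

GHIDRA_BASE = 0x100000  # base observed in Ghidra for this ELF amd64 shared object

# Elf64_Sym layout (little-endian, 24 bytes):
# st_name u32, st_info u8, st_other u8, st_shndx u16, st_value u64, st_size u64
ELF64_SYM = struct.Struct("<IBBHQQ")

# Rebuild the `hello` entry from the field values shown in the listing.
raw = ELF64_SYM.pack(0x93, 0x12, 0x00, 0x0D, 0x6AFE0, 0x17)
st_name, st_info, st_other, st_shndx, st_value, st_size = ELF64_SYM.unpack(raw)

print(hex(GHIDRA_BASE + st_value))  # 0x16afe0, where hello() is disassembled
print(hex(GHIDRA_BASE + 438176))    # 0x16afa0, main.main Start plus the same base
```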

stevemk14ebr commented 1 year ago

OK, I took a look at this and compared it to how IDA loads the .so. IDA and Ghidra choose different bases here, so I'm not going to integrate this into GoReSym itself. It turns out to be great that we kept the offset-finding logic you created!