treeform / hottie

Sampling profiler that finds hot paths in your code.
MIT License
44 stars 1 forks source link

Doesn't work on windows #12

Open creikey opened 2 years ago

creikey commented 2 years ago

Here is the output of running hottie:

Program ended
 samples           time   percent what
   10127    10604.209ms  100.000%
Samples per second: 955.0 totalTime: 10.604ms

On my project here: https://github.com/creikey/recurrent-evolution/commit/b3303cce553c411081f056353b4897f5d457cdca

Compiled like this: nim -d:release --debugger:native --stacktrace:on c src\recurrent_evolution.nim && hottie src\recurrent_evolution.exe I am on windows 11

chancyk commented 1 year ago

I get the same result. The issue seems to be that the Rip address is different from those that appear in the objdump file. In my case I get 140737458009076 for Rip but my objdump addresses after hex conversion are between 5368713216 and 5370365077.

Maybe there's some kind of prefixing going on in the Rip address? I did some experiments but no luck.

treeform commented 1 year ago

What compiler are you running to compile? The objdump.exe needs to come from that compiler.

treeform commented 1 year ago

It would be nice if you could provide a simple program, maybe one of the test programs, its compile command, its object dump command and output.

I also don't have windows 11 to test on. I should get that.

chancyk commented 1 year ago

I've run it on a slightly modified version of test2 so it terminates:

import os

var
  i = 0
  x = 0
while true:
  echo "here ", i, " ", os.getCurrentProcessId(), " //", x
  inc i
  for j in 0 .. 100_000_000:
    inc x

  if i > 10:
    break

I was using the nimble gcc originally instead of the one paired with my objdump, but after switching the compiler I get similar hottie output but the addresses are closer.

I've compiled using:

set CC=C:\MinGW64\bin\gcc.exe
nim c --cc:env -r --verbosity:3 --debugger:native -d:release --gc:arc .\examples\test2.nim

The addresses are in the right range now, but I'm still not getting results for the --lines and --procedures modes. I get something from --addresses though, using: hottie -a .\examples\test2.exe

Results:

Program ended
 samples           time   percent what
      42      238.605ms   34.426% C:\Src\hottie\dump.txt:9238
      28      159.070ms   22.951% C:\Src\hottie\dump.txt:9250
      24      136.346ms   19.672% C:\Src\hottie\dump.txt:9249
      23      130.664ms   18.852% C:\Src\hottie\dump.txt:9237
       4       22.724ms    3.279%
       1        5.681ms    0.820% C:\Src\hottie\dump.txt:9240
Samples per second: 176.0 totalTime: 0.693ms

And here's the objdump file: https://www.dropbox.com/s/3umdt4mro947ydz/dump.txt?dl=0

Compiler Version:

λ C:\MinGW64\bin\gcc.exe -v
Using built-in specs.
COLLECT_GCC=C:\MinGW64\bin\gcc.exe
COLLECT_LTO_WRAPPER=C:/MinGW64/bin/../libexec/gcc/x86_64-w64-mingw32/8.1.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../../../src/gcc-8.1.0/configure --host=x86_64-w64-mingw32 --build=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --prefix=/mingw64 --with-sysroot=/c/mingw810/x86_64-810-win32-seh-rt_v6-rev0/mingw64 --enable-shared --enable-static --disable-multilib --enable-languages=c,c++,fortran,lto --enable-libstdcxx-time=yes --enable-threads=win32 --enable-libgomp --enable-libatomic --enable-lto --enable-graphite --enable-checking=release --enable-fully-dynamic-string --enable-version-specific-runtime-libs --disable-libstdcxx-pch --disable-libstdcxx-debug --enable-bootstrap --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-gnu-as --with-gnu-ld --with-arch=nocona --with-tune=core2 --with-libiconv --with-system-zlib --with-gmp=/c/mingw810/prerequisites/x86_64-w64-mingw32-static --with-mpfr=/c/mingw810/prerequisites/x86_64-w64-mingw32-static --with-mpc=/c/mingw810/prerequisites/x86_64-w64-mingw32-static --with-isl=/c/mingw810/prerequisites/x86_64-w64-mingw32-static --with-pkgversion='x86_64-win32-seh-rev0, Built by MinGW-W64 project' --with-bugurl=https://sourceforge.net/projects/mingw-w64 CFLAGS='-O2 -pipe -fno-ident -I/c/mingw810/x86_64-810-win32-seh-rt_v6-rev0/mingw64/opt/include -I/c/mingw810/prerequisites/x86_64-zlib-static/include -I/c/mingw810/prerequisites/x86_64-w64-mingw32-static/include' CXXFLAGS='-O2 -pipe -fno-ident -I/c/mingw810/x86_64-810-win32-seh-rt_v6-rev0/mingw64/opt/include -I/c/mingw810/prerequisites/x86_64-zlib-static/include -I/c/mingw810/prerequisites/x86_64-w64-mingw32-static/include' CPPFLAGS=' -I/c/mingw810/x86_64-810-win32-seh-rt_v6-rev0/mingw64/opt/include -I/c/mingw810/prerequisites/x86_64-zlib-static/include -I/c/mingw810/prerequisites/x86_64-w64-mingw32-static/include' LDFLAGS='-pipe -fno-ident -L/c/mingw810/x86_64-810-win32-seh-rt_v6-rev0/mingw64/opt/lib -L/c/mingw810/prerequisites/x86_64-zlib-static/lib -L/c/mingw810/prerequisites/x86_64-w64-mingw32-static/lib '
Thread model: win32
gcc version 8.1.0 (x86_64-win32-seh-rev0, Built by MinGW-W64 project)
treeform commented 1 year ago

Nice the addresses are correct. The objdump parser I have probably does not support the new objdump format. I have seen this happen on mac. Man I wish objdump had like a json format that never changes. I think some tweaks to the parser are required.

chancyk commented 1 year ago

This might be an issue with newer versions of GCC. I'm getting that address mismatch between GetThreadContext and the objdump for the v11.1 gcc and objdump from .choosenim\toolchains\mingw64\bin. I'll explore some more tomorrow.

chancyk commented 1 year ago

So dumping cpuHotAddresses after compiling test2 with GCC 11.1 I get:

λ hottie .\examples\test2.exe                                                                                
Running objdump...                                                                                           
Program ended                                                                                                
 samples           time   percent what                                                                       
    7685     8049.077ms  100.000%
count : address [HEX] :: BINARY                                                                            
1155 : 140695318335471 [00007FF62E788FEF] :: 0000000000000000011111111111011000101110011110001000111111101111
623 : 140695318335539 [00007FF62E789033] :: 0000000000000000011111111111011000101110011110001001000000110011 
1016 : 140695318335474 [00007FF62E788FF2] :: 0000000000000000011111111111011000101110011110001000111111110010
1331 : 140695318335520 [00007FF62E789020] :: 0000000000000000011111111111011000101110011110001001000000100000
1245 : 140695318335477 [00007FF62E788FF5] :: 0000000000000000011111111111011000101110011110001000111111110101
449 : 140695318335467 [00007FF62E788FEB] :: 0000000000000000011111111111011000101110011110001000111111101011 
1 : 140721003754228 [00007FFC2970F2F4] :: 0000000000000000011111111111110000101001011100001111001011110100   
391 : 140695318335530 [00007FF62E78902A] :: 0000000000000000011111111111011000101110011110001001000000101010 
1 : 140721003755380 [00007FFC2970F774] :: 0000000000000000011111111111110000101001011100001111011101110100   
394 : 140695318335524 [00007FF62E789024] :: 0000000000000000011111111111011000101110011110001001000000100100 
1079 : 140695318335542 [00007FF62E789036] :: 0000000000000000011111111111011000101110011110001001000000110110
Samples per second: 954.8 totalTime: 8.049ms                                                                 

And after compiling with GCC 8.1 I get:

λ hottie .\examples\test2.exe
Running objdump...
Program ended
 samples           time   percent what
    3555     3681.458ms   69.083% C:\Src\hottie\examples\test2.nim:10
    1588     1644.488ms   30.859% C:\Users\chanc\.choosenim\toolchains\nim-1.6.6\lib\system\iterators_1.nim:91
       3        3.107ms    0.058%
622 : 4230807 [0000000000408E97] :: 0000000000000000000000000000000000000000010000001000111010010111
1012 : 4230800 [0000000000408E90] :: 0000000000000000000000000000000000000000010000001000111010010000
630 : 4230813 [0000000000408E9D] :: 0000000000000000000000000000000000000000010000001000111010011101
898 : 4230822 [0000000000408EA6] :: 0000000000000000000000000000000000000000010000001000111010100110
690 : 4230809 [0000000000408E99] :: 0000000000000000000000000000000000000000010000001000111010011001
2 : 140721003754228 [00007FFC2970F2F4] :: 0000000000000000011111111111110000101001011100001111001011110100
642 : 4230816 [0000000000408EA0] :: 0000000000000000000000000000000000000000010000001000111010100000
1 : 140721003755380 [00007FFC2970F774] :: 0000000000000000011111111111110000101001011100001111011101110100
649 : 4230803 [0000000000408E93] :: 0000000000000000000000000000000000000000010000001000111010010011
Samples per second: 965.6 totalTime: 5.329ms

And here's my dump code:

for address in cpuHotAddresses.keys():
      let count = cpuHotAddresses[address]
      if count > 0:
        echo count, " : ", address, " [", address.int.toHex(), "] :: ", toBin(address.int, 64)
chancyk commented 1 year ago

Disclaimer: This is not my expertise, so consider me a blind man lost in a maze and interpret my feedback accordingly. : )

This may be relevant: https://stackoverflow.com/questions/61061830/i-am-wondering-if-pie-does-anything-if-the-aslr-is-turned-off-on-the-system-or

There's also an objdump option called --adjust-vma= that might be useful for skewing the addresses.

chancyk commented 1 year ago

I'm definitely getting some kind of address randomization on builds from GCC 11.1. It's stable between runs of the executable though.

chancyk commented 1 year ago

This might be a means of getting the runtime offset from the Windows API you're already using:

#include <windows.h>
#include <TlHelp32.h>

uintptr_t GetModuleBaseAddress(DWORD procId, wchar_t* modName)
{
    //initialize to zero for error checking
    uintptr_t modBaseAddr = 0;

    //get a handle to a snapshot of all modules
    HANDLE hSnap = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE | TH32CS_SNAPMODULE32, procId);

    //check if it's valid
    if (hSnap != INVALID_HANDLE_VALUE)
    {
        //this struct holds the actual module information
        MODULEENTRY32 modEntry;

        //this is required for the function to work
        modEntry.dwSize = sizeof(modEntry);

        //If a module exists, get it's entry
        if (Module32First(hSnap, &modEntry))
        {
            //loop through the modules
            do
            {
                //compare the module name against ours
                if (!_wcsicmp(modEntry.szModule, modName))
                {
                    //copy the base address and break out of the loop
                    modBaseAddr = (uintptr_t)modEntry.modBaseAddr;
                    break;
                }

            //each iteration we grab the next module entry
            } while (Module32Next(hSnap, &modEntry));
        }
    }

    //free the handle
    CloseHandle(hSnap);
    return modBaseAddr;
}

Found on this random game exploit website: https://guidedhacking.com/threads/get-module-base-address-tutorial-dwgetmodulebaseaddress.5781/

chancyk commented 1 year ago

Okay... apologies for all the comments but I think I figured out how to get back to the objdump addresses.

I'm going to be a little verbose for posterity's sake. There's a value in the Windows binary PE format (Portable Executable) called ImageBase. An easy way to dump these values is to pip install the Python pereader module and then run:

python -c "import pereader;pe = pereader.PE(r'.\examples\test2.exe');print(pe.dump())"

ImageBase is towards the top under the # Optional Header section. This value will correspond with the beginning of the output from objdump -dl .\examples\test2.exe, in my case (0x140000000):

.\examples\test2.exe:     file format pei-x86-64

Disassembly of section .text:

0000000140001000 <__mingw_invalidParameterHandler>:
objdump: Dwarf Error: Could not find abbrev number 1764.
   140001000:   55                      push   %rbp
   140001001:   48 89 e5                mov    %rsp,%rbp

The second piece comes from using the Toolhelp32 API. We'll iterate through the modules looking for the name that matches our exe name with (C++ example in the guidedhacking.com link above):

proc getBaseAddress*(pid: int, modName: string): uint =
  var h = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE or TH32CS_SNAPMODULE32, DWORD(pid))
  if h != INVALID_HANDLE_VALUE:
    var modEntry: MODULEENTRY32
    modEntry.dwSize = DWORD(sizeof(modEntry))
    var valid = Module32First(h, modEntry.addr)
    while valid == 1:
      let modEntryName = filter(modEntry.szModule, x => x > 0).map(x => chr(x)).join("")
      if modEntryName == modName:
        result = cast[uint](modEntry.modBaseAddr)
        break
      valid = Module32Next(h, modEntry.addr)
    CloseHandle(h)

Finally, we combine the two offsets to get back to our objdump addresses:

let exe_base_address = getBaseAddress(pid, "test2.exe")     # address from Toolhelp32
let image_base_address = 0x140000000                               # from PE header, may need to grab dynamically
let relative = (address.int64 - exe_base_address.int64) + image_base_address   # address is the instruction address from GetThreadContext
chancyk commented 1 year ago

I think the last piece of this for larger programs is that the entire Relocations Directory needs to be parsed from the PE header to map the runtime addresses. Large programs get chunked and moved around in memory based on the relocations.

In my case there's 31 of these (2 partial entries from peheader below):

# Base Relocations Directory -----------------------------------------------------

[IMAGE_BASE_RELOCATION]
0x1dcc00    0x0    VirtualAddress    0x195000
0x1dcc04    0x4    SizeOfBlock       0xc

0xa768 IMAGE_REL_BASED_DIR64
0x0 IMAGE_REL_BASED_ABSOLUTE

[IMAGE_BASE_RELOCATION]
0x1dcc0c    0x0    VirtualAddress    0x196000
0x1dcc10    0x4    SizeOfBlock       0x20

0xa250 IMAGE_REL_BASED_DIR64
0xa300 IMAGE_REL_BASED_DIR64
0xa320 IMAGE_REL_BASED_DIR64

The directory might also accessible from this API: https://learn.microsoft.com/en-us/windows/win32/api/dbghelp/

chancyk commented 1 year ago

So parsing the base relocations doesn't seem necessary. The issue I was running into is that DLL modules within the process show up at high address values, so those just need to be ignored or I was counting them separately as "external". Otherwise, the Image Base of 0x140000000 usually shouldn't change but it can be parsed from the PE file as so:

var pe_file = open(filepath, fmRead)
let sig_offset = pe_file.read(60, uint32).int      # Signature offset field is at fixed position of 0x3C bytes.
let coff_header = sig_offset + 4                   # COFF Header immediately follows the 4 byte signature in Image files.
let optional_header_pos = coff_header + 20         # Last COFF field "Characteristics" is at 18 bytes with a 2 byte width.
let optional_header_size_pos = coff_header + 16    # Offset 16 bytes within the COFF header.
let signature = pe_file.read(sig_offset, array[4, char])
check_signature:  ['P', 'E', '\0', '\0']
let optional_header_size  = pe_file.read(optional_header_size_pos, uint16)
check_optional_header_size:  optional_header_size == 0
let optional_header_magic = pe_file.read(optional_header_pos, uint16)
check_optional_header_magic:  optional_header_magic in [MAGIC_PE32, MAGIC_PE32_PLUS, MAGIC_ROM]

var
    image_base_pos: int
    image_base: uint64

if optional_header_magic == MAGIC_PE32:
    image_base_pos = optional_header_pos + 28        # ImageBase : PE32 has an extra 4-byte field (BaseOfData) over PE32+
elif optional_header_magic == MAGIC_PE32_PLUS:
    image_base_pos = optional_header_pos + 24        # ImageBase : PE32 has an extra 4-byte field (BaseOfData) over PE32+

image_base = cast[uint64](pe_file.read(image_base_pos, uint32))

EDIT: There still seems to be something weird going on here with the addresses though.