moyix / pdbparse

Python code to parse Microsoft PDB files
Other
316 stars 84 forks source link

examples/pdb_tpi_vtypes.py produces invalid Python #39

Closed npetroni closed 7 years ago

npetroni commented 7 years ago

examples/pdb_tpi_vtypes.py produces output that can't be consumed by Python. Partial output:

ntkrpamp_types = {
  'b'LIST_ENTRY64'' : [ 0x10, {
    'b'Flink'' : [ 0x0, ['unsigned long long']],
    'b'Blink'' : [ 0x8, ['unsigned long long']],
} ],
  'b'LIST_ENTRY32'' : [ 0x8, {
...
npetroni commented 7 years ago

pdb_tpi_vtypes-output.txt

moyix commented 7 years ago

Just pushed a commit that should fix this, could you try it on your PDB collection? Seemed to work with the ntkrnlmp.pdb I had sitting around.

npetroni commented 7 years ago

Thanks! It looks like this version parses, but all of the unnamed tags reduce to '<unnamed-tag>'. Hopefully that's an easy fix. Then I'd be happy to run it across my collection.

npetroni commented 7 years ago

pdb_tpi_vtypes-output-de06ea2.txt

moyix commented 7 years ago

OK, I've fixed up the unnamed bit. More bytes/string madness :(

npetroni commented 7 years ago

Cool, getting closer :). Looks like enum keys for some numbers are off (b'\xff' vs -1).

Example (old on top, new on bottom):

<     'InterfaceType' : [ 0x12c, ['Enumeration', dict(target = 'long', choices = {0: 'Internal', 1: 'Isa', 2: 'Eisa', 3: 'MicroChannel', 4: 'TurboChannel', 5: 'PCIBus', 6: 'VMEBus', 7: 'NuBus', 8: 'PCMCIABus', 9: 'CBus', 10: 'MPIBus', 11: 'MPSABus', 12: 'ProcessorInternal', 13: 'InternalPowerBus', 14: 'PNPISABus', 15: 'PNPBus', 16: 'Vmcs', 17: 'ACPIBus', 18: 'MaximumInterfaceType', -1: 'InterfaceTypeUndefined'})]],
---
>     'InterfaceType' : [ 0x12c, ['Enumeration', dict(target = 'long', choices = {0: 'Internal', 1: 'Isa', 2: 'Eisa', 3: 'MicroChannel', 4: 'TurboChannel', 5: 'PCIBus', 6: 'VMEBus', 7: 'NuBus', 8: 'PCMCIABus', 9: 'CBus', 10: 'MPIBus', 11: 'MPSABus', 12: 'ProcessorInternal', 13: 'InternalPowerBus', 14: 'PNPISABus', 15: 'PNPBus', 16: 'Vmcs', 17: 'ACPIBus', 18: 'MaximumInterfaceType', b'\xff': 'InterfaceTypeUndefined'})]],

Interestingly, it looks like there was something similarly odd even in the old version (old on top, new on bottom):

<     'State' : [ 0x20, ['Enumeration', dict(target = 'long', choices = {0: 'LdrModulesPlaceHolder', 1: 'LdrModulesMapping', 2: 'LdrModulesMapped', 3: 'LdrModulesWaitingForDependencies', 4: 'LdrModulesSnapping', 5: 'LdrModulesSnapped', 6: 'LdrModulesCondensed', 7: 'LdrModulesReadyToInit', 8: 'LdrModulesInitializing', 9: 'LdrModulesReadyToRun', '\xfb': 'LdrModulesMerged', '\xfd': 'LdrModulesSnapError', '\xfc': 'LdrModulesInitError', -1: 'LdrModulesUnloading', '\xfe': 'LdrModulesUnloaded'})]],
---
>     'State' : [ 0x20, ['Enumeration', dict(target = 'long', choices = {0: 'LdrModulesPlaceHolder', b'\xfc': 'LdrModulesInitError', 2: 'LdrModulesMapped', 3: 'LdrModulesWaitingForDependencies', 4: 'LdrModulesSnapping', 5: 'LdrModulesSnapped', 6: 'LdrModulesCondensed', 1: 'LdrModulesMapping', 8: 'LdrModulesInitializing', 9: 'LdrModulesReadyToRun', b'\xfe': 'LdrModulesUnloaded', b'\xff': 'LdrModulesUnloading', b'\xfb': 'LdrModulesMerged', 7: 'LdrModulesReadyToInit', b'\xfd': 'LdrModulesSnapError'})]],

Do we expect the keys '\xfd' through '\xfe' to be numbers? Several of the mainline Volatility profiles have this exact feature, so it either works fine or is not critical...

moyix commented 7 years ago

I've committed a patch that should fix this. It changes the interpretation of 'char' to be a signed 8-bit integer rather than a string of length 1. That should also cover the \xfd and \xfe cases too.

I also noticed and fixed a bug in how array sizes were calculated, which was causing some arrays to claim they had a negative number of elements (you can see these in the Volatility profiles), so you will see a small change to some of the Volatility profiles (for the better!).

npetroni commented 7 years ago

Sweet, both fixes look great. Nice catch on the arrays.

I ran it over my corpus and found that some enumerations are missing in the new version. Here's a snippet from an example PDB, ntkrnlmp.pdb/C0F42DF3DB264AC4AD62602E3961A1E72/ntkrnlmp.pdb (old on top, new on bottom):

<     'InterfaceType' : [ 0xdc, ['Enumeration', dict(target = 'long', choices = {0: 'Internal', 1: 'Isa', 2: 'Eisa', 3: 'MicroChannel', 4: 'TurboChannel', 5: 'PCIBus', 6: 'VMEBus', 7: 'NuBus', 8: 'PCMCIABus', 9: 'CBus', 10: 'MPIBus', 11: 'MPSABus', 12: 'ProcessorInternal', 13: 'InternalPowerBus', 14: 'PNPISABus', 15: 'PNPBus', 16: 'Vmcs', 17: 'MaximumInterfaceType', -1: 'InterfaceTypeUndefined'})]],
---
>     'InterfaceType' : [ 0xdc, ['Enumeration', dict(target = 'long', choices = {})]],
2582c2582
<     'ChildInterfaceType' : [ 0xe4, ['Enumeration', dict(target = 'long', choices = {0: 'Internal', 1: 'Isa', 2: 'Eisa', 3: 'MicroChannel', 4: 'TurboChannel', 5: 'PCIBus', 6: 'VMEBus', 7: 'NuBus', 8: 'PCMCIABus', 9: 'CBus', 10: 'MPIBus', 11: 'MPSABus', 12: 'ProcessorInternal', 13: 'InternalPowerBus', 14: 'PNPISABus', 15: 'PNPBus', 16: 'Vmcs', 17: 'MaximumInterfaceType', -1: 'InterfaceTypeUndefined'})]],
---
>     'ChildInterfaceType' : [ 0xe4, ['Enumeration', dict(target = 'long', choices = {})]],
2616c2616
<     'AllocationType' : [ 0x8, ['Enumeration', dict(target = 'long', choices = {0: 'ArbiterRequestLegacyReported', 1: 'ArbiterRequestHalReported', 2: 'ArbiterRequestLegacyAssigned', 3: 'ArbiterRequestPnpDetected', 4: 'ArbiterRequestPnpEnumerated', -1: 'ArbiterRequestUndefined'})]],
---
>     'AllocationType' : [ 0x8, ['Enumeration', dict(target = 'long', choices = {})]],
2627c2627

Everything else looks good so far but I'll do a more thorough check tomorrow.

moyix commented 7 years ago

Well this looks like an exciting bug (or I'm doing something really stupid). It seems like there's something nondeterministic going on. Over 10 runs it seems to succeed around half the time:

for i in {1..10}; do
    pdb_tpi_vtypes.py ntkrnlmp.pdb | grep ArbiterRequestLegacyReported >/dev/null && echo "Found" || echo "Not found"
done

Produces:

Found
Found
Found
Found
Found
Not found
Not found
Not found
Found
Not found

All I can think of is that there might be some code in there that implicitly depends on dictionary ordering, which seems to be random as of Python3. But I don't know immediately where that would be happening; I'll have to take a closer look tomorrow.

lucasg commented 7 years ago

The issue is how forward references are merged.

When it's decoded "correctly" :

'_PNP_RESOURCE_REQUEST' : [ 0x28, {
Container:
    leaf_type = 'LF_MEMBER'
    fldattr = Container:
        noconstruct = False
        noinherit = False
        pseudo = False
        mprop = 'MTvanilla'
        access = 'public'
        compgenx = False
    index = Container:
        length = 38
        tpi_idx = 5647
        leaf_type = 'LF_ENUM'
        count = 6
        prop = Container:
            fwdref = False
            opcast = False
            opassign = False
            cnested = False
            isnested = False
            ovlops = False
            ctor = False
            packed = False
            reserved = 0
            scoped = False
        utype = 'T_INT4'
        fieldlist = Container:
            length = 206
            tpi_idx = 5646
            leaf_type = 'LF_FIELDLIST'
            substructs = [
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestUndefined'
                    enum_value = -1
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestLegacyReported'
                    enum_value = 0
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestHalReported'
                    enum_value = 1
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestLegacyAssigned'
                    enum_value = 2
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestPnpDetected'
                    enum_value = 3
                Container:
                    leaf_type = 'LF_ENUMERATE'
                    fldattr = Container:
                        noconstruct = False
                        noinherit = False
                        pseudo = False
                        mprop = 'MTvanilla'
                        access = 'public'
                        compgenx = False
                    name = 'ArbiterRequestPnpEnumerated'
                    enum_value = 4
            ]
        name = '_ARBITER_REQUEST_SOURCE'
    name = 'AllocationType'
    offset = 8
    'AllocationType' : [ 0x8, ['Enumeration', dict(target = 'long', choices = {0: 'ArbiterRequestLegacyReported', 1: 'ArbiterRequestHalReported', 2: 'ArbiterRequestLegacyAssigned', 3: 'ArbiterRequestPnpDetected', 4: 'ArbiterRequestPnpEnumerated', -1: 'ArbiterRequestUndefined'})]],

and "incorrectly" :

'_PNP_RESOURCE_REQUEST' : [ 0x28, {
Container:
    leaf_type = 'LF_MEMBER'
    fldattr = Container:
        noconstruct = False
        noinherit = False
        pseudo = False
        mprop = 'MTvanilla'
        access = 'public'
        compgenx = False
    index = Container:
        length = 38
        tpi_idx = 5647
        leaf_type = 'LF_ENUM'
        count = 6
        prop = Container:
            fwdref = False
            opcast = False
            opassign = False
            cnested = False
            isnested = False
            ovlops = False
            ctor = False
            packed = False
            reserved = 0
            scoped = False
        utype = 'T_INT4'
        fieldlist = Container:
            length = 206
            tpi_idx = 5646
            leaf_type = 'LF_FIELDLIST'
            substructs = []
        name = '_ARBITER_REQUEST_SOURCE'
    name = 'AllocationType'
    offset = 8
    'AllocationType' : [ 0x8, ['Enumeration', dict(target = 'long', choices = {})]],

The only difference in the metadata is the tpi_index : 5647 (bad) or 5646 (good).

Microsoft cvdump return the following informations :

cvdump.exe -t F:\Symbols\ntkrnlmp.pdb\C0F42DF3DB264AC4AD62602E3961A1E72\ntkrnlmp.pdb | Select-String _PNP_RESOURCE_REQUEST -Context 5

 Element type : 0x1607

  0x160c : Length = 42, Leaf = 0x1505 LF_STRUCTURE
        # members = 0,  field list type 0x0000, FORWARD REF,
        Derivation list type 0x0000, VT shape type 0x0000
>       Size = 0, class name = _PNP_RESOURCE_REQUEST, UDT(0x00001611)

  0x160d : Length = 10, Leaf = 0x1002 LF_POINTER
        Pointer (NEAR32), Size: 0
        Element type : 0x160C

                member name = 'Status'

  0x1611 : Length = 42, Leaf = 0x1505 LF_STRUCTURE
        # members = 10,  field list type 0x1610,
        Derivation list type 0x0000, VT shape type 0x0000
>       Size = 40, class name = _PNP_RESOURCE_REQUEST, UDT(0x00001611)

  0x1612 : Length = 38, Leaf = 0x1505 LF_STRUCTURE
        # members = 0,  field list type 0x0000, FORWARD REF,
        Derivation list type 0x0000, VT shape type 0x0000
        Size = 0, class name = _IO_RESOURCE_LIST, UDT(0x00001d56)

I'm pretty sure sometimes we return the forward reference instead of the real structure. And there is probably a off-by-one error somwhere since (0x1610 == 5648 , 0x1611 == 5649 ).

lucasg commented 7 years ago

Ok scratch what I've just said : this is a construct bug (or the way we use it) and the way we handle the -1 enum value.

Minimal regress test :

enum_list = b'\x02\x15\x03\x00\x00\x80\xffArbiterRequestUndefined\x00\xf1'
print(lfFieldList.parse(enum_list))
(lin_env_3) lucasg@DESKTOP-822EGME:/mnt/f/Dev/pdbparse$ python pdbparse/tpi.py
Container:
    substructs = []
(lin_env_3) lucasg@DESKTOP-822EGME:/mnt/f/Dev/pdbparse$ python pdbparse/tpi.py
Container:
    substructs = [
        Container:
            leaf_type = 'LF_ENUMERATE'
            substructs_data_ptr = 2
            test = None
            type_info = Container:
                fldattr = Container:
                    noconstruct = False
                    noinherit = False
                    pseudo = False
                    mprop = 'MTvanilla'
                    access = 'public'
                    compgenx = False
                enum_value_data_ptr = 4
                test = None
                value = Container:
                    value_or_type = 32768
                    name_or_val = Container:
                        value = -1
                        name = 'ArbiterRequestUndefined'
    ]

The result is not deterministic and coherent with @moyix observations.

lucasg commented 7 years ago

Okay found the culprit :

    LF_NUMERIC          = 0x8000,
    LF_CHAR             = 0x8000,
    LF_SHORT            = 0x8001,
    LF_USHORT           = 0x8002,

I copied the leaf types values from microsoft pdb cvinfo.h , without realising there were duplicates for macros uses. If it's not an issue in C, it is one for construct.Enum. Since the val switch rely on 4 supported values (LF_CHAR, LF_SHORT, LF_USHORT, LF_LONG,LF_ULONG) sometimes, based on the dict ordering, it would select LF_NUMERIC instead of LF_CHAR and return an empty enum list since the parsing fail (no default parser defined).

Here the diff that solve the issue :

diff --git a/pdbparse/tpi.py b/pdbparse/tpi.py index b3d32d5..d4f92a1 100755

--- a/pdbparse/tpi.py
+++ b/pdbparse/tpi.py
@@ -617,10 +617,10 @@ leaf_type = Enum(ULInt16("leaf_type"),
     LF_MATRIX           = 0x151c,

     LF_VFTABLE          = 0x151d,      # a virtual function table
-    LF_ENDOFLEAFRECORD  = 0x151d,
+    # LF_ENDOFLEAFRECORD  = 0x151d,

     LF_TYPE_LAST        = 0x151d + 1, # one greater than the last type record
-    LF_TYPE_MAX         = (0x151d + 1) - 1,
+    # LF_TYPE_MAX         = (LF_TYPE_LAST) - 1,

     LF_FUNC_ID          = 0x1601,    # global func ID
     LF_MFUNC_ID         = 0x1602,    # member func ID
@@ -635,9 +635,9 @@ leaf_type = Enum(ULInt16("leaf_type"),
                                      # only generated by linker

     LF_ID_LAST          = 0x1607 + 1, # one greater than the last ID record
-    LF_ID_MAX           = (0x1607 - 1) - 1,
+    # LF_ID_MAX           = (LF_ID_MAX) - 1,

-    LF_NUMERIC          = 0x8000,
+    # LF_NUMERIC          = 0x8000,
     LF_CHAR             = 0x8000,
     LF_SHORT            = 0x8001,
     LF_USHORT           = 0x8002,

Since I had to heavily operate on my local version of pdbparse in order to track and debug this issue, @moyix I would appreciate if you could commit this fix from your side.

L.

moyix commented 7 years ago

That fix looks good! I don't believe anything is using LF_ENDOFLEAFRECORD, LF_TYPE_MAX, LF_ID_MAX, or LF_NUMERIC, so this should be okay. The first three are presumably just used for array sizes in C, and the last one is presumably used as a bitmask to check if something is a numeric type. So there might be an argument for keeping LF_NUMERIC around as a separate global constant somewhere, but unless someone complains let's leave it out for now.

Thanks for tracking this down!

@npetroni, hopefully this is the last of the compatibility issues affecting Volatility profiles :)

npetroni commented 7 years ago

Based on my testing, everything looks good as compared to the old version. Thanks for the quick turnaround on these. I think we can close this issue.

I did turn up a few problems that exist in both the old and new versions, but they aren't impacting anyone AFAIK. They only manifest themselves on a small number of the PDBs that I have, none of which are important to me at the moment.

moyix commented 7 years ago

Great! Thanks for all your help testing.