Modern linux types can sometimes just be levels of indirection

ilch1 commented 4 years ago

Re: volatilityfoundation/volatility3#151

ilch1 commented 4 years ago

I created the following example C structure:

struct outer_struct {
    struct {
        int field1;
        int field2;
        struct {
            int field3;
            int field4;
        };
    };
    int field5;
};

The generated DWARF information (dwarfdump output) is:

0x00000086:   DW_TAG_structure_type
                DW_AT_name  ("outer_struct")
                DW_AT_byte_size (0x14)
                DW_AT_decl_file ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                DW_AT_decl_line (3)

0x0000008e:     DW_TAG_member
                  DW_AT_type    (0x00000096 "structure ")
                  DW_AT_decl_file   ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                  DW_AT_decl_line   (4)
                  DW_AT_data_member_location    (0x00)

0x00000096:     DW_TAG_structure_type
                  DW_AT_byte_size   (0x10)
                  DW_AT_decl_file   ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                  DW_AT_decl_line   (4)

0x0000009a:       DW_TAG_member
                    DW_AT_name  ("field1")
                    DW_AT_type  (0x0000006e "int")
                    DW_AT_decl_file ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                    DW_AT_decl_line (5)
                    DW_AT_data_member_location  (0x00)

0x000000a6:       DW_TAG_member
                    DW_AT_name  ("field2")
                    DW_AT_type  (0x0000006e "int")
                    DW_AT_decl_file ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                    DW_AT_decl_line (6)
                    DW_AT_data_member_location  (0x04)

0x000000b2:       DW_TAG_member
                    DW_AT_type  (0x000000ba "structure ")
                    DW_AT_decl_file ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                    DW_AT_decl_line (7)
                    DW_AT_data_member_location  (0x08)

0x000000ba:       DW_TAG_structure_type
                    DW_AT_byte_size (0x08)
                    DW_AT_decl_file ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                    DW_AT_decl_line (7)

0x000000be:         DW_TAG_member
                      DW_AT_name    ("field3")
                      DW_AT_type    (0x0000006e "int")
                      DW_AT_decl_file   ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                      DW_AT_decl_line   (8)
                      DW_AT_data_member_location    (0x00)

0x000000ca:         DW_TAG_member
                      DW_AT_name    ("field4")
                      DW_AT_type    (0x0000006e "int")
                      DW_AT_decl_file   ("/Users/ilya/git/github.com/volatilityfoundation/dwarf2json/test/anonymous_types.c")
                      DW_AT_decl_line   (9)
                      DW_AT_data_member_location    (0x04)

The DWARF information is processed iteratively by dwarf2json. Thus, the definition of an anonymous structure may not be known when it is referenced. In the example above, the definitions of the anonymous structures embedded in outer_struct have not been processed when they are first encountered. In order to collapse anonymous structures, the processing would need to be made recursive, which is not trivial. Another option is to make the processing multi-pass, where in the second pass the anonymous structure references are replaced by the flattened instance.

We should discuss if this is better solved on the consumer side. It looks like gdb/lldb solve it that way. The fact that the structure is anonymous (does not have an identifier) could be preserved by dwarf2json and used by the consumer to correctly expose the fields contained by the anonymous structure.
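A minimal sketch of the multi-pass idea, assuming a simplified, hypothetical record format (the `ENTRIES` tuples, `collect`, and `flatten` are illustrative names, not dwarf2json's actual data structures or functions). The first pass only records definitions, so a reference that appears before its definition is not a problem; the second pass replaces each unnamed member of an unnamed struct type with that struct's fields, adjusting offsets.

```python
# Hypothetical stand-ins for DWARF entries: (type_id, name, members),
# where each member is (field_name, type_ref, offset). A name of None
# means the struct or member has no identifier.
ENTRIES = [
    ("outer", "outer_struct", [(None, "anon1", 0), ("field5", "int", 16)]),
    ("anon1", None, [("field1", "int", 0), ("field2", "int", 4), (None, "anon2", 8)]),
    ("anon2", None, [("field3", "int", 0), ("field4", "int", 4)]),
]


def collect(entries):
    """Pass 1: record every definition without following references."""
    return {type_id: (name, members) for type_id, name, members in entries}


def flatten(type_id, table, base=0):
    """Pass 2: expand unnamed members whose type is an unnamed struct."""
    _, members = table[type_id]
    fields = {}
    for field_name, type_ref, offset in members:
        target = table.get(type_ref)  # None for base types such as "int"
        if field_name is None and target is not None and target[0] is None:
            # Anonymous member: pull its fields up, shifted by its offset.
            fields.update(flatten(type_ref, table, base + offset))
        else:
            fields[field_name] = (type_ref, base + offset)
    return fields


table = collect(ENTRIES)
flat = flatten("outer", table)
```

With the example structure, `flat` maps `field3` to offset 8 and `field4` to offset 12, which matches the member locations in the DWARF output (0x08 plus 0x00 and 0x04).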

ikelos commented 4 years ago

I'd be ok doing that, but at the moment the JSON has no strictly defined means of indicating whether a member is anonymous or not. We have the DWARF generator producing unnamed_field_<id>, for pdbconv we hark back to the previous volatility and refer to both anonymous and unnamed (which in the windows world are seemingly different) as __anonymous_<id> or __unnamed_<id> respectively. Which means we either need to include an additional field in the schema (entirely doable) or we keep the format as defined and we do the condensing in a second pass (slightly lossy in terms of data). It sounds like adding a field would be useful, but I'm interested why we a) haven't run into this before and b) haven't run into this on windows yet? :S

Definitely something we can discuss further at the next meeting...

npetroni commented 4 years ago

Can you paste the current dwarf2json output for the above example?

npetroni commented 4 years ago

Can you summarize the algorithm(s) used by lldb and gdb for this scenario?

ilch1 commented 4 years ago

Can you paste the current dwarf2json output for the above example?

"outer_struct": {
  "size": 20,
  "fields": {
    "field5": {
      "type": {
        "kind": "base",
        "name": "int"
      },
      "offset": 16
    },
    "unnamed_field_0": {
      "type": {
        "kind": "struct",
        "name": "unnamed_a35d783f54979948"
      },
      "offset": 0
    }
  },
  "kind": "struct"
},
"unnamed_a35d783f54979948": {
  "size": 16,
  "fields": {
    "field1": {
      "type": {
        "kind": "base",
        "name": "int"
      },
      "offset": 0
    },
    "field2": {
      "type": {
        "kind": "base",
        "name": "int"
      },
      "offset": 4
    },
    "unnamed_field_8": {
      "type": {
        "kind": "struct",
        "name": "unnamed_e43b13834081c6ac"
      },
      "offset": 8
    }
  },
  "kind": "struct"
},
"unnamed_e43b13834081c6ac": {
  "size": 8,
  "fields": {
    "field3": {
      "type": {
        "kind": "base",
        "name": "int"
      },
      "offset": 0
    },
    "field4": {
      "type": {
        "kind": "base",
        "name": "int"
      },
      "offset": 4
    }
  },
  "kind": "struct"
}

ilch1 commented 4 years ago

Here is an example of the output with anonymous field:

    "outer_struct": {
      "size": 20,
      "fields": {
        "field5": {
          "type": {
            "kind": "base",
            "name": "int"
          },
          "offset": 16
        },
        "unnamed_field_0": {
          "type": {
            "kind": "struct",
            "name": "unnamed_a35d783f54979948"
          },
          "offset": 0,
          "anonymous": true
        }
      },
      "kind": "struct"
    },
    "unnamed_a35d783f54979948": {
      "size": 16,
      "fields": {
        "field1": {
          "type": {
            "kind": "base",
            "name": "int"
          },
          "offset": 0
        },
        "field2": {
          "type": {
            "kind": "base",
            "name": "int"
          },
          "offset": 4
        },
        "unnamed_field_8": {
          "type": {
            "kind": "struct",
            "name": "unnamed_e43b13834081c6ac"
          },
          "offset": 8,
          "anonymous": true
        }
      },
      "kind": "struct"
    },
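On the consumer side, the `anonymous` flag could be used to resolve a field name by descending into anonymous members on demand, rather than rewriting the type table. The following is only a sketch under that assumption: `find_field` is a hypothetical helper, not volatility3's API, and the `SYMBOLS` dictionary abbreviates the output above (the innermost struct is taken from the earlier, unflagged output).

```python
INT = {"kind": "base", "name": "int"}

# Abbreviated copy of the dwarf2json output shown above.
SYMBOLS = {
    "outer_struct": {
        "size": 20,
        "kind": "struct",
        "fields": {
            "field5": {"type": INT, "offset": 16},
            "unnamed_field_0": {
                "type": {"kind": "struct", "name": "unnamed_a35d783f54979948"},
                "offset": 0,
                "anonymous": True,
            },
        },
    },
    "unnamed_a35d783f54979948": {
        "size": 16,
        "kind": "struct",
        "fields": {
            "field1": {"type": INT, "offset": 0},
            "field2": {"type": INT, "offset": 4},
            "unnamed_field_8": {
                "type": {"kind": "struct", "name": "unnamed_e43b13834081c6ac"},
                "offset": 8,
                "anonymous": True,
            },
        },
    },
    "unnamed_e43b13834081c6ac": {
        "size": 8,
        "kind": "struct",
        "fields": {
            "field3": {"type": INT, "offset": 0},
            "field4": {"type": INT, "offset": 4},
        },
    },
}


def find_field(types, struct_name, wanted, base=0):
    """Return (absolute_offset, type) for `wanted`, or None if not found.

    Fields flagged "anonymous" are searched recursively, and their own
    offset is added to the offsets of the fields they contain.
    """
    fields = types[struct_name]["fields"]
    if wanted in fields:
        field = fields[wanted]
        return base + field["offset"], field["type"]
    for field in fields.values():
        if field.get("anonymous"):
            found = find_field(
                types, field["type"]["name"], wanted, base + field["offset"]
            )
            if found is not None:
                return found
    return None
```

For example, `find_field(SYMBOLS, "outer_struct", "field4")` resolves through both anonymous levels and yields an absolute offset of 12.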

ikelos commented 4 years ago

Cool, that looks like what I expected, now we just need to check if I made the schema correctly and if the code to back it up works... ;) I think it was mm_struct that was the key example?

ilch1 commented 4 years ago

The latest commit fixes the metadata to be compatible with schema 6.2.0 in the issue151-flatten-anonymous branch of volatility3.

The new metadata for Linux will look like:

  "metadata": {
    "linux": {
      "elf_symbols": true,
      "elf_buildid": "130921b08a47907e6701bc7fc1a0253b00aab68b",
      "dwarf_symbols": true,
      "dwarf_types": true,
      "dwarf_buildid": "130921b08a47907e6701bc7fc1a0253b00aab68b"
    },
    "producer": {
      "name": "dwarf2json",
      "version": "0.6.0"
    },
    "format": "6.2.0"
  },

The new metadata for Mac will look like:

  "metadata": {
    "mac": {
      "macho_symbols": true,
      "macho_uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF",
      "dwarf_types": true,
      "dwarf_symbols": true,
      "dwarf_uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF"
    },
    "producer": {
      "name": "dwarf2json",
      "version": "0.6.0"
    },
    "format": "6.2.0"
  },

ikelos commented 4 years ago

Cool, would these be better in a namespace (so something like:

"mac": {
  "macho": {
    "symbols": true,
    "uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF"
  },
  "dwarf": {
    "types": true,
    "symbols": true,
    "uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF"
  }
}

and obviously the same for linux)?

ikelos commented 4 years ago

Also, are there other fields we'd want to add, or are we happy that these are the main ones we'll need (as in, can I block off additional properties to the elf/macho/dwarf groups or should I leave them open to add additional fields)? I'd prefer to have everything well defined, but happy to leave it to you guys to make a decision... :)

ikelos commented 4 years ago

Thanks for the schema change, it looked good. I've pushed up some additional schema changes that codify the examples you provided but using sub-namespaces. Should be easy to change them if there's a good reason not to use the hierarchy, but as is hopefully it'll be straightforward to make the change in dwarf2json.

Also, a question about the macho uuid and the dwarf uuid, is there ever a time they can be different values and/or one could be not present whilst the other one is? I'm just wondering whether storing it twice is beneficial or could lead to inconsistencies (if they're never supposed to be different, but in the file, they are)?

ilch1 commented 4 years ago

Cool, would these be better in a namespace (so something like:

"mac": {
  "macho": {
    "symbols": true,
    "uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF"
  },
  "dwarf": {
    "types": true,
    "symbols": true,
    "uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF"
  }
}

and obviously the same for linux)?

I like the suggestion of using hierarchical namespaces.

Also, are there other fields we'd want to add, or are we happy that these are the main ones we'll need (as in, can I block off additional properties to the elf/macho/dwarf groups or should I leave them open to add additional fields)? I'd prefer to have everything well defined, but happy to leave it to you guys to make a decision... :)

I'm not sure. I'd like to discuss this schema with @npetroni, and then I'll get back to you.

Also, a question about the macho uuid and the dwarf uuid, is there ever a time they can be different values and/or one could be not present whilst the other one is? I'm just wondering whether storing it twice is beneficial or could lead to inconsistencies (if they're never supposed to be different, but in the file, they are)?

Yes, a user can select a macho file that does not match the dwarf file (they were compiled separately or from different source), in which case the UUID values would be different. In fact, capturing that in the symbols metadata would be helpful in debugging any potential issues because of the mismatch. The same idea applies to linux elf files.
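As an illustration of how a consumer could surface such a mismatch, here is a small sketch that works on the flat metadata layout posted earlier in this thread (`macho_uuid`/`dwarf_uuid` and `elf_buildid`/`dwarf_buildid`). The helper name `id_mismatches` and the `ID_KEYS` table are assumptions made for this example, as is the choice to compare identifiers case-insensitively.

```python
# Pairs of identifier keys per OS section, taken from the metadata
# examples earlier in this thread.
ID_KEYS = {
    "mac": ("macho_uuid", "dwarf_uuid"),
    "linux": ("elf_buildid", "dwarf_buildid"),
}


def id_mismatches(metadata):
    """Return (os_name, id_a, id_b) for each section whose two ids differ.

    A section is skipped when either identifier is absent, since a
    missing value is not by itself evidence of a mismatch.
    """
    problems = []
    for os_name, (key_a, key_b) in ID_KEYS.items():
        section = metadata.get(os_name, {})
        id_a, id_b = section.get(key_a), section.get(key_b)
        if id_a is None or id_b is None:
            continue
        # Assumption: identifiers are hex strings, so case is ignored.
        if id_a.lower() != id_b.lower():
            problems.append((os_name, id_a, id_b))
    return problems


matching = {
    "mac": {
        "macho_symbols": True,
        "macho_uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF",
        "dwarf_types": True,
        "dwarf_symbols": True,
        "dwarf_uuid": "C8FBE733-0FE1-3C84-AC87-2085A51904EF",
    }
}
print(id_mismatches(matching))
```

A non-empty result would be a hint that the symbol file and the DWARF file came from different builds, which is the debugging aid described above.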

ilch1 commented 4 years ago

The following is a modification of the original proposal. The mac/linux section has 2 lists: symbols and types. Each entry in the list has the following fields: kind, sha256, and name. Below is an example for mac:

    "mac": {
      "symbols": [
        {
          "kind": "dwarf",
          "name": "somefile",
          "sha256": "d80566ab70265665c4144485d4d896b8405fdd0d2c9675b4be427b0e4c07086b"
        },
        {
          "kind": "symtab",
          "name": "somefile",
          "sha256": "d80566ab70265665c4144485d4d896b8405fdd0d2c9675b4be427b0e4c07086b"
        }
      ],
      "types": [
        {
          "kind": "dwarf",
          "name": "somefile",
          "sha256": "d80566ab70265665c4144485d4d896b8405fdd0d2c9675b4be427b0e4c07086b"
        }
      ]
    },

Here is an example for linux:

    "linux": {
      "symbols": [
        {
          "kind": "dwarf",
          "name": "module.ko",
          "sha256": "299d6f6f1821c15d109fad0a651e0e2a55cb2ce70340cf3c09e82f4f757b8449"
        },
        {
          "kind": "symtab",
          "name": "module.ko",
          "sha256": "299d6f6f1821c15d109fad0a651e0e2a55cb2ce70340cf3c09e82f4f757b8449"
        },
        {
          "kind": "system-map",
          "name": "System.map-4.15.0-66-generic",
          "sha256": "d1001d271b33b64afbab7fcb5993a9dcf3e4c19d0bc71ca8148035a24bb27f4e"
        }
      ],
      "types": [
        {
          "kind": "dwarf",
          "name": "module.ko",
          "sha256": "299d6f6f1821c15d109fad0a651e0e2a55cb2ce70340cf3c09e82f4f757b8449"
        }
      ]
    },

This code is available in the issue-11-anonymous-types branch.

ikelos commented 4 years ago

Hmmm, so I like the layout, but I'd suggest we come up with all the possible "kinds" we'll support (at the moment dwarf, symtab and system-map) and then I'd probably have each item being something like:

"thing" : [
  {
    "kind": "dwarf",
    "name": "dwarfthing",
    "hash_type": "sha256",
    "hash_value": "23984320985320498532049853..."
  }
]

That would allow us to upgrade supported hashes over the supported versions of the schema without forcing/mandating only one. There are other ways of expressing it, I just want to think through how we'd intend to move to a newer hash as the old one becomes insecure (I assume the reason for sha256 is the worry that someone could create a malicious copy of the file that would be mistaken for it, otherwise we could just use md5 if all we're trying to do is distinguish the files absent of an attacker). Anyway, lemme know what you think, whether I'm overcomplicating it, whatever. I've mocked up the changes in the branch in vol3...
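To make the `hash_type`/`hash_value` idea concrete, here is a rough sketch of producing and checking such an entry with Python's standard `hashlib`. The helpers `make_entry` and `entry_matches` are hypothetical names for this example and are not part of dwarf2json or volatility3; the entry layout follows the mock-up above.

```python
import hashlib


def make_entry(kind, name, data, hash_type="sha256"):
    """Build a metadata entry whose hash algorithm is named explicitly."""
    digest = hashlib.new(hash_type, data).hexdigest()
    return {
        "kind": kind,
        "name": name,
        "hash_type": hash_type,
        "hash_value": digest,
    }


def entry_matches(entry, data):
    """Recompute the digest with the entry's own algorithm and compare."""
    digest = hashlib.new(entry["hash_type"], data).hexdigest()
    return digest == entry["hash_value"]


contents = b"example file contents"
entry = make_entry("dwarf", "dwarfthing", contents)
print(entry_matches(entry, contents))
```

Because the algorithm name travels with each entry, a checker written this way does not need to change when a producer starts emitting a different hash, as long as `hashlib` supports it.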

ilch1 commented 4 years ago

I'd suggest we come up with all the possible "kinds" we'll support (at the moment dwarf, symtab and system-map)

Yes, dwarf, symtab and system-map are the known "kinds". I do not know if/when there will be additional ones.

As for the hash_type and hash_value organization, I can make the suggested change. I'm not sure it is necessary. I agree that we could use md5 or sha1 instead of sha256. I do not think we would need to switch to a different hash for a long time (or ever), since, as you've pointed out, all we're trying to do is distinguish the files absent of an attacker.

ikelos commented 4 years ago

Ok, well, let's leave it for future compatibility just in case (you never know, some systems may stop supporting md5 or sha1 one day!), and then both this branch and the branch in vol3 should be in sync. What more needs to happen before we merge?

ilch1 commented 4 years ago

We need to make sure the output of dwarf2json is compatible with the schema changes in vol3. We can merge after that.

ilch1 commented 4 years ago

Fixed in #13.

volatilityfoundation / dwarf2json

Modern linux types can sometimes just be levels of indirection #11