oasis-tcs / sarif-spec

OASIS SARIF TC: Repository for development of the draft standard, where requests for modification should be made via Github Issues
https://github.com/oasis-tcs/sarif-spec
Other
164 stars 46 forks source link

Add address property to physicalLocation object #302

Closed fishoak closed 5 years ago

fishoak commented 5 years ago

Binary analysis tools may report issues in terms of an effective address. These are not physical locations (as we have been thinking of them) and none of the options that the regions offer suffice to express such addresses adequately. Addresses are not offsets in any meaningful sense. One could shoe-horn them into location.fullyQualifiedLogicalName using kind="address", but that feels unsatisfactory.

The only place in the spec where an address is allowed is in a stackFrame object, but note that stackFrame.location exists too.

Consequently, I propose that we move the address object into the location. It could either go in as a new property location.address, or it could be pushed down into either the physical or logical location properties. Conceptually, it is neither, but I concede that an address is more like a physical thing than a logical thing, so it could go into a new optional property of physicalLocation named "address".

ghost commented 5 years ago

I agree that it is neither fish nor fowl, so it belongs on the location object. @michaelcfanning, do you agree?

michaelcfanning commented 5 years ago

location object could contain this, potentially, or perhaps some other property. stack frames do have this notion. some tools provide a relative virtual address against some other unknown base. to be discussed at the f2f.

fishoak commented 5 years ago

I asked some of our local binary analysis experts about this topic. Below are two responses:

From Tom Johnson:

On this particular topic, I'd say you almost definitely want addresses specified with respect to some base. And possibly allow multiple kinds of bases to be specified. Since the introduction of ASLR, most binaries these days are relocatable - and are actively relocated on purpose. That means dynamic analysis tools may see different addresses from one run to the next (unless one explicitly disables this). Even without ASLR, DLLs are relocated often and can even be relocated differently from one run to the next of the same program (e.g., consider dynamically loaded DLLs).

Also, if you're analyzing object files, there isn't really a fixed notion of absolute address in this context. Technically addresses have not been determined for object files. You can kind-of think of it as every section in an object file starts at address 0. CodeSonar deals with this by artificially loading the object files into a synthetic address space. Other tools may handle object files entirely differently. Certainly it's unlikely that other tools would have the same notion of artificial loading that CodeSonar does.

There are several common notions of address that you'll encounter:

  • "Effective address" - the absolute address in a program's logical memory that something gets loaded at. This is what most people think of when you say "address". And it is the concept that SWYX uses. But it will vary due to relocation operations as mentioned above. If you wanted to use this as the concept for Sarif, the "base" would probably be some notion a "default load". That is, you'd determine where an OS would put the program by default without any consideration for ASLR, some pre-specified resolution for relocation of conflicting DLLs, etc. Whatever absolute address you get there, that's the value you communicate here (instead of the actual address observed by a dynamic tool, say). Figuring this out in a canonical way could be non-trivial, and it's not something tools normally do when looking at something that has been relocated. So it might be a bit of a fuzzy/confusing concept to get right.

  • "Image or section offset" - this is the offset wrt some piece of the logical structure of the program when loaded into memory. Many static tools will know how to work with this concept. And if you say, "this is an address with respect to a base", this is the first type of thing that comes to mind. Whether it's an image or a section offset depends on what kind of artifacts you're dealing with. Section offsets are more natural when working with object files (which aren't combined in a full image yet). Image offsets show up more often with linked executables. However, that's by no means a firm rule. You can certainly talk about section offsets within a linked executable.

  • "File offset/address" - this is the absolute offset of a location within the binary file artifact, ignoring internal logical structure of the file. These are rare, but do show up on occasion. For example, the headers in a binary file often include file offsets here and there. The nice thing about a file offset is that you don't need to parse any of the structure of the file to know the location of what you might be interested in.

If forced to pick only one of these notions, the one that is likely going to allow the most reliable communication method would be section offset. Section offsets survive relocation operations. And section offsets generally make sense for both object files as well as linked executables and dynamic libraries. Most tools already know how to work with them.

The one caveat there is that section offsets cannot be used to locate objects that are not within a program's image. For example, if something is allocated on the heap or stack. Say something generates self-modifying code or something like that. Might be a high-class worry, but might motivate offering flexibility in the specification too.

From Eric Schulte:

The only thing I could add to Tom's response is to suggest that you use the name "address" instead of "effective address." We did a review of what most binary analysis tools use for this concept and "address" was the clear winner.

fishoak commented 5 years ago

Sorry, closed in error!

ghost commented 5 years ago

Thanks for Tom and Eric for that detailed analysis.

The only thing I will add is that SARIF already has a construct for the "file offset/address" concept: it is physicalLocation.region.byteOffset. This is what we would expect a tool to use if it wanted to report, for example, that there was something wrong in the file's header at offset 0xA0 from the start of the file.

I agree with you that "image or section offset" makes sense to express an address that might be relocated. We might imagine an address object that fits into the existing SARIF object model like this:

{                             # A SARIF location object.
  "physicalLocation": {
    "fileLocation": {
      "uri": "main.obj"
    },
    "address": {
      "section": ".TEXT",
      "offset": 1234
    }
  }
}

Tom, could you please explain the scenario where we'd need to specify multiple bases? If we need to support that, we could simply give the locations object an addresses property whose value is an array of address objects:

{                             # A SARIF location object.
  "physicalLocation": {
    "fileLocation": {
      "uri": "main.obj"
    },
    "addresses": [
      {
        "section": ".TEXT",
        "offset": 1234
      },
      ...
    ]
  }
}
tjohnson-gt commented 5 years ago

I'm not sure there's a need for multiple bases for a single location. My comment was that you might want to recognize multiple kinds of bases. And it sounds like you're already considering that with file-offset and section-offset.

So for a structured executable file, you have a clear notion of content organized by sections. And those sections could be relocated in memory at different locations at load time. So, a section-based offset makes sense here.

Another possible subject might be a ROM image for firmware that just gets directly injected into memory somewhere. There may be no internal structure to the image - it's just a block of bytes. In this case, the file-offset notion is appropriate.

We had an in-house discussion that in ELF binaries, there is also a notion of "segment". Typically a segment is a combination of sections that all get loaded contiguously in memory w/ the same access permissions. Sometimes, though, segments include additional parts of the binary file (for example, headers). Someone may find it useful to talk in terms of segment+offset locations rather than section+offset locations.

ghost commented 5 years ago

Got it. How about this for an address object:

{
 "sectionName": ".text",
 "sectionOffset": 1234,   # Byte offset from start of specified section.
 "segmentIndex": 4,       # 0-based index of the segment's entry in the ELF program header table.
 "segmentOffset": 5678    # Byte offset from start of specified segment.
}

Each pair is optional, both pairs can be present. Note that since this is JSON, the numbers are all decimal.

michaelcfanning commented 5 years ago

I'm not sure we want to create a bucket object that contains mutually exclusive properties that define the address. It would be preferable to provide generic object that could be repurposable for various scenarios.

Conceptually, as Tom notes, these addresses are chained, e.g., address of PE + sections header.virtual address (gives us start of a section) + offset. if you think about the problem this way, it is similar to logical locations (except that we're using actual binary internals to build the 'path to address'. the size of these pieces is relevant to analysis.

The BinSkim code may be helpful illustrating how that particular tool manipulates low level binary address information. Poking around, I notice a small other wrinkle, which relates to padding that may be required to produce the address. This BinSkim ImageFieldData construct actually looks close to what might work. I'm thinking something along the lines of:

{ baseAddress, // could be gleaned by examining the parent, think 'fullyQualifiedLogicalName' baseIndex, // information for current base, such as the section, think 'logicalLocationIndex' size, // size of this thing offset, // offset from the current base, to identify a thing within a container, like a specific import // name, // would be '.text' for an address that denotes this section, kind or type, // what is this thing? a segment? a section? padding and/or trailing bytes count, // ? not sure whether this needs to be expressed }

Again, very similar to the logical location and nested files mechanisms already in the format. And so, suffers/benefits from the same advantages and disadvantages. Note that the non-deterministic base address is similar in concept to the value associated with uriBaseId. Section header addresses + offsets are similar to the deterministic relative of paths under source control. Just trying to make SARIF connections to help connect this concept to existing patterns.

Here's the basic view of what we're up to:

(Virtual address of something) - (base address) = relative virtual address

The RVA + some unknown base address (such as the base address of the loaded module).

ghost commented 5 years ago

I don't think it needs to be that complicated. Unlike nested files and logical locations, these address objects have a limited nesting. I think the notion of a "parent" is overkill.

The design I proposed is specific to ELFs; you might consider that a drawback. OTOH, it supports ELF's well, and takes into account that in ELFs, sections have names but segments do not (hence the string-valued sectionName and the integer-valued segmentIndex). We could certainly change it to

{
  "sectionId": ....      # Either a string or on integer
  "byteOffset": 1234,
  "kind": ....           # Either "section" or "segment"; leaves open possible future support for
                         #  other executable formats.
}

... but it does run into the usual problem where our code gen can't deal with a property that might have more than one type.

Tom mentioned (see his bullet point on "effective address") that the absolute address might be hard to define in a clear way; I took that as a clue not to include it in the format. OTOH, for tools that know how to compute it, like BinSkim, we could include absoluteAddress and/or relativeVirtualAddress fields in the address object.

michaelcfanning commented 5 years ago

The limited nesting is a good point, one that occurred to me as well after posting the above. For me, I think the next best step is to look in detail at some analysis results of this kind. The SARIF, as always, should be a support for developers to prove to themselves that there is, in fact, a problem to act on. We should ask ourselves what that process would look like.

Consider a simple example, 'your imports section is executable'. Here are the addresses of interest:

1) Import table RVA location. Somewhere in the binary, we will find the RVA of the import table. This location has an RVA. 2) The section that contains the import table. 3) The section header contains a piece of data that indicates it is executable. 4) The import table itself, i.e., the RVA from #1

The interesting question is how a viewer wants to take the user through this information to conclude, yes, the section that contains my import section is marked executable. I could imagine getting a hex view of the binary with some outlines/chrome demarcating various things. You could imagine a structural view of that data within the binary.

michaelcfanning commented 5 years ago

A proposed design for an address descriptor:

baseAddress kind name index offset length

we should prepare a change draft and review in advance of a TC discussion

michaelcfanning commented 5 years ago

In addition to this core design, SARIF could populate a load map that would be persisted in a similar way as the files and logicalLocations table. address object could refer to these entries, which would also allow for parenting/nesting as other constructs do.

michaelcfanning commented 5 years ago

EBALLOT PROPOSAL Provide an address property on the location object in order to provide address details for results. Addresses can refer to a parent address, in order to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address objects at run.addresses consistent with other SARIF tables.

API IMPACT Define an address object. Add location.address (an address instance) Add run.addresses (an array of address instances) Add externalPropertyFiles.addresses to permit externalizing address data. Remove stackFrame.address and stackFrame.offset (both of type integer) and replace them with a new stackFrame.address property of type address. The address object contains the following properties: baseAddress : a base address rendered as a hexadecimal string kind: an open-ended string that identifies the address kind. 'section' and 'segment' are well-known values. name: a name that is associated with the address, e.g., '.text' offset: an offset from the base address, if present, rendered as a hexadecimal string index: an index into run.addresses used to retrieve a cached instance to represent the address parentIndex: an index into run.addresses to retrieve a parent address. the parent can provide a base address (from which the current offset value is relevant) and other details

ghost commented 5 years ago

Feedback from MS [UPDATED]:

By associated an address with a PLC, we will naturally have a mechanism for specifying regions, etc. The standard artifact parenting mechanism can be leveraged. An artifact has a timestamp, settling that concern.

kupsch commented 5 years ago

fileOffset can come from the physicalLocation property.

Should add a snippet object for the contents

michaelcfanning commented 5 years ago

TC33 conclusions:

address moves to physicalLocation add a fullyQualifiedName property to address add an artifactContents.rendered property (a multiformat string)

ghost commented 5 years ago

E-BALLOT #3 PROPOSAL

Provide an address property on the physicalLocation object to provide address details for results. Addresses can refer to a parent address, to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address objects at run.addresses consistent with other SARIF tables.

SCHEMA CHANGES

NOTES

An earlier proposal included adding stackFrame.address. But this isn't necessary because stackFrame has a location property of type location, which has a physicalLocation property, which will now have an address.

kupsch commented 5 years ago

The types of baseAddress and offset should be integer as this is their natural type. Other than for display purposes, they need to integers to compute absolute addresses or differences of addresses. Making this an integer means that JSON libraries will produce the correct type automatically. The viewer should determine the output format and displaying the value as hexadecimal is generally no more difficult than outputing a string or decimal representation of a number.

ghost commented 5 years ago

Approved in e-ballot-3.

ghost commented 5 years ago

I confirm that this change is correctly in the schema.