Closed fishoak closed 5 years ago
I agree that it is neither fish nor fowl, so it belongs on the location
object. @michaelcfanning, do you agree?
location object could contain this, potentially, or perhaps some other property. stack frames do have this notion. some tools provide a relative virtual address against some other unknown base. to be discussed at the f2f.
I asked some of our local binary analysis experts about this topic. Below are two responses:
From Tom Johnson:
On this particular topic, I'd say you almost definitely want addresses specified with respect to some base. And possibly allow multiple kinds of bases to be specified. Since the introduction of ASLR, most binaries these days are relocatable - and are actively relocated on purpose. That means dynamic analysis tools may see different addresses from one run to the next (unless one explicitly disables this). Even without ASLR, DLLs are relocated often and can even be relocated differently from one run to the next of the same program (e.g., consider dynamically loaded DLLs).
Also, if you're analyzing object files, there isn't really a fixed notion of absolute address in this context. Technically addresses have not been determined for object files. You can kind-of think of it as every section in an object file starts at address 0. CodeSonar deals with this by artificially loading the object files into a synthetic address space. Other tools may handle object files entirely differently. Certainly it's unlikely that other tools would have the same notion of artificial loading that CodeSonar does.
There are several common notions of address that you'll encounter:
"Effective address" - the absolute address in a program's logical memory that something gets loaded at. This is what most people think of when you say "address". And it is the concept that SWYX uses. But it will vary due to relocation operations as mentioned above. If you wanted to use this as the concept for Sarif, the "base" would probably be some notion a "default load". That is, you'd determine where an OS would put the program by default without any consideration for ASLR, some pre-specified resolution for relocation of conflicting DLLs, etc. Whatever absolute address you get there, that's the value you communicate here (instead of the actual address observed by a dynamic tool, say). Figuring this out in a canonical way could be non-trivial, and it's not something tools normally do when looking at something that has been relocated. So it might be a bit of a fuzzy/confusing concept to get right.
"Image or section offset" - this is the offset wrt some piece of the logical structure of the program when loaded into memory. Many static tools will know how to work with this concept. And if you say, "this is an address with respect to a base", this is the first type of thing that comes to mind. Whether it's an image or a section offset depends on what kind of artifacts you're dealing with. Section offsets are more natural when working with object files (which aren't combined in a full image yet). Image offsets show up more often with linked executables. However, that's by no means a firm rule. You can certainly talk about section offsets within a linked executable.
"File offset/address" - this is the absolute offset of a location within the binary file artifact, ignoring internal logical structure of the file. These are rare, but do show up on occasion. For example, the headers in a binary file often include file offsets here and there. The nice thing about a file offset is that you don't need to parse any of the structure of the file to know the location of what you might be interested in.
If forced to pick only one of these notions, the one that is likely going to allow the most reliable communication method would be section offset. Section offsets survive relocation operations. And section offsets generally make sense for both object files as well as linked executables and dynamic libraries. Most tools already know how to work with them.
The one caveat there is that section offsets cannot be used to locate objects that are not within a program's image. For example, if something is allocated on the heap or stack. Say something generates self-modifying code or something like that. Might be a high-class worry, but might motivate offering flexibility in the specification too.
From Eric Schulte:
The only thing I could add to Tom's response is to suggest that you use the name "address" instead of "effective address." We did a review of what most binary analysis tools use for this concept and "address" was the clear winner.
Sorry, closed in error!
Thanks for Tom and Eric for that detailed analysis.
The only thing I will add is that SARIF already has a construct for the "file offset/address" concept: it is physicalLocation.region.byteOffset
. This is what we would expect a tool to use if it wanted to report, for example, that there was something wrong in the file's header at offset 0xA0 from the start of the file.
I agree with you that "image or section offset" makes sense to express an address that might be relocated. We might imagine an address
object that fits into the existing SARIF object model like this:
{ # A SARIF location object.
"physicalLocation": {
"fileLocation": {
"uri": "main.obj"
},
"address": {
"section": ".TEXT",
"offset": 1234
}
}
}
Tom, could you please explain the scenario where we'd need to specify multiple bases? If we need to support that, we could simply give the locations
object an addresses
property whose value is an array of address
objects:
{ # A SARIF location object.
"physicalLocation": {
"fileLocation": {
"uri": "main.obj"
},
"addresses": [
{
"section": ".TEXT",
"offset": 1234
},
...
]
}
}
I'm not sure there's a need for multiple bases for a single location. My comment was that you might want to recognize multiple kinds of bases. And it sounds like you're already considering that with file-offset and section-offset.
So for a structured executable file, you have a clear notion of content organized by sections. And those sections could be relocated in memory at different locations at load time. So, a section-based offset makes sense here.
Another possible subject might be a ROM image for firmware that just gets directly injected into memory somewhere. There may be no internal structure to the image - it's just a block of bytes. In this case, the file-offset notion is appropriate.
We had an in-house discussion that in ELF binaries, there is also a notion of "segment". Typically a segment is a combination of sections that all get loaded contiguously in memory w/ the same access permissions. Sometimes, though, segments include additional parts of the binary file (for example, headers). Someone may find it useful to talk in terms of segment+offset locations rather than section+offset locations.
Got it. How about this for an address
object:
{
"sectionName": ".text",
"sectionOffset": 1234, # Byte offset from start of specified section.
"segmentIndex": 4, # 0-based index of the segment's entry in the ELF program header table.
"segmentOffset": 5678 # Byte offset from start of specified segment.
}
Each pair is optional, both pairs can be present. Note that since this is JSON, the numbers are all decimal.
I'm not sure we want to create a bucket object that contains mutually exclusive properties that define the address. It would be preferable to provide generic object that could be repurposable for various scenarios.
Conceptually, as Tom notes, these addresses are chained, e.g., address of PE + sections header.virtual address (gives us start of a section) + offset. if you think about the problem this way, it is similar to logical locations (except that we're using actual binary internals to build the 'path to address'. the size of these pieces is relevant to analysis.
The BinSkim code may be helpful illustrating how that particular tool manipulates low level binary address information. Poking around, I notice a small other wrinkle, which relates to padding that may be required to produce the address. This BinSkim ImageFieldData construct actually looks close to what might work. I'm thinking something along the lines of:
{ baseAddress, // could be gleaned by examining the parent, think 'fullyQualifiedLogicalName' baseIndex, // information for current base, such as the section, think 'logicalLocationIndex' size, // size of this thing offset, // offset from the current base, to identify a thing within a container, like a specific import // name, // would be '.text' for an address that denotes this section, kind or type, // what is this thing? a segment? a section? padding and/or trailing bytes count, // ? not sure whether this needs to be expressed }
Again, very similar to the logical location and nested files mechanisms already in the format. And so, suffers/benefits from the same advantages and disadvantages. Note that the non-deterministic base address is similar in concept to the value associated with uriBaseId. Section header addresses + offsets are similar to the deterministic relative of paths under source control. Just trying to make SARIF connections to help connect this concept to existing patterns.
Here's the basic view of what we're up to:
(Virtual address of something) - (base address) = relative virtual address
The RVA + some unknown base address (such as the base address of the loaded module).
I don't think it needs to be that complicated. Unlike nested files and logical locations, these address objects have a limited nesting. I think the notion of a "parent" is overkill.
The design I proposed is specific to ELFs; you might consider that a drawback. OTOH, it supports ELF's well, and takes into account that in ELFs, sections have names but segments do not (hence the string-valued sectionName
and the integer-valued segmentIndex
). We could certainly change it to
{
"sectionId": .... # Either a string or on integer
"byteOffset": 1234,
"kind": .... # Either "section" or "segment"; leaves open possible future support for
# other executable formats.
}
... but it does run into the usual problem where our code gen can't deal with a property that might have more than one type.
Tom mentioned (see his bullet point on "effective address") that the absolute address might be hard to define in a clear way; I took that as a clue not to include it in the format. OTOH, for tools that know how to compute it, like BinSkim, we could include absoluteAddress
and/or relativeVirtualAddress
fields in the address
object.
The limited nesting is a good point, one that occurred to me as well after posting the above. For me, I think the next best step is to look in detail at some analysis results of this kind. The SARIF, as always, should be a support for developers to prove to themselves that there is, in fact, a problem to act on. We should ask ourselves what that process would look like.
Consider a simple example, 'your imports section is executable'. Here are the addresses of interest:
1) Import table RVA location. Somewhere in the binary, we will find the RVA of the import table. This location has an RVA. 2) The section that contains the import table. 3) The section header contains a piece of data that indicates it is executable. 4) The import table itself, i.e., the RVA from #1
The interesting question is how a viewer wants to take the user through this information to conclude, yes, the section that contains my import section is marked executable. I could imagine getting a hex view of the binary with some outlines/chrome demarcating various things. You could imagine a structural view of that data within the binary.
A proposed design for an address descriptor:
baseAddress kind name index offset length
we should prepare a change draft and review in advance of a TC discussion
In addition to this core design, SARIF could populate a load map that would be persisted in a similar way as the files and logicalLocations table. address object could refer to these entries, which would also allow for parenting/nesting as other constructs do.
EBALLOT PROPOSAL Provide an address
property on the location
object in order to provide address details for results. Addresses can refer to a parent address, in order to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address
objects at run.addresses
consistent with other SARIF tables.
API IMPACT
Define an address
object.
Add location.address
(an address
instance)
Add run.addresses
(an array of address
instances)
Add externalPropertyFiles.addresses
to permit externalizing address
data.
Remove stackFrame.address
and stackFrame.offset
(both of type integer
) and replace them with a new stackFrame.address
property of type address
.
The address
object contains the following properties:
baseAddress
: a base address rendered as a hexadecimal string
kind
: an open-ended string that identifies the address kind. 'section' and 'segment' are well-known values.
name
: a name that is associated with the address, e.g., '.text'
offset
: an offset from the base address, if present, rendered as a hexadecimal string
index
: an index into run.addresses
used to retrieve a cached instance to represent the address
parentIndex
: an index into run.addresses
to retrieve a parent address. the parent can provide a base address (from which the current offset
value is relevant) and other details
Feedback from MS [UPDATED]:
address
to the physical location object.run.addresses
.memoryObject
and anything else the TC thinks appropriate.By associated an address with a PLC, we will naturally have a mechanism for specifying regions, etc. The standard artifact parenting mechanism can be leveraged. An artifact has a timestamp, settling that concern.
fileOffset
can come from the physicalLocation
property.
Should add a snippet object for the contents
TC33 conclusions:
address moves to physicalLocation add a fullyQualifiedName property to address add an artifactContents.rendered property (a multiformat string)
Provide an address
property on the physicalLocation
object to provide address details for results. Addresses can refer to a parent address, to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address
objects at run.addresses
consistent with other SARIF tables.
Define an address
object with the following properties:
baseAddress
of type string
: a base address rendered as a hexadecimal stringkind
of type string
: an open-ended string that identifies the address kind. "section"
and "segment"
are well-known values.name
of type string
: a name that is associated with the address, e.g., ".text
".fullyQualifiedName
of type string
.offset
of type string
: an offset from baseAddress
, if present, rendered as a hexadecimal string.index
of type integer
: an index into run.addresses
used to retrieve a cached instance to represent the address.parentIndex
of type integer
: an index into run.addresses
to retrieve a parent address. The parent can provide a base address (from which the current offset
value is relevant) and other details.In the physicalLocation
object:
address
of type address
, optional.In the run
object:
addresses
of type address[]
, optional, minItems: 0, default: []
, externalizable.In the stackFrame
object:
address
and offset
.In the artifactContents
object:
rendered
of type multiformatString
, optional.In the artifact
object:
roles
property, add a value "memoryContents"
.An earlier proposal included adding stackFrame.address
. But this isn't necessary because stackFrame
has a location
property of type location
, which has a physicalLocation
property, which will now have an address.
The types of baseAddress and offset should be integer as this is their natural type. Other than for display purposes, they need to integers to compute absolute addresses or differences of addresses. Making this an integer means that JSON libraries will produce the correct type automatically. The viewer should determine the output format and displaying the value as hexadecimal is generally no more difficult than outputing a string or decimal representation of a number.
Approved in e-ballot-3.
I confirm that this change is correctly in the schema.
Binary analysis tools may report issues in terms of an effective address. These are not physical locations (as we have been thinking of them) and none of the options that the regions offer suffice to express such addresses adequately. Addresses are not offsets in any meaningful sense. One could shoe-horn them into location.fullyQualifiedLogicalName using kind="address", but that feels unsatisfactory.
The only place in the spec where an address is allowed is in a stackFrame object, but note that stackFrame.location exists too.
Consequently, I propose that we move the address object into the location. It could either go in as a new property location.address, or it could be pushed down into either the physical or logical location properties. Conceptually, it is neither, but I concede that an address is more like a physical thing than a logical thing, so it could go into a new optional property of physicalLocation named "address".