Redesign `goblin::elf::dynamic::DynamicInfo` to expose the optionality of `DT_JMPREL`

m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust

MIT License

1.17k stars 156 forks source link

Redesign `goblin::elf::dynamic::DynamicInfo` to expose the optionality of `DT_JMPREL` #344

Open koutheir opened 1 year ago

koutheir commented 1 year ago

In ELF, the DT_JMPREL dynamic array tag is optional, and can be present (holding an address value) or absent. The value of that tag is exposed as goblin::elf::dynamic::DynamicInfo.jmprel, which is of type usize. DynamicInfo.jmprel is set to zero if DT_JMPREL is absent from the ELF file.

Therefore, based on the definition of goblin::elf::dynamic::DynamicInfo, there is no way to differentiate the absence of DT_JMPREL from its presence with the value zero, because zero can be a valid address, especially in privileged mode and in embedded systems.

I think DT_JMPREL should be exposed by DynamicInfo as optional information. We should make the type of DynamicInfo.jmprel an Option<usize> instead.

m4b commented 1 year ago

Sounds good to me, hopefully it’s straightforward fix? Good to get this stuff right now. Are there any other fields that would have similar semantics?

koutheir commented 1 year ago

hopefully it’s straightforward fix?

The fix for jmprel is not complicated, but it is a breaking change.

Are there any other fields that would have similar semantics?

Yes, there are a lot, as shown in Figure 2-7. Dynamic Array Tags, d_tag in the Tool Interface Standard (TIS), Executable and Linking Format (ELF) Specification, Version 1.2 and Dynamic Linking.

The following tags are optional in an executable:

DT_BIND_NOW
DT_DEBUG
DT_FINI
DT_INIT
DT_JMPREL
DT_NEEDED
DT_PLTGOT
DT_PLTREL
DT_PLTRELSZ
DT_RPATH
DT_TEXTREL

The following tags are optional in a shared object:

DT_BIND_NOW
DT_FINI
DT_INIT
DT_JMPREL
DT_NEEDED
DT_PLTGOT
DT_PLTREL
DT_PLTRELSZ
DT_REL
DT_RELA
DT_RELAENT
DT_RELASZ
DT_RELENT
DT_RELSZ
DT_SONAME
DT_SYMBOLIC
DT_TEXTREL

The following tags may be optional (they are unspecified):

All values from DT_LOPROC to DT_HIPROC (0x70000000-0x7fffffff).

In addition to the optional vs. mandatory nature, there are dependencies between some tags:

No tags can follow DT_NULL.
If DT_JMPREL is present, then DT_PLTREL and DT_PLTRELSZ become mandatory.
If DT_RELA is present, then DT_RELASZ and DT_RELAENT become mandatory.
If DT_REL is present, then DT_RELSZ and DT_RELENT become mandatory.

Therefore, the following fields in DynamicInfo require changes:

rela, relasz, relaent, relacount, rel, relsz, relent, relcount, pltgot, pltrelsz, pltrel, jmprel, init, fini, soname, textrel: they are all optional, with interdependencies.
hash: this is mandatory and must not be optional.

I may have missed some other tags, and I came into the conclusion that:

Most tags should be considered optional, unless specifically stated otherwise.
Some tags that depend on each other should be represented as structures or enumerations, instead of simple integers.
We need to provide access to values of dynamic tags that we don't understand or process.

I have a question, though: why are we transforming the values of some tags from virtual addresses into file offsets?

koutheir commented 1 year ago

@m4b, the following is an idea about the new DynamicInfo (here called DynamicTags64), which now has zero public fields. Please review it before I attempt to implement it.

impl DynamicTags64 {
    pub fn parse() -> Self {}

    pub fn tags_iter(&self) -> impl Iterator<Item = (NonZeroU64, u64)> {}

    // DT_NEEDED
    pub fn needed_libraries_iter(&self) -> impl Iterator<Item = &CStr> {}

    // DT_JMPREL, DT_PLTRELSZ, DT_PLTREL
    pub fn procedure_linkage_table_relocation_info(&self) -> Option<ProcedureLinkageTableRelocations64> {}

    // DT_PLTGOT
    pub fn global_offset_table_address(&self) -> Option<u64> {}

    // DT_HASH
    pub fn symbol_hash_table_address(&self) -> u64 {}

    // DT_STRTAB, DT_STRSZ
    pub fn string_table_info(&self) -> StringTable64 {}

    // DT_SYMTAB, DT_SYMENT
    pub fn symbol_table_info(&self) -> SymbolTable64 {}

    // DT_RELA, DT_RELASZ, DT_RELAENT
    pub fn explicit_relocation_table_info(&self) -> Option<RelocationTable64> {}

    // DT_REL, DT_RELSZ, DT_RELENT
    pub fn implicit_relocation_table_info(&self) -> Option<RelocationTable64> {}

    // DT_INIT
    pub fn initialization_function_address(&self) -> Option<u64> {}

    // DT_FINI
    pub fn termination_function_address(&self) -> Option<u64> {}

    // DT_SONAME
    pub fn shared_object_name(&self) -> Option<&CStr> {}

    // DT_RPATH, DT_RUNPATH
    pub fn library_search_path_info(&self) -> LibrarySearchPaths {}

    // DT_SYMBOLIC
    pub fn local_symbol_resolution_starts_from_shared_object(&self) -> bool {}

    // DT_DEBUG
    pub fn debug_value(&self) -> Option<u64> {}

    // DT_TEXTREL
    pub fn relocations_can_modify_read_only_segments(&self) -> bool {}

    // DT_BIND_NOW
    pub fn process_all_relocations_early(&self) -> bool {}

    // DT_INIT_ARRAY, DT_INIT_ARRAYSZ
    pub fn initialization_functions_iter(&self) -> impl Iterator<Item = u64> {}

    // DT_FINI_ARRAY, DT_FINI_ARRAYSZ
    pub fn termination_functions_iter(&self) -> impl Iterator<Item = u64> {}

    // DT_INIT
    pub fn flags(&self) -> Option<u64> {}

    // DT_PREINIT_ARRAY, DT_PREINIT_ARRAYSZ
    pub fn pre_initialization_functions_iter(&self) -> impl Iterator<Item = u64> {}
}

where:

pub struct ProcedureLinkageTableRelocations64 {
    pub address: u64,
    pub total_size: u64,
    pub relocations_type: u64,
}

pub struct StringTable64 {
    pub address: u64,
    pub total_size: u64,
}

pub struct SymbolTable64 {
    pub address: u64,
    pub entry_size: u64,
}

pub struct RelocationTable64 {
    pub address: u64,
    pub total_size: u64,
    pub entry_size: u64,
}

pub struct LibrarySearchPaths<'elf> {
    pub rpath: DynamicTagsLibrarySearchPathIter<'elf>,
    pub runpath: DynamicTagsLibrarySearchPathIter<'elf>,
}

struct DynamicTagsIter64<'elf> {}

impl<'elf> Iterator for DynamicTagsIter64<'elf> {
    type Item = (NonZeroU64, u64);
    fn next(&mut self) -> Option<Self::Item> {}
}

struct DynamicTagsNeededIter64<'elf> {}

impl<'elf> Iterator for DynamicTagsNeededIter64<'elf> {
    type Item = &'elf CStr;
    fn next(&mut self) -> Option<Self::Item> {}
}

pub struct DynamicTagsLibrarySearchPathIter<'elf> {}

impl<'elf> Iterator for DynamicTagsLibrarySearchPathIter<'elf> {
    type Item = &'elf CStr;
    fn next(&mut self) -> Option<Self::Item> {}
}

branan commented 1 year ago

hash: this is mandatory and must not be optional.

While this is correct according to the sysv ABI, recent version of gnu ld have stopped emitting DT_HASH in favor of only including DT_GNU_HASH. hash should be considered optional in Goblin, or else it will be impossible to parse files built with modern toolchains.

I'm unsure of what "recent" means here, but at the very least programs built on Ubuntu 22.04 do not include a DT_HASH entry

m4b commented 1 year ago

@koutheir thanks for taking time to write this up and the api proposal.

So this would be a moderate breaking change, and tentatively i'm ok with it.

Some initial thoughts after reading your proposal:

I generally don't like "Java bean" style getters, but as you point out the inter-relations between the tags and the optionality of some may make this a better rusty fit
I'm unsure about the renaming of the tags -> function names that describe the tag, e.g. textrel -> relocations_can_modify_read_only_segments a. while this is a quite good description actually of what the tag means, i'm worried there is a discrepancy between other parts of goblin/elf which tend to skew towards faithfully using field names/type names as close as possible to the elf headers/etc., and this new proposal which makes it more "rusty" + more descriptive names. This could be strange for users expecting to look for something like textrel and not knowing there is a function name that describes what textrel is, if that makes any sense?
otherwise your proposal is really good, and seems reasonable to me, and use of optionals better reflects the semantics of these tags and their presence (or not), etc.
returning $size instead of usize probably also makes sense; DynamicInfo was originally just a quick struct to get this information easily when i was working on dryad ages ago and didn't want to cast everywhere; in hindsight it was probably a mistake, users should decide to cast if they want or not.
re your question why some are turned into vm offsets, i think this was primarily for use in dryad at the time, which used those values explicitly, or it was used elsewhere for indexing (many of those fields are indexes so usize was more convenient)

Anyway, those are my initial concerns, mostly:

goblin tries to minimize breaking changes at this stage (even though not 1.0, which I've wanted for a while, really I should just fix up some minor things and slap on 1.0 and that's it)
the api looks good, though it seems like it doesn't fit well with the rest of structs goblin provides, maybe some others have some thoughts on this?

koutheir commented 1 year ago

I generally don't like "Java bean" style getters...

Can you explain what you mean?

while this is a quite good description actually of what the tag means, i'm worried there is a discrepancy between other parts of goblin/elf which tend to skew towards faithfully using field names/type names as close as possible to the elf headers/etc.,

I'm not sure I understand the approach taken in goblin. I effectively observed this reuse of ELF abbreviations in the goblin interface, reflecting the ELF structures underneath, but then I also see issues such as #345 where assumptions are made just to give the illusion that strings are UTF-8 character sequences inside the ELF string table. I think we need to stick to a specific level of abstraction and remain consistent. Please let me know which level of abstraction I should assume.

This could be strange for users expecting to look for something like textrel and not knowing there is a function name that describes what textrel is...

The approach I take in my crates, in order to fix this issue, is to use the #[doc(alias = "...")] attribute on functions, in order to:

Give the function a meaningful name.
Allow multiple terms to refer to that same function.

For example, in my selinux crate, you can find the function SecurityContext.current() by searching for the term current (which is what describes the feature in general) or by looking for the underlying libselinux function name getcon (when that information is known beforehand). This allows the crate API to use meaningful names, while allowing accurate and fast finding of information through searching.

returning $size instead of usize probably also makes sense...

What is $size is this context?

users should decide to cast if they want or not.

I agree, but then again, which level of abstraction am I to assume. Should I give the actual information, even if it means handing out u64 from ELF64 binaries, or should I assume (for the user) that sizes and offsets and the like won't exceed usize and truncate values?

re your question why some are turned into vm offsets, i think this was primarily for use in dryad at the time, which used those values explicitly, or it was used elsewhere for indexing (many of those fields are indexes so usize was more convenient)

This is problematic, because there is nothing in the documentation or API that indicates that these values are file offsets, instead of (what ELF documents) virtual addresses. I don't expect people to read the crate's source code to figure out why they are getting the wrong number in these fields.

goblin tries to minimize breaking changes at this stage (even though not 1.0, which I've wanted for a while, really I should just fix up some minor things and slap on 1.0 and that's it)

I'm very grateful for goblin, and the effort that went into it, and allowing the community to enjoy it. However, I feel the API still needs breaking changes.

the api looks good, though it seems like it doesn't fit well with the rest of structs goblin provides

I agree, and I value consistency. Because of that, I think we need to either (1) adapt the interface to make it more like the rest of the crate, or (2) adapt the crate and make it more like this interface. I vote for (2).