Support for text based stub libraries (.tbd) files

indygreg commented 3 years ago

Thank you for maintaining goblin. It is a joy being able to open binary files from any platform and analyze their contents without having to install a myriad of tools to support various binary formats.

I recently found myself wanting to parse text based stub libraries (.tbd files) from Rust and was curious if you would be receptive to including support in goblin. (I might contribute support myself.)

.tbd files are essentially descriptors of mach-o dylibs. Apple uses them in their SDKs to describe dylibs. I think the motive behind these files is it enables linkers to do their job using a minimal representation of the dylib without having to ship full dylibs in SDKs. This helps reduce the size of the SDK.

I'm unsure if there is a canonical specification for this file format. However, there's a comprehensive inline comment in the LLVM source code at https://github.com/llvm/llvm-project/blob/main/llvm/lib/TextAPI/MachO/TextStub.cpp that defines it.

The file format is YAML. If I were to implement support for parsing these files in Rust, I'd likely define a bunch of Rust structs representing the various components and then use serde for (de)serialization from/to YAML. If we did this in goblin, we'd pick up a handful of new crate dependencies. I'm unsure if that would be desirable. Of course, we could always define a conditional crate feature to toggle support for text based stub libraries.

Given that text based stub libraries describe mach-o libraries and are used widely on Apple platforms, I can make a compelling case for their inclusion in goblin as a supported format. I was unable to find any Rust crates for parsing this file format on crates.io, so there appears to be a market need.

Are you interested in supported text based stub libraries in goblin? If so, do you have any thoughts on YAML parsing and new crate dependencies?

willglynn commented 3 years ago

My sense is that the data structures and parsing belongs in a new tbd crate, and that goblin should have an optional tbd feature which pulls in that crate as a dependency and adds whatever logic belongs in goblin to abstract over the .dylib/.tbd commonality. @m4b?

m4b commented 3 years ago

@indygreg first off, thank you for the kind words! :)

So re .tbd files, the short is that I would/could be persuaded to include them in goblin. The long is follows, with some pros and cons, and rationale.

So generally speaking, in the past, I've opted for:

very little dependencies
including as submodules in goblin things which are directly related to "binaries" or artifacts in the linking phase (including, ideally, both reading and writing of those); today, we have archive, elf, mach, and pe (and strtab as a helper in all three).

These were the general motivating "directives" of the library. So e.g., when there was some push for wasm support, i was 50/50 on including/not including, creating own submodule, or pulling in external, etc. We may still do that!

But on the subject of .tbd's, while they are a textual format, they're actually treated effectively as a drop in replacement as dynamic libraries on darwin based systems by the linker, so there is definitely a very strong case for just adding another variant in the enum based parser.

On that note, i think it could be reasonable to add the variant, e.g. here:

#[derive(Debug)]
#[allow(clippy::large_enum_variant)]
/// Either a collection of multiple architectures, or a single mach-o binary
pub enum Mach<'a> {
    /// A "fat" multi-architecture binary container
    Fat(MultiArch<'a>),
    /// A regular Mach-o binary
    Binary(MachO<'a>),
    /// A Tbd file
    TBD(TBD<'a>), // or maybe a `&'a str` ?
}

As noted by @willglynn a way to do this might be to have an external tbd crate, which we conditionally include with a new feature ("tbd", perhaps).

There are two major problems with this approach:

I generally believe that types should be invariant across feature cfg's (and i have violated this elsewhere, notably in scroll's error enum for no_std)
I like to be in control of external deps (I've seen dependency spray explode in other projects, where every new update seems to add some new random crate for doing something) (also don't get me wrong, I love cargo crates system and deps!)

so the above two are the major problems I see in adding support for .tbd using an external crate, 1. types whose size/variants change across cfgs, 2. non-control of deps.

In particular, parsing/loading the .tbd might involve:

serde
yaml crate

which will likely massively explode the amount of deps. Maybe this can be "fixed" with a cfg on that variant, etc., but this is in violation of 1., and it just doesn't seem very clean to me.

Also, having had experience with cfgs and rust features, used like this I do believe it's an anti-pattern, and can cause unintentional bloat and/or unusual recompiles when features don't unify in a workspace with multiple crates.

Anyway, these are some of my fears, and so i'm heavily biased adding parsing functionality in goblin itself for tbd (as opposed to external deps); however, as I said, I could be persuaded.

Some random thoughts/questions:

.tbd files are fairly simple; could we write a custom "tbd" parser using scroll
for starters, if we did 1., I think the user will only really want/care about the architectures + symbol list the .tbd file provides, which should be fairly straightforward?
On that note, what particular data/information should we extract from the .tbd in goblin; or put another way, if we're just doing let tbd = yaml::deserialize(tbd_file)? why does/would this need to be in goblin? maybe a good compromise might be to add a TBD hint then instead, and let user deserialize using yaml/whatever crate they want? And if we aren't just returning the yaml-in-struct form, maybe this is a semi-compelling argument to just do newline parsing + roll-your-own-tbd parser inside goblin to get out e.g, the symbols + architecture?

So to summarize:

I agree it would be nice to include/detect/parse? .tbd files (the granularity is what we do exactly :) )
I would prefer not to have conditional cfg's + an external tbd crate that i'm not in control of
if all we're doing is deserializing the yaml, so we can return e.g, the symbols and arch + some other info (I don't think there's much else besides the version and the lib that provides it), maybe it's better to do parsing ourselves and/or just return a &'a str to let the user do parsing themself with whatever they want, or yaml structured deserializer via serde, or etc. the downside is this could be brittle? i'm not sure

Thoughts appreciated :D

EDIT: added clarification i'm heavily biased towards adding parsing functionality inside goblin itself, instead of using external crate with cfgs

indygreg commented 3 years ago

I decided to shave a yak today and I implemented a minimal crate for parsing .tbd using serde-yaml. Unfortunately, I learned when doing this that yaml-rust doesn't expose tags on YAML documents/hashmaps (https://github.com/dtolnay/serde-yaml/issues/147), so I had to do something very ugly to parse the YAML (https://github.com/indygreg/PyOxidizer/blob/ca1c80bbcce64fc58d5f9a9806055bd32ff7a567/text-stub-library/src/lib.rs#L71). It works. But I feel dirty.

This would seemingly be more ammunition for not using an uncontrolled 3rd party crate for the .tbd parsing. The YAML syntax in use in .tbd files seem simple enough that a simple parser could likely be cooked up. (Although TBH I'm unsure if I have the stomach to implement a custom parser for goblin. Serde yes, custom parser I don't know.)

FWIW I validated my .tbd parser by parsing every .tbd file in every SDK in every Xcode version installed on the GitHub Actions workers and it worked. So I have some degree of confidence that the schema is correct, at least for version 3 and 4 of the format. (Version 4 seems to have been introduced for the macOS 11 SDK AFAICT in order to support multi-arch libraries.) Feel free to steal the Rust structs if you want. The code is MPL 2.0. But I give you permission to relicense the code in that crate under whatever license you want. All the attribution I need is maybe a reference in the commit message.

m4b commented 3 years ago

So i'm wondering if a simple, more flexible approach here is just to detect the tbd (glancing at your code, this seems somewhat complicated/tedious), and just return a TBD(&'a str) for the variant? That way downstream users can parse, do whatever they want with it, and we can sidestep the whole external deps/serde/yaml situation in goblin?

indygreg commented 3 years ago

If you are proposing goblin::Object::parse() would return a goblin::Object::Mach TBD(&'a str) variant with the unparsed YAML content, that seems like a reasonable first step to me. The API docs could potentially advertise a 3rd party crate for parsing if we wanted.

In order to avoid a future API break, do you think it would be worth declaring a new TBD type - initially an alias or thin wrapper around &'a str so that in the future if we wanted to add TBD parsing to goblin or hang additional methods off that type, we could do so without an API break? Or does YAGNI apply?

willglynn commented 3 years ago

YAML supports UTF-8, UTF-16, and UTF-32 in both byte orders (see YAML 1.2 § 5.2). &str is specifically UTF-8, so perhaps we should use TBD(&'a [u8]) instead.

I've been assuming that .tbd is a YAML document with a specific structure which would necessitate supporting the entirety of the YAML spec. That's a Herculean task that even dedicated YAML parser libraries fail. It's also possible that .tbd is instead a specific subset of YAML, in which case it might be practical and reasonable to build support into goblin with a scroll-based parser. In the absence of a formal specification, I guess we could run experiments and match the behavior of the first party tools? 🤷‍♂️

Edit: I'd be happy with an TBD newtype which is AsRef<[u8]> and could someday and/or optionally TryInto a parsed type.

m4b commented 3 years ago

n order to avoid a future API break, do you think it would be worth declaring a new TBD type - initially an alias or thin wrapper around &'a str so that in the future if we wanted to add TBD parsing to goblin or hang additional methods off that type, we could do so without an API break? Or does YAGNI apply?

Oh I actually think I like this idea, I believe you’re proposing eg:

pub struct Tbd {
  pub bytes: &[u8]
}

then as first approximation (Maybe final!) we can direct users to parse the file using eg your crate if you publish it or, etc?

On the other hand the user could just do all the same by attempting such a parse if they encounter an Unknown variant?

I guess the advantage to returning the TBD(TBD) variant is just to signal to the user we didn’t encounter something expected (but still leave it up to them how to parse it further...)

indygreg commented 3 years ago

Oh I actually think I like this idea, I believe you’re proposing eg:
pub struct Tbd {
  pub bytes: &[u8]
}
then as first approximation (Maybe final!) we can direct users to parse the file using eg your crate if you publish it or, etc?

That's exactly what I'm proposing.

I guess the advantage to returning the TBD(TBD) variant is just to signal to the user we didn’t encounter something expected (but still leave it up to them how to parse it further...)

Yes. By introducing the enum variant, goblin effectively says we know this data is related to mach-o. That could be a useful signal to end-users and a nudge that they should consider doing something with the data. It does add an API break though. I'm unsure if you are willing to do that with the minimal added benefit considering there would be no built-in support for reading the TBD content. Although we could potentially throw something useful in there, such as sniffing the YAML document versions and exposing the TBD version via the goblin API.

m4b / goblin

Support for text based stub libraries (.tbd) files #265