m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust
MIT License
1.18k stars 156 forks source link

Goblin ergonomics #58

Open m4b opened 6 years ago

m4b commented 6 years ago

This deeply annoys me:

    let peek = goblin::peek(&mut fd)?;
    if let Hint::Unknown(magic) = peek {
        println!("unknown magic: {:#x}", magic)
    } else {
        let bytes = { let mut v = Vec::new(); fd.read_to_end(&mut v)?; v };
        match peek {
            Hint::Elf(_) => {

I think there's an architectural problem and an ergonomics problem here.

  1. We don't want to read the entire binary/file if it doesn't even have proper magic; this is what peek is for
  2. If we pass the peek test, we want to read the file in full, and pass these bytes into goblin, and receive the enum variant
  3. We don't want to even think about the Unknown variant - we already peeked to make sure the magic is good! - everything else is just a parse error for that respective file format

So, what I want is my cake and eat it too:

I want the peek to ensure the magic is correct, and route to the correct binary parser, and return this result, without the Unknown variant, without temporarily allocations (or the full fd read is passed through).

@philipc @endeav0r you seem to be using goblin::Object as clients, does this bother you?

Anyone who happens to be watching/reading this, I'm open to proposals how to fix this make it nicer.

Afaics, being flexible w.r.t. the bytes + reading is going to be tricky; first thing that comes to my mind is some kind of closure style or an inout, like:

// has no Unknown variant, and is also totally me just randomly typing stuff
let object: Option<Result<Object>> = Object::parse_and_fill(fd, &mut bytes);

This would be a breaking change, but I think its important to get right sooner rather than later

philipc commented 6 years ago

I haven't used goblin::Object (or if I have I've forgotten where).

So far I've only used goblin with memmap input, and I think the existing Object::parse would work for that.

The proposed parse_and_fill makes sense to me. Why would it be a breaking change to add it? I tend to prefer ordering the return value as Result<Option<Object>> though.

m4b commented 6 years ago

It could be a breaking change because i want to remove the Unknown variant since the peek ostensibly makes it redundant, or forces an unreachable clause (which is a sign of bad design to me)

sunfishcode commented 6 years ago

What if peek returned an Option<Hint>?

endeav0r commented 6 years ago
  1. I am 100% ok with breaking changes.
  2. This is confusing: https://github.com/m4b/goblin/blob/master/src/lib.rs#L200 . Is is_lsb little endian?
  3. How about something like:
let goblin_loader = goblin::load(&mut fd);
let endianness = goblin_loader.peek().endian(); // Option<goblin::Endian>
let architecture = goblin_loader.peek().architecture(); // Option<goblin::Architecture>
let elf = if goblin_loader.is_elf() {
    goblin_loader.elf() // elf = Option<goblin::elf::Elf>
}
else {
    panic!("Not an elf")
}
let elf = match goblin_loader.executable() {
    match goblin::loader::Elf(elf) => elf, // goblin::elf::Elf
    _ => panic!("Not an elf")
};

Not exactly, but you kind of get the idea.

m4b commented 6 years ago

So to be clear the pattern I don't like is this:

    if let Hint::Unknown(magic) = peek {
        println!("unknown magic: {:#x}", magic)
    } else {
        let bytes = { let mut v = Vec::new(); fd.read_to_end(&mut v)?; v };
        let object = Object::parse(&bytes)?;
        match object {
         Object::Elf(elf) => {
             println!("elf: {:#?}", &elf);
         },
         Object::PE(pe) => {
             println!("pe: {:#?}", &pe);
         },
         Object::Mach(mach) => {
             println!("mach: {:#?}", &mach);
         },
         Object::Archive(archive) => {
             println!("archive: {:#?}", &archive);
         },
         Object::Unknown(magic) => { unreachable!() }
        }

That last unreachable in Object::Unknown is, to use an unpalatable term, "has code smellies" to me ;)

I actually don't think there is way to fix this, short of adding dependent types to rust :thinking:

So there's a couple of problems:

  1. object needs bytes (it has a lifetime), so we can't expose an api like fn parse<R: Read>(r: R) -> Object (it actually used to be like this prior to move to zero-copy)
  2. we don't want to read a whole file just to discard it because the magic is bad
  3. if we peek, the if condition guarantees us that the Unknown variant cannot occur, hence its unreachable. But this is runtime information (hence the semi flippant dependent types remark)

To be precisely clear, I wanted to remove the Object::Unknown variant, but I don't think this is possible if we have an api like fn parse(bytes: &[u8]) -> Object since the bytes could have a bad magic.

Alternatively, we could return a Error::BadMagic variant when the magic doesn't match our list of acceptable magicks. But this would then require the client to examine the error for the specific BadMagic variant, instead of the Unknown variant, which I don't like, as it seems to "overload" the error type in some sense.

Of course this could be solved by also returning a new Object variant without the unknown case, and only provided by a function like fn peek_and_parse which returns the special variant iff the peek does not fail, and a None otherwise, e.g. something like:

pub enum DefinitelyABinary {
  Mach(mach),
  Elf(elf),
  PE(pe),
}

pub fn peek_and_parse<R: Read>(fd: R, bytes: &mut Vec<u8>) -> Result<Option<DefinitelyABinary>> {
  match peek(fd) {
    Unknown(_) => None // we lose magic...
    Peek::Elf(_) => fd.read_to_end(bytes); DefinitelyABinary(elf::Elf::parse(&bytes)
    Peek::Mach(_) => // same but parse mach
    Peek::PE(_) => // same but parse pe
  }
}

but this isn't very satisfying to me either. Maybe it should be. I dunno?

I'm not sure there's a perfect solution here :/

EDIT

To preserve the magic information, we probably want a signature like:

type UnknownMagic = usize;
Result<Result<DefinitelyABinary, UnknownMagic>, goblin::Error>

But the double result will very likely just get ignored via ? Maybe that's ok?

m4b commented 6 years ago
let object = Object::peek_and_parse(fd, &mut bytes)??;

It looks we're angrily confused about the goblin library and this function call :rofl:

sunfishcode commented 6 years ago

How about this?

    let peek = goblin::peek(&mut fd)?;
    match peek {
        None => println!("unknown magic: {:#x}", magic),
        Some(magic) => {
            let bytes = { let mut v = Vec::new(); fd.read_to_end(&mut v)?; v };
            let object = Object::parse(magic, &bytes)?;
            match object {
                Object::Elf(elf) => {
                    println!("elf: {:#?}", &elf);
                },
                Object::PE(pe) => {
                    println!("pe: {:#?}", &pe);
                },
                Object::Mach(mach) => {
                    println!("mach: {:#?}", &mach);
                },
                Object::Archive(archive) => {
                    println!("archive: {:#?}", &archive);
                },
            }
        }
    }

goblin::parse would determine the filetype from the magic parameter.

m4b commented 6 years ago

What if you pass an invalid magic? It will have either error with bad magic (same problem I mentioned above) or return variant of Unknown, or still have signature like Result<Option<Object>>; which at that point may as well push all the fd + peek logic into a single function

sunfishcode commented 6 years ago

I'm suggesting peek return an Option. With invalid magic, peek would return None, and in that case you wouldn't call parse at all.

sunfishcode commented 6 years ago
     fn peek(fd: &mut File) -> Result<Option<Hint>>;
     fn parse(hint: Hint, bytes: &[u8]) -> Result<Object>;
m4b commented 6 years ago

Right, but what would the internal logic of parse(magic: usize, bytes: &[u8]) look like?

It has to still check the magic is valid - either it will return Option, Result with an Unknown variant, or panic, no?

m4b commented 6 years ago

Oh, I see what you're saying, I thought the peek was a usize:

     fn parse(hint: Hint, bytes: &[u8]) -> Result<Object> {
       match hint {
         Elf => // parse elf,
         Mach => // parse mach,
         PE => // parse pe,
         Archive => // parse pe,
         // exhausted
       }
}
m4b commented 6 years ago

Yea I like that idea; it makes using goblin from raw bytes slightly more complicated though.

Using your proposal I think I like this api:

     fn peek(fd: &mut File) -> Result<Option<Hint>>;
     fn parse_with(hint: Hint, bytes: &[u8]) -> Result<Object>;
     fn parse(fd: &mut File, bytes: &mut [u8]) -> Result<Option<Object>>;

For clients who already have the bytes but no fd it sort of makes goblin a pita to use though. I'm not sure how common this would be though. probably could be very common.

willglynn commented 6 years ago

One thought in reply to this:

if we peek, the if condition guarantees us that the Unknown variant cannot occur, hence its unreachable. But this is runtime information (hence the semi flippant dependent types remark)

There is no such guarantee. Files can change between reads, and mmap'd files can even change mid-read. Or, plain old user error: the user can pass one buffer into peek() and another into parse().

Even if you peek() successfully, parse() has to handle the invalid magic case.

m4b commented 6 years ago

@willglynn yea that's a good point.

However @sunfishcode 's option proposal I think somewhat sidesteps this issue.

It forces the user to pass a peek. They either got the peek from somewhere or fabricated it.

In either case, the parse will error; it'll interpret the bad bytes as an elf binary and will still return a scroll BadMagic error when attempting to parse the elf header.

The problem of the file changing, mmap bytes getting messed up is still a problem for current lib - it will just return BadMagic for elf if the header doesn't match and you try to parse it.

The advantage is that making peek an option is that it removes the unreachable case, or at least, moves it into the separate binary parsing backend errors.

willglynn commented 6 years ago

@m4b Sure. I just wanted to tamp down expectations. A peek() that finds a valid magic makes it highly likely that parse() will see the same thing, it just doesn't guarantee anything. parse()'s body and its API needs to accommodate that.

m4b commented 6 years ago

@willglynn yea i probably overstated things when I said it's guaranteed :laughing:

As a bonus I think I can also have my angry double ?? version using the option peek as well:

// or perhaps vexing_parse for our c++ brethren
pub fn angry_parse(bytes: &[u8]) -> Result<Result<Object, goblin::Error>, UnknownMagic> {
  match Hint::peek(bytes) {
    Some(hint) => // parse accordingly
    None => Err(UnknownMagic::from(bytes))
  }
}
m4b commented 5 years ago

I think my last suggestion is actually a good idea, or dan's; I think pre-release 1.0 one of them should be chosen.