m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust
MIT License
1.19k stars 158 forks source link

Parse Strtab only once on Creation #275

Closed Lichtso closed 3 years ago

Lichtso commented 3 years ago

Together with @jackcmay I am working on a BPF VM, which utilizes goblin for loading user provided ELF files.

So, last week we encountered a problem with the way Strtab::get() works, tried to fix it outside, which did not work and now have found a fix which would require upstream changes. Furthermore, I think these changes can also help other projects depending on goblin which are exposed to potentially hand-crafted adversary / worst case inputs too.

The problem

A get request performs a linear scan to find the separators / NULL termination and possibly invalid UTF-8 variable length encodings on every single request. So it is possible to construct e.g. an ELF file which is only a few MiB in size but has massive overlap in the Strtab. This way everything involving the Strtab suddenly becomes O(n^2) in runtime complexity because of the overlap (as opposed to independent strings which can be amorphized to be constant length) and just reading the file can take minutes to hours.

What did not work

We tried to fix it on our side by caching the results of Strtab::get(). But

What does work

The proposed fix in this PR instead moves the parsing from the get request to the constructor of the Strtab, which in turn also moves the error handling with it. The trick is to use a binary search (in each get request) over the cached positions of the separators (found by one linear scan in the constructor). This is only done when dynamic (heap) allocations are enabled via #[cfg(feature = "alloc")], otherwise we default to the original behavior.

Missing

I did not address or test UTF-8 support yet, so it is probably broken. I will invest more time into it if this PR looks otherwise promising.

Lichtso commented 3 years ago

Could parsing the entire table at once have some perf consequences? But maybe not, majority of symbols might be looked up initially when we construct the dyn sym table + the symbol table, i'm not sure; do you have any insight into this aspect?

Uses cases which only request a few strings from a huge table will suffer, but even if you request every possible entry only once and there is overlap (shared suffices) then this already provides gains. Obviously, it would improve even further if an entry is requested multiple times.

lasty, i'll have to reread your explanation why doing an initial parse avoids the o(n^2) runtime when doing gets, because it's not clear to me how that would happen

The worst case is that the Strtab is made from one big string (no separators inside). And there are symbol name refs to each possible offset of the Strtab. Then each get request costs O(n) because it has to parse half of this one string on average. And there can be O(n) different get requests, even with caching as each offset is unique (even if the share the same suffixes). So it total you get O(n^2) for all get requests.

However, if the Strtab is parsed once (costing O(n)), and then each get request only does a binary search it costs just O(log(n)) per request or O(n*log(n)) in total, which is a lot better. Now the worst case is that the Strtab consists only of separators, but even that does not change the efficiency of the binary search and the resulting complexity class.

do you happen to have an elf binary you can share that exhibits this quadratic behavior?

We have multiple and I wouldn't want to publish them before this is resolved.

philipc commented 3 years ago

Do tools like readelf, objdump, llvm-readobj and the libraries they use also have this problem?

willglynn commented 3 years ago

This reminds me of pdb::ItemFinder. The situation there is similar: there's a data structure containing variable length items, which might be ill-formed, to which the user needs rapid access by index.

pdb::ItemFinder is to be updated by the user from a (fallible) iterator which parses one item at a time. The associated data structures form a directed acyclic graph, so the most common situation is to scan and then to refer to items which have already been seen, but separating the iterator and the finder allows (forces) the user to choose whichever index-building and error handling strategy is best for their application.

Additionally, given N records, pdb::ItemFinder stores N >> x offsets for some small value of x. This reduces memory requirements by a factor of 2^x in exchange for iterating over up to 2^x - 1 records to retrieve a particular item. Iteration is so cheap – the data's usually in the same cache line anyway – that real code using this structure is actually faster due to the reduced recordkeeping overhead. Neither pdb nor goblin yet depend on a version of Rust with const generics so it's not easy to let the user parameterize this tradeoff, but my suspicion is that a similar tradeoff for some small value of x would be worthwhile. Note also that this doesn't have to make retrieval fallible: if parsing a &[u8] worked before, it would be safe for get() to .expect() that parsing it will work again.

Lichtso commented 3 years ago

Do tools like readelf, objdump, llvm-readobj and the libraries they use also have this problem?

@philipc That is a good question, but I will have to craft different ELF files because the current ones are not completely valid in unrelated areas and it makes these tools return early with a warning.

@m4b I added two unit tests for UTF-8 parsing and the CI is passing, can you review it again?

m4b commented 3 years ago

We have multiple and I wouldn't want to publish them before this is resolved.

Could you clarify what you mean here? I.e., why would resolution of this issue enable you to publish / share the problematic binaries? It seems to me if you can't share them now (which is fine), you wouldn't be able to share them after, regardless of whether this issue is resolved.

Lichtso commented 3 years ago

We have multiple and I wouldn't want to publish them before this is resolved.

Could you clarify what you mean here? I.e., why would resolution of this issue enable you to publish / share the problematic binaries? It seems to me if you can't share them now (which is fine), you wouldn't be able to share them after, regardless of whether this issue is resolved.

We would simply like to keep using the upstream repo, not a fork with our security patches applied. That is all. Once this is merged we can deploy a new version of our software packages and release the exploits as they would be disarmed by the time. By release I mean first to you specifically and later to the general public, as we would want other projects depending on this one to have a change to adopt as well.

Lichtso commented 3 years ago

ok so other than to_vec being a breaking change, which if you feel strongly about, i guess i'm ok with it, the only other major concern i have is whether we're pessimizing 99% of binary's strtables in what seems like a very niche problem case.

Well, we have no data what the "usual" user of this crate is doing on average, so it is hard to optimize for. But, as I said, this is only penalizing use cases where only a few symbols are requested once. All others such as:

Would not be hit at all or even profit from the pre-parsing.

as far as i understand it, the only security implications I see are a kind of DoS on accessing strtables for these malformed binaries, but it's not clear to me how that is impactful, other than a tool taking a long time to run? it would be good to get some further motivations here.

I agree that many CLI tools some developer runs and cancels, if they take to long, are not affected by this. However, think about the automated processes like CI services, etc. Furthermore, we run a decentralized network which validates ELF files on the fly, so it would easily grind to halt if somebody wanted it to, which is not acceptable.

I'm not sure why, but I almost wish we had a variant constructor for Strtab that uses the pre-allocated form via a config we'd pass into parsing, and defaults to a non-preallocated form otherwise, but this is a much bigger PR. However, it might be worth doing, because a config for parsing options is something i've thought would be nice for a while.

with that in mind, i don't see anything in this PR that would prevent such a feature from landing in the future?

example:


enum StrtabKind {
  Preallocated(Vec<&str>),
  ZeroCopy
}

Might not be so complicated, why not just default back to the original implementation of get in get_at if the strings Vec is empty, because it was constructed by from_slice_unsafe not by parse?

That being said, maybe from_slice_unparsed would be a better name than from_slice_unsafe ...

Lichtso commented 3 years ago

I made the changes you requested, could you rerun the CI tests once more?

m4b commented 3 years ago

I made the changes you requested, could you rerun the CI tests once more?

I believe the CI has run on your last change Reverts breaking to_vec() result and adds a fallback mode., there is a green check mark at least next to it :)

Anyway, thank you for your patience! I think this is ready to merge; perhaps we can do the enum variant I talked about earlier, but it's fine for now I believe; we can do a minor crates release as well, since we don't (i believe!) have any breaking changes.

Thanks again for everyone's input here!

Lichtso commented 3 years ago

Thank you as well.

Once you release the minor crate version I will send you the exploits via email to m4b.github.io a gmail.com

m4b commented 3 years ago

ok i've published 0.4.2; do note that i caught a breaking change in the new function whose return signature changed; i restored the original new, and added a new_preparsed that has the Result return type so that a minor release should be non-breaking now. So if you had a branch using these changes, you'll have to rename your Strtab::new -> Strtab::new_preparsed is all.

Thanks for the PR!