Could parsing the entire table at once have some perf consequences? But maybe not; the majority of symbols might be looked up initially when we construct the dyn sym table + the symbol table, I'm not sure. Do you have any insight into this aspect?
Use cases which only request a few strings from a huge table will suffer, but even if you request every possible entry only once and there is overlap (shared suffixes), this already provides gains. Obviously, it would improve even further if an entry is requested multiple times.
Lastly, I'll have to reread your explanation of why doing an initial parse avoids the O(n^2) runtime when doing gets, because it's not clear to me how that would happen.
The worst case is that the `Strtab` is made from one big string (no separators inside), and there are symbol name refs to each possible offset of the `Strtab`. Then each `get` request costs O(n) because it has to parse half of this one string on average. And there can be O(n) different `get` requests, even with caching, as each offset is unique (even if they share the same suffixes). So in total you get O(n^2) for all `get` requests.
However, if the `Strtab` is parsed once (costing O(n)) and each `get` request then only does a binary search, it costs just O(log(n)) per request, or O(n*log(n)) in total, which is a lot better. Now the worst case is that the `Strtab` consists only of separators, but even that does not change the efficiency of the binary search or the resulting complexity class.
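To make that concrete, here is a minimal sketch of the idea (illustrative names, not goblin's actual API): the constructor does one O(n) pass caching every separator position, and each lookup becomes a binary search over that cache.

```rust
struct PreparsedStrtab<'a> {
    bytes: &'a [u8],
    // Byte offsets of every 0x00 separator, in ascending order.
    separators: Vec<usize>,
}

impl<'a> PreparsedStrtab<'a> {
    // One O(n) pass: record the position of every separator byte.
    fn parse(bytes: &'a [u8]) -> Self {
        let separators = bytes
            .iter()
            .enumerate()
            .filter_map(|(i, &b)| if b == 0 { Some(i) } else { None })
            .collect();
        PreparsedStrtab { bytes, separators }
    }

    // O(log n) per lookup: find the first separator at or after `offset`.
    fn get(&self, offset: usize) -> Option<&'a str> {
        if offset >= self.bytes.len() {
            return None;
        }
        let end = match self.separators.binary_search(&offset) {
            // Ok(i): the offset itself is a separator (empty string);
            // Err(i): separators[i] is the first separator past the offset.
            Ok(i) | Err(i) => *self.separators.get(i)?,
        };
        core::str::from_utf8(&self.bytes[offset..end]).ok()
    }
}
```

With this, n lookups cost O(n*log(n)) in total instead of O(n^2), no matter how much the referenced strings overlap.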
Do you happen to have an ELF binary you can share that exhibits this quadratic behavior?
We have multiple and I wouldn't want to publish them before this is resolved.
Do tools like `readelf`, `objdump`, `llvm-readobj`, and the libraries they use also have this problem?
This reminds me of `pdb::ItemFinder`. The situation there is similar: there's a data structure containing variable-length items, which might be ill-formed, to which the user needs rapid access by index.

`pdb::ItemFinder` is to be updated by the user from a (fallible) iterator which parses one item at a time. The associated data structures form a directed acyclic graph, so the most common situation is to scan and then to refer to items which have already been seen, but separating the iterator and the finder allows (forces) the user to choose whichever index-building and error handling strategy is best for their application.
Additionally, given `N` records, `pdb::ItemFinder` stores `N >> x` offsets for some small value of `x`. This reduces memory requirements by a factor of `2^x` in exchange for iterating over up to `2^x - 1` records to retrieve a particular item. Iteration is so cheap (the data's usually in the same cache line anyway) that real code using this structure is actually faster due to the reduced recordkeeping overhead. Neither `pdb` nor `goblin` yet depend on a version of Rust with const generics, so it's not easy to let the user parameterize this tradeoff, but my suspicion is that a similar tradeoff for some small value of `x` would be worthwhile. Note also that this doesn't have to make retrieval fallible: if parsing a `&[u8]` worked before, it would be safe for `get()` to `.expect()` that parsing it will work again.
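If I follow, the scheme is roughly like this sketch (hypothetical names and layout, not `pdb`'s real API): keep only every `2^x`-th offset, and re-iterate at most `2^x - 1` records on retrieval.

```rust
const SHIFT: usize = 3; // x = 3: keep 1/8th of the offsets

#[derive(Default)]
struct SampledIndex {
    // offsets[k] is the byte offset of item number k << SHIFT.
    offsets: Vec<usize>,
}

impl SampledIndex {
    // Called once per item while the (fallible) iterator advances.
    fn push(&mut self, item_index: usize, byte_offset: usize) {
        if item_index & ((1 << SHIFT) - 1) == 0 {
            self.offsets.push(byte_offset);
        }
    }

    // Returns the nearest stored offset at or before `item_index`,
    // plus how many items still have to be skipped by iteration.
    fn locate(&self, item_index: usize) -> Option<(usize, usize)> {
        let base = *self.offsets.get(item_index >> SHIFT)?;
        Some((base, item_index & ((1 << SHIFT) - 1)))
    }
}
```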
> Do tools like `readelf`, `objdump`, `llvm-readobj` and the libraries they use also have this problem?
@philipc That is a good question, but I will have to craft different ELF files, because the current ones are not completely valid in unrelated areas, which makes these tools return early with a warning.
@m4b I added two unit tests for UTF-8 parsing and the CI is passing, can you review it again?
> We have multiple and I wouldn't want to publish them before this is resolved.
Could you clarify what you mean here? I.e., why would resolution of this issue enable you to publish / share the problematic binaries? It seems to me if you can't share them now (which is fine), you wouldn't be able to share them after, regardless of whether this issue is resolved.
We would simply like to keep using the upstream repo, not a fork with our security patches applied. That is all. Once this is merged we can deploy a new version of our software packages and release the exploits, as they would be disarmed by then. By release I mean first to you specifically and later to the general public, as we would want other projects depending on this one to have a chance to adopt the fix as well.
Ok, so other than `to_vec` being a breaking change, which, if you feel strongly about it, I guess I'm ok with, the only other major concern I have is whether we're pessimizing 99% of binaries' strtables for what seems like a very niche problem case.
Well, we have no data on what the "usual" user of this crate is doing on average, so it is hard to optimize for. But, as I said, this only penalizes use cases where only a few strings are requested once; all other access patterns would not be hit at all, or would even profit from the pre-parsing.
As far as I understand it, the only security implication I see is a kind of DoS when accessing strtables of these malformed binaries, but it's not clear to me how that is impactful, other than a tool taking a long time to run? It would be good to get some further motivation here.
I agree that many CLI tools, which some developer runs and cancels if they take too long, are not affected by this. However, think about automated processes like CI services, etc. Furthermore, we run a decentralized network which validates ELF files on the fly, so it could easily grind to a halt if somebody wanted it to, which is not acceptable.
I'm not sure why, but I almost wish we had a variant constructor for `Strtab` that uses the pre-allocated form via a config we'd pass into parsing, and defaults to a non-preallocated form otherwise; but this is a much bigger PR. However, it might be worth doing, because a config for parsing options is something I've thought would be nice for a while.
with that in mind, i don't see anything in this PR that would prevent such a feature from landing in the future?
Example:

```rust
enum StrtabKind<'a> { Preallocated(Vec<&'a str>), ZeroCopy }
```
Might not be so complicated: why not just default back to the original implementation of `get` in `get_at` if the `strings` `Vec` is empty, because it was constructed by `from_slice_unsafe` and not by `parse`?
That being said, maybe `from_slice_unparsed` would be a better name than `from_slice_unsafe`...
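A minimal sketch of that fallback, written as a free function with hypothetical parameters (the real method would live on `Strtab`):

```rust
// If the separator cache is empty because the table was built by the
// unparsed constructor, fall back to the original linear scan;
// otherwise use the binary search over cached separator positions.
fn get_at<'a>(bytes: &'a [u8], separators: &[usize], offset: usize) -> Option<&'a str> {
    if offset >= bytes.len() {
        return None;
    }
    let end = if separators.is_empty() {
        // Original behavior: O(n) forward scan for the terminator.
        offset + bytes[offset..].iter().position(|&b| b == 0)?
    } else {
        // Pre-parsed path: O(log n) binary search.
        match separators.binary_search(&offset) {
            Ok(i) | Err(i) => *separators.get(i)?,
        }
    };
    core::str::from_utf8(&bytes[offset..end]).ok()
}
```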
I made the changes you requested, could you rerun the CI tests once more?
I believe the CI has run on your last change, "Reverts breaking to_vec() result and adds a fallback mode."; there is a green check mark next to it, at least :)
Anyway, thank you for your patience! I think this is ready to merge; perhaps we can do the enum variant I talked about earlier, but it's fine for now, I believe. We can do a minor crates release as well, since we don't (I believe!) have any breaking changes.
Thanks again for everyone's input here!
Thank you as well.
Once you release the minor crate version I will send you the exploits via email to m4b.github.io a gmail.com
Ok, I've published 0.4.2; do note that I caught a breaking change in the `new` function, whose return signature had changed. I restored the original `new` and added a `new_preparsed` that has the `Result` return type, so that a minor release should be non-breaking now. So if you had a branch using these changes, you'll just have to rename `Strtab::new` -> `Strtab::new_preparsed`.
Thanks for the PR!
Together with @jackcmay I am working on a BPF VM, which utilizes goblin for loading user-provided ELF files. So, last week we encountered a problem with the way `Strtab::get()` works, tried to fix it from outside the crate (which did not work), and have now found a fix which requires upstream changes. Furthermore, I think these changes can also help other projects depending on goblin that are exposed to potentially hand-crafted adversarial / worst-case inputs.

### The problem

A `get` request performs a linear scan to find the separators / NUL termination, and possibly invalid UTF-8 variable-length encodings, on every single request. So it is possible to construct e.g. an ELF file which is only a few MiB in size but has massive overlap in the `Strtab`. This way everything involving the `Strtab` suddenly becomes O(n^2) in runtime complexity because of the overlap (as opposed to independent strings, whose parsing cost can be amortized to constant length per string), and just reading the file can take minutes to hours.
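To illustrate the shape of such an input, here is a hypothetical worst-case construction (an abstract sketch, not one of the actual files):

```rust
// n non-separator bytes, a single NUL at the end, and one string
// reference per byte offset. A linear-scan get() walks all the way to
// the single terminator each time, so n lookups cost O(n^2) in total,
// while the file itself stays small.
fn build_worst_case_strtab(n: usize) -> (Vec<u8>, Vec<usize>) {
    let mut table = vec![b'A'; n];
    table.push(0); // the only terminator
    let offsets: Vec<usize> = (0..n).collect(); // every offset is referenced
    (table, offsets)
}
```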
### What did not work

We tried to fix it on our side by caching the results of `Strtab::get()`. But since every referenced offset is distinct, each first-time `get` still has to parse half of the `Strtab` on average. So still O(n^2) runtime complexity in total.
### What does work

The proposed fix in this PR instead moves the parsing from the `get` request to the constructor of the `Strtab`, which in turn also moves the error handling with it. The trick is to use a binary search (in each `get` request) over the cached positions of the separators (found by one linear scan in the constructor). This is only done when dynamic (heap) allocations are enabled via `#[cfg(feature = "alloc")]`; otherwise we default to the original behavior.
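A minimal sketch of what that gating could look like (hypothetical field layout, not the PR's exact code):

```rust
// The separator cache only exists when heap allocation is available;
// without the "alloc" feature, get() keeps the original linear scan.
#[cfg(feature = "alloc")]
extern crate alloc;

pub struct Strtab<'a> {
    bytes: &'a [u8],
    delim: u8,
    // Cached separator offsets, filled by one scan in the constructor.
    #[cfg(feature = "alloc")]
    separators: alloc::vec::Vec<usize>,
}
```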
### Missing

I did not address or test UTF-8 support yet, so it is probably broken. I will invest more time into it if this PR looks otherwise promising.