m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust
MIT License
1.2k stars 160 forks source link

elf.parse: added lazy_parse function #254

Closed jessehui closed 3 years ago

jessehui commented 3 years ago

We are developing a libOS kernel (https://github.com/occlum/occlum) and would like to load ELF as lazy as possible. However, now goblin does all the read and parse in memory so we must read the whole ELF file to kernel memory (we can't mmap the file to memory due to some constraint).

Thus, I came up with this method to load lazily: read ELF header only -> parse header -> get program header table -> only read in PT_LOAD and PT_INTERP segments

And by adding these two functions, I can achieve this. The implementation can be found here. I would like to see this get merged to master so that people with similar requirements can do this. But I am not sure if it is a common need. If it is, I can spend some time to reduce the repeated code.

Comments are welcome. Thank you.

m4b commented 3 years ago

Cool! Yea this seems like a great compromise to get a lazy parsing approach, I like it! Reducing some of the boilerplate into common function would be good, as you said, but i understand that this is just an initial sketch/PR?

What api were you thinking for filling in the rest of the structures, e.g., section headers, etc.?

And thanks for the PR! :)

jessehui commented 3 years ago

Thank you for your reply. Glad you are interested. I push a new version to reduce the repeated code, make API more general and add a unit test which can also be an example.

What api were you thinking for filling in the rest of the structures, e.g., section headers, etc.?

With goblin's well-designed APIs for each part (program headers, section headers ...), user can read in the contents they need and parse the specific part with corresponding parse function. And then, they can put it to the dummy Elf struct returned from lazy_parse. I think the unit test can be seen as an example (here). We will do the parsing similar for the libOS. Just we only care about PT_INTERP and PT_LOAD segments.

You may start the review. Thank you!

vfsfitvnm commented 3 years ago

I second this. In my specific use case, I just need to parse the elf file header in order to get e_ident[EI_CLASS] and e_machine. It would be a lot nicer if there was a straightforward way to do that (https://github.com/m4b/goblin/pull/254/files#diff-696bbf0d8b3f75177e0fa931a45c3e85057e2078598ee1cf8040393730b30633R206), instead of include scroll as a direct dependency.

jessehui commented 3 years ago

It seems that I can't fill in all parts without gnu_hash_len and/or hash_len... However, I think if users would like to go this far, maybe it is better to call parse instead of lazy_parse as it will need a lot more effort to parse almost everything. But for users who just need some specific sections or segments, the lazy_parse can bring much convenience. I can write an example to extract the interpreter segments lazily (without reading the whole elf file). What do you think?

m4b commented 3 years ago

Example getting interpreter would be great! Re gnu hash, let's leave it private for now, that way we don't commit to an api; if someone needs this functionality, we can always make a PR/take a PR and review how best to expose it.

                let mut num_syms = if let Some(gnu_hash) = dyn_info.gnu_hash {
                    gnu_hash_len(bytes, gnu_hash as usize, ctx)?
                } else if let Some(hash) = dyn_info.hash {
                    hash_len(bytes, hash as usize, header.e_machine, ctx)?
                } else {
                    0
                };
                let max_reloc_sym = dynrelas.iter()
                    .chain(dynrels.iter())
                    .chain(pltrelocs.iter())
                    .fold(0, |num, reloc| cmp::max(num, reloc.r_sym));
                if max_reloc_sym != 0 {
                    num_syms = cmp::max(num_syms, max_reloc_sym + 1);
                }
                dynsyms = Symtab::parse(bytes, dyn_info.symtab, num_syms, ctx)?;

as far as i can see, this is only code that uses it; we could just a a helper function that does all this and takes whichever minimal arguments requierd, seems like dyn_info, ctx, bytes, and dyn_relas and header :shrug:

For now i think it's better to leave private, as someone can always do as you say and parse whole thing :)

m4b commented 3 years ago

I can add this change onto the 0.3 branch and do (another :D ) release, 0.3.4 if you like?

jessehui commented 3 years ago

@m4b Thank you for your review. Yeah sure, I was just going to ask whether there will be a recent release with this feature that can be used directly from crates.io?

m4b commented 3 years ago

You got it, 0.3.4 is out!

jessehui commented 3 years ago

Great! Thank you very much!