m4b / goblin

An impish, cross-platform binary parsing crate, written in Rust
MIT License
1.17k stars 156 forks source link

Return `CStr` in `goblin::strtab::Strtab` instead of `str` #345

Open koutheir opened 1 year ago

koutheir commented 1 year ago

There is no way to get access to null-terminated C strings that are not valid UTF-8, through goblin::strtab::Strtab. Strtab.get() and Strtab.get_at() should return Option<&CStr> instead of optional UTF-8 strs.

m4b commented 1 year ago

This would be a pretty enormous breaking change for goblin, and I'd hate to think of the ergonomic issues around dealing with strings from this Strtab after a change like this. Do you have a particular example where using rusts str doesn't work today (e.g., a string in a binary that isn't valid utf-8?) I assume this issue would have come up before if people were encountering it in the wild?

willglynn commented 1 year ago

@m4b I agree about the breakage/ergonomics concerns, but I also agree that those strings are just a bag of bytes with unknown encoding and a NUL terminator, i.e. CStr.

What would you think about adding get_cstr() and get_at_cstr(), implementing Strtab in terms of those, and moving UTF-8 validation into get()/get_at()?

m4b commented 1 year ago

get_cstr() sounds like a great compromise, did not think of that :)

koutheir commented 1 year ago

This would be a pretty enormous breaking change for goblin

Isn't this the reason behind goblin still in major version zero?

and I'd hate to think of the ergonomic issues around dealing with strings from this Strtab after a change like this

Transforming a &CStr into a &str is fairly easy, and it is honest about the reality of ELF string tables.

Do you have a particular example where using rusts str doesn't work today (e.g., a string in a binary that isn't valid utf-8?)

No, but I don't want to assume UTF-8 encoding, especially if I'm building tools around goblin.

What would you think about adding get_cstr() and get_at_cstr(), implementing Strtab in terms of those, and moving UTF-8 validation into get()/get_at()?

I would argue the opposite, because UTF-8 validation is an additional operation that should not be included in the default get(). Providing a more ergonomic, but also more expensive, get_str() seems better to me.

m4b commented 1 year ago

So for now I don't think I want to proceed with this; it will be too intrusive a change for everyone involved. I'm not convinced that switching everything to Cstr is correct without a compelling example where we fail today. As far as I can see, it should remain sound to return &str on all platforms binaries since they are either regular ansi encoded strings (and hence compatible with utf-8). While it may be true that something could emit a non-utf8 string, and this is used somehow by something (e.g., a function call as a sequence of bytes instead of a regular string) but I haven't seen any binaries in the wild yet that have this feature (and I suspect most binary tools just assume that strings in the string table are char* (hence utf-8 compatible) and make many assumptions about this) but I could be wrong. Regardless I think I would require a non manufactured/stilted example that is compelling in order to justify what on the face of it is a pretty user hostile/annoying change. The long of it is that people have been using this library for many years without complaining that the strings returned from string table aren't CStr, so I'm not sure why it would be important now, but this could perhaps be survivorship bias? Lastly, we could always add a get_at_cstr() option as suggested for those who want to opt into this kind of feature if it becomes urgent somehow.