Open koutheir opened 2 years ago
This would be a pretty enormous breaking change for goblin, and I'd hate to think of the ergonomic issues around dealing with strings from this Strtab after a change like this. Do you have a particular example where using rusts str doesn't work today (e.g., a string in a binary that isn't valid utf-8?) I assume this issue would have come up before if people were encountering it in the wild?
@m4b I agree about the breakage/ergonomics concerns, but I also agree that those strings are just a bag of bytes with unknown encoding and a NUL terminator, i.e. CStr
.
What would you think about adding get_cstr()
and get_at_cstr()
, implementing Strtab
in terms of those, and moving UTF-8 validation into get()
/get_at()
?
get_cstr()
sounds like a great compromise, did not think of that :)
This would be a pretty enormous breaking change for goblin
Isn't this the reason behind goblin still in major version zero?
and I'd hate to think of the ergonomic issues around dealing with strings from this Strtab after a change like this
Transforming a &CStr
into a &str
is fairly easy, and it is honest about the reality of ELF string tables.
Do you have a particular example where using rusts str doesn't work today (e.g., a string in a binary that isn't valid utf-8?)
No, but I don't want to assume UTF-8 encoding, especially if I'm building tools around goblin.
What would you think about adding
get_cstr()
andget_at_cstr()
, implementing Strtab in terms of those, and moving UTF-8 validation intoget()
/get_at()
?
I would argue the opposite, because UTF-8 validation is an additional operation that should not be included in the default get()
. Providing a more ergonomic, but also more expensive, get_str()
seems better to me.
So for now I don't think I want to proceed with this; it will be too intrusive a change for everyone involved. I'm not convinced that switching everything to Cstr is correct without a compelling example where we fail today.
As far as I can see, it should remain sound to return &str on all platforms binaries since they are either regular ansi encoded strings (and hence compatible with utf-8). While it may be true that something could emit a non-utf8 string, and this is used somehow by something (e.g., a function call as a sequence of bytes instead of a regular string) but I haven't seen any binaries in the wild yet that have this feature (and I suspect most binary tools just assume that strings in the string table are char* (hence utf-8 compatible) and make many assumptions about this) but I could be wrong.
Regardless I think I would require a non manufactured/stilted example that is compelling in order to justify what on the face of it is a pretty user hostile/annoying change.
The long of it is that people have been using this library for many years without complaining that the strings returned from string table aren't CStr, so I'm not sure why it would be important now, but this could perhaps be survivorship bias?
Lastly, we could always add a get_at_cstr()
option as suggested for those who want to opt into this kind of feature if it becomes urgent somehow.
There is no way to get access to null-terminated C strings that are not valid UTF-8, through
goblin::strtab::Strtab
.Strtab.get()
andStrtab.get_at()
should returnOption<&CStr>
instead of optional UTF-8str
s.