Support switching different encodings in the Strings tab

tompazourek commented 3 years ago

Just discovered this tool, and I have to say that I'm surprised how good and easy-to-use it is.

I really like the "Strings" tab feature, and I think it would be great if it supported multiple encodings. Maybe there could be a dropdown (similar to the one in the "Decoded values" tab that allows switching between LE and BE).

I have some DLLs open, and they contain some ASCII-encoded strings, but also some UTF-16LE ones. It would be great if rehex could detect those too. Just an idea. Maybe it doesn't need to support every encoding in the world. For me personally, just the ability to switch between ASCII, UTF-8, UTF-16LE and UTF-16BE would be more than enough.
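To illustrate the request, here is a minimal sketch of scanning a buffer for both ASCII and UTF-16LE strings. It is not rehex's implementation. The function name `find_strings`, the `MIN_CHARS` threshold and the restriction to printable ASCII characters are all assumptions made for the example.

```python
import re

MIN_CHARS = 4  # assumed minimum string length (hypothetical default)

# Printable ASCII is the byte range 0x20..0x7E.
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_CHARS)
# The same characters in UTF-16LE: each printable byte followed by a zero byte.
UTF16LE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_CHARS)


def find_strings(data: bytes):
    """Return sorted (offset, encoding, text) tuples for strings found in data."""
    hits = []
    for m in ASCII_RE.finditer(data):
        hits.append((m.start(), "ASCII", m.group().decode("ascii")))
    for m in UTF16LE_RE.finditer(data):
        hits.append((m.start(), "UTF-16LE", m.group().decode("utf-16-le")))
    return sorted(hits)
```

This only finds UTF-16LE text made of ASCII-range characters, which is the common case for strings in Windows DLLs but not a general Unicode scan.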

Thanks for the great work on this!

EDIT: Just found that this is probably a duplicate of #10...

solemnwarning commented 3 years ago

So, on the encoding branch, I've added a "Text encoding" option to the context menu alongside the "Data type" one. These aren't mutually exclusive - so you can set a range of bytes as (e.g.) "x86 machine code", and also independently set the text encoding used for decoding the text column on the right.

This is probably not very helpful - there are some architectures/eras where mixing of code and data is apparently common, but even on such things an unbroken block of data/code wouldn't disassemble properly past the first data blob (unless the architecture happens to have fixed-length instructions and the data blob preserves alignment, or the data is followed by a big wall of nops or something).

I'm leaning towards taking the "Text encoding" option back out and making encoding a "Data type" instead (e.g. "UTF-16 encoded text"), so that in disassemblies with mixed code/data the text ranges can be marked as such.

I'm not yet sure how best to integrate it with the "Strings" tab. If the Strings tab relies on the encodings defined in the file, it's arguably a bit useless, since you already had to know where the text is to have marked it as such. OTOH, if it had its own encoding selection, then it isn't going to know how to handle mixed-encoding files, and won't make use of the encoding annotations already added.

Finally, I'm not sure how the "Strings" tab should detect what actually is a string in this scary new international world - right now it just looks for sequences of "printable" ASCII characters, but what does "printable" mean in Unicode? Printable ASCII + any 8-bit character? I'm pretty sure Unicode has control characters in it too.
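One possible answer to the "printable" question, sketched under stated assumptions: classify each decoded code point by its Unicode General Category and reject the "Other" group, which covers the C0 and C1 control characters (including U+0080 to U+009F, so "any 8-bit character" would be too broad). This is not what rehex does. The names `is_printable` and `printable_runs`, the category choices and the `min_chars` default are hypothetical.

```python
import unicodedata


def is_printable(ch: str) -> bool:
    """Assumed definition: printable unless the General Category is in the
    "Other" group (Cc, Cf, Cs, Co, Cn) or is a line/paragraph separator."""
    cat = unicodedata.category(ch)
    return not (cat.startswith("C") or cat in ("Zl", "Zp"))


def printable_runs(data: bytes, encoding: str, min_chars: int = 4):
    """Decode data and return runs of at least min_chars printable characters."""
    text = data.decode(encoding, errors="replace")
    runs, current = [], []
    # A trailing NUL acts as a sentinel so the final run is flushed.
    for ch in text + "\x00":
        # U+FFFD marks bytes that failed to decode, so it also ends a run.
        if is_printable(ch) and ch != "\ufffd":
            current.append(ch)
        else:
            if len(current) >= min_chars:
                runs.append("".join(current))
            current = []
    return runs
```

The sketch returns only the text, not byte offsets, and it decodes from the start of the buffer, so a UTF-16 string at an odd offset would be missed unless the scan is repeated with a one-byte shift.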

If anyone has opinions on the above, now's the time to voice them!

solemnwarning commented 2 years ago

Just a preview of how the new strings tab is looking (currently on strings-encoding branch):

[Screenshot: strings]

tompazourek commented 2 years ago

@solemnwarning This is looking really great. Thank you!

solemnwarning commented 2 years ago

Merged to master

solemnwarning / rehex

Support switching different encodings in the Strings tab #106