Open cauliflowerdoughnuts opened 5 months ago
Minimal repro: UnicodeTest.zip
from dotnetfile import DotNetPE
file = DotNetPE("UnicodeTest.exe")
print(file.get_user_string(1))
This seems to be because parse_us_stream
may default to UTF-8 encoding as opposed to UTF-16LE because it uses get_reasonable_display_string_for_bytes
to decode the next string during the linear sweep of the stream.
get_reasonable_display_string_for_bytes
uses convert_to_unicode
, which attempts to infer the encoding of a string by checking whether the first 8 bytes look like a widened ASCII string.
This heuristic does not work if the string starts with only non-ASCII strings, such as the one in the example binary.
get_reasonable_display_string_for_bytes
can just be replaced with a direct unicode decoding call. The "downside" of course is that you may get encrypted strings like the ones found by OP using dnSpy, but in my opinion this is a feature and not a bug.
When extracting US strings, the encoding does not match the expected output. File: https://www.virustotal.com/gui/file/ab9cd59d789e6c7841b9d28689743e700d492b5fae1606f184889cc7e6acadcc
dotnetfile output:
dnlib output:
dnspy output: