pan-unit42 / dotnetfile

US Strings Wrong Encoding? #13

Open cauliflowerdoughnuts opened 5 months ago

cauliflowerdoughnuts commented 5 months ago

When extracting US strings, the encoding does not match the expected output. File: https://www.virustotal.com/gui/file/ab9cd59d789e6c7841b9d28689743e700d492b5fae1606f184889cc7e6acadcc

dotnetfile output:

ñÜY†}†}ÛTÜ×Ûh†oÛ©Ùh†²ÛÊÜ@ÌpÜ4†4†¯Üò†ÖÜúÛ    Ù
e6196fd98b57
íۜ†ÚÛÇ]Û܄ÜÂÙÜÛ@,Üt†£Û†›ÜՆÕÛÙ    ÜãÜ
d4bd11ffd15f9756710a
Ãܵ†ßÛ ÜÍÜüÙ§ÙÙZÛëÛ$†ÖٗÙÁ†­ÙÃ'†è×ÛÒٕÜ÷Û)Û²ÛÙÜ
89cdbd2d

dnlib output:

ﱉۨې﯐ﳣ﮿ڈﮖﶴڈﯪﱲݼݸﲗ۴۴ﲼۍﳬﯞﴅ
e6196fd98b57
ﭕ؄ﯾݨ﮻ﰓﰤﵢﯼݼﱫڣﮱ؀ﰻۯﯯ﴿ﰅﱆ
d4bd11ffd15f9756710a
ﱦڠﭙﱀﱵ﷜ﶵﴮﯩﭓٛ﷬ﴈ٥﷊ݦٽݔ﮿﷭ﰵﯡﭝﯪﴀﰙ
89cdbd2d

dnSpy output:

<Module>.smethod_0("ﱉۨې﯐ﳣ﮿ڈﮖﶴڈﯪﱲݼݸﲗ۴۴ﲼۍﳬﯞﴅ");
<Module>.smethod_1("e6196fd98b57");
<Module>.smethod_0("ﭕ؄ﯾݨ﮻ﰓﰤﵢﯼݼﱫڣﮱ؀ﰻۯﯯ﴿ﰅﱆ");
<Module>.smethod_1("d4bd11ffd15f9756710a");
<Module>.smethod_0("ﱦڠﭙﱀﱵ﷜ﶵﴮﯩﭓٛ﷬ﴈ٥﷊ݦٽݔ﮿﷭ﰵﯡﭝﯪﴀﰙ");
<Module>.smethod_1("89cdbd2d");

Washi1337 commented 5 months ago

Minimal repro: UnicodeTest.zip

from dotnetfile import DotNetPE

file = DotNetPE("UnicodeTest.exe")
print(file.get_user_string(1))

This seems to happen because parse_us_stream can end up decoding a string as UTF-8 instead of UTF-16LE: it uses get_reasonable_display_string_for_bytes to decode the next string during the linear sweep of the stream.

https://github.com/pan-unit42/dotnetfile/blob/c7ce2c58657ebfca44a27bd464e093af427301d8/dotnetfile/parser.py#L688

get_reasonable_display_string_for_bytes uses convert_to_unicode, which attempts to infer the encoding of a string by checking whether the first 8 bytes look like a widened ASCII string.

https://github.com/pan-unit42/dotnetfile/blob/c7ce2c58657ebfca44a27bd464e093af427301d8/dotnetfile/util.py#L35-L40

This heuristic does not work if the string starts with only non-ASCII characters, such as the one in the example binary.
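
To illustrate the failure mode, here is a minimal sketch of a "widened ASCII" check. Note that looks_like_widened_ascii is a hypothetical stand-in written for this comment, not the actual body of convert_to_unicode, and the non-ASCII sample is just a handful of arbitrary code points rather than the exact string from the binary:

```python
def looks_like_widened_ascii(data: bytes, probe: int = 8) -> bool:
    """Hypothetical approximation of the heuristic described above.

    Returns True only if the first `probe` bytes are ASCII characters,
    each followed by a 0x00 byte (i.e. ASCII text stored as UTF-16LE).
    """
    head = data[:probe]
    if len(head) < 2 or len(head) % 2:
        return False
    return all(head[i] < 0x80 and head[i + 1] == 0 for i in range(0, len(head), 2))


# An ASCII-only user string, as it would be stored in UTF-16LE.
ascii_us = "e6196fd98b57".encode("utf-16-le")

# Arbitrary non-ASCII code points (assumption: chosen for illustration only).
non_ascii_text = "\ufc49\u06e8\u06d0\ufbd0"
non_ascii_us = non_ascii_text.encode("utf-16-le")

print(looks_like_widened_ascii(ascii_us))      # True
print(looks_like_widened_ascii(non_ascii_us))  # False

# Both byte strings are nevertheless valid UTF-16LE and round-trip cleanly.
print(non_ascii_us.decode("utf-16-le") == non_ascii_text)  # True
```

Under these assumptions the check accepts the hex-looking strings but rejects the non-ASCII ones, even though both are perfectly valid UTF-16LE, which would match the mixed output OP is seeing.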

US strings are always encoded as UTF-16 (2 bytes per code unit) as per the specification, and the runtime has assumed so since the inception of .NET. I'd say the call to get_reasonable_display_string_for_bytes can just be replaced with a direct UTF-16LE decoding call. The "downside", of course, is that you may get encrypted strings like the ones OP found using dnSpy, but in my opinion this is a feature and not a bug.
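
For reference, a direct decode could look roughly like the sketch below. It is not a patch against parser.py: read_compressed_uint and read_user_string are hypothetical helper names, and the layout (compressed length prefix, UTF-16LE payload, one trailing flag byte) follows my reading of the #US heap description in ECMA-335. I picked errors="replace" so that obfuscated strings with unpaired surrogates do not raise; that choice is an assumption, not something the library does today.

```python
from typing import Tuple


def read_compressed_uint(data: bytes, offset: int) -> Tuple[int, int]:
    """Decode an ECMA-335 compressed unsigned integer.

    Returns (value, number_of_bytes_consumed).
    """
    first = data[offset]
    if first & 0x80 == 0:        # 0xxxxxxx: 1-byte form
        return first, 1
    if first & 0xC0 == 0x80:     # 10xxxxxx: 2-byte form
        return ((first & 0x3F) << 8) | data[offset + 1], 2
    if first & 0xE0 == 0xC0:     # 110xxxxx: 4-byte form
        value = (
            ((first & 0x1F) << 24)
            | (data[offset + 1] << 16)
            | (data[offset + 2] << 8)
            | data[offset + 3]
        )
        return value, 4
    raise ValueError("invalid compressed integer prefix")


def read_user_string(us_heap: bytes, offset: int) -> str:
    """Read one #US entry at `offset` and always decode it as UTF-16LE."""
    length, size = read_compressed_uint(us_heap, offset)
    if length == 0:
        return ""
    start = offset + size
    # The stored length includes one trailing flag byte, which is not text.
    raw = us_heap[start:start + length - 1]
    return raw.decode("utf-16-le", errors="replace")


# Tiny hand-built heap: the empty entry at offset 0, then one non-ASCII string.
text = "\ufc49\u06e8"
payload = text.encode("utf-16-le") + b"\x01"
heap = b"\x00" + bytes([len(payload)]) + payload

print(read_user_string(heap, 1) == text)  # True
```

No guessing is involved here, so a string made up entirely of non-ASCII characters comes out the same way dnlib and dnSpy show it.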