Registry code needs converting to the new object model

GoogleCodeExporter commented 9 years ago


When I run volatility hashdump -f mem.dump -y ... -s ... I get:

File "/home/tb/work/forensics/volatility-2.0/volatility/win32/hashdump.py", line 319, in dump_hashes
    lmhash.encode('hex'), nthash.encode('hex'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)

I use Volatile Systems Volatility Framework 2.0 under Ubuntu Linux.

The memory dump was created by FTK Imager Lite under Windows XP SP3 with Russian 
localization and Russian user names.

Original issue reported on code.google.com by olbioua%...@gtempaccount.com on 14 Nov 2011 at 2:46

Merged into: #521

GoogleCodeExporter commented 9 years ago

I can confirm this. Here's some more information. 

The error message is a little misleading:

File "/home/tb/work/forensics/volatility-2.0/volatility/win32/hashdump.py", line 319, in dump_hashes
    lmhash.encode('hex'), nthash.encode('hex'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)

The lmhash.encode('hex') and nthash.encode('hex') lines are just fine. It's 
really crashing on the first part of that statement, which isn't shown:

yield "{0}:{1}:{2}:{3}:::".format(get_user_name(user), int(str(user.Name), 16),
                                              lmhash.encode('hex'), nthash.encode('hex'))

The value returned by get_user_name(user) is what contains the characters 
causing the UnicodeEncodeError. Just FYI these values can be printed fine, but 
not formatted. For example, I modified my copy of:

http://code.google.com/p/volatility/source/browse/trunk/volatility/win32/hashdump.py#318

And instead of the above yield statement, I made it do this instead:

the_rest = ":{0}:{1}:{2}:::".format(int(str(user.Name), 16),
                                              lmhash.encode('hex'), nthash.encode('hex'))
print get_user_name(user), the_rest

Now when I run the plugin on a Russian memory dump, I see:

Администратор :500:aad3b435b51404eeaad3b435b51404ee:31d6cfe0d16ae931b73c59d7e0c089c0:::

Hopefully that shows up correctly here in Google Code. For other people reading 
this issue, the value on the far left is "Administrator" in Russian. 

Ikelos, Scudette, do you have any thoughts on how to format user names in these 
other languages properly?

Original comment by michael.hale@gmail.com on 14 Nov 2011 at 3:37
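
For readers following along, here is a minimal Python 3 sketch of how the output line above is put together. It is not the Volatility code: format_hash_line is a hypothetical name, the key name "000001F4" is an illustrative value (hex for 500), and bytes.hex() stands in for Python 2's s.encode('hex'). The two hashes are the ones shown in the output above.

```python
def format_hash_line(user_name, key_name, lmhash, nthash):
    """Return 'name:rid:lmhash:nthash:::' for one account (hypothetical helper)."""
    # The thread's code does int(str(user.Name), 16): the key name is the RID in hex.
    rid = int(key_name, 16)
    # bytes.hex() is the Python 3 spelling of Python 2's s.encode('hex').
    return "{0}:{1}:{2}:{3}:::".format(user_name, rid, lmhash.hex(), nthash.hex())


lm = bytes.fromhex("aad3b435b51404eeaad3b435b51404ee")
nt = bytes.fromhex("31d6cfe0d16ae931b73c59d7e0c089c0")
line = format_hash_line("Администратор", "000001F4", lm, nt)
```

In Python 3 the template is already a text string, so the user name formats without trouble. The failure discussed in this issue is specific to Python 2 byte-string templates.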

GoogleCodeExporter commented 9 years ago

Welcome to the hell that is unicode :-)

What we are seeing here is a difference in behaviour between the Python 2 
string interpolation mechanism and the new 3.0 (really 2.6+) format mechanism. 
The behaviour is different when interpolating mixed unicode and string objects 
together. For example:

a="%s %s" % (u"Администратор", "hello")
print type(a)
<type 'unicode'>

The single unicode parameter is causing the format string to be coerced into 
unicode type.

On the other hand this:

a="{0} {1}".format(u"Администратор", "hello")
raises <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode 
characters in position 0-12: ordinal not in range(128)

So the new behaviour forces the first parameter down into a string, rather than 
the format string into unicode. This is a better choice because it causes 
ambiguous code like the above to break early and more frequently (so it can be 
fixed).

The solution is to explicitly make the format string a unicode object, e.g.
a=u"{0} {1}".format(u"Администратор", "hello")

Of course this will break if the second arg is a byte string that cannot be 
decoded as ASCII, so we need to be careful.

In the above this fix should work:

yield u"{0}:{1}:{2}:{3}:::".format(get_user_name(user), int(str(user.Name), 16),
                                   lmhash.encode('hex'), nthash.encode('hex'))

Original comment by scude...@gmail.com on 14 Nov 2011 at 4:26
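
The implicit conversion described in the comment above can be reproduced by hand. The sketch below is written for Python 3, where the ASCII step has to be requested explicitly; on Python 2 the same step happened silently whenever a unicode value was formatted into a byte-string template. It also shows where "position 0-12" in the error message comes from: the name is 13 characters long and every one of them is outside ASCII.

```python
name = "Администратор"  # 13 characters, indexes 0 through 12

try:
    # The step Python 2 attempted implicitly for "{0}".format(u"...")
    name.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.end is exclusive, so (0, 13) corresponds to "position 0-12"
    failure = (exc.encoding, exc.start, exc.end)
else:
    failure = None

# With a text (unicode) template no encoding step is needed at all.
line = "{0} {1}".format(name, "hello")
```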

GoogleCodeExporter commented 9 years ago

Thanks for the explanation Scudette!

Hmm when I try yield u"" that works but there's a UnicodeEncodeError when the 
yielded unicode string is passed to outfd.write here:

outfd.write(d + "\n")

http://code.google.com/p/volatility/source/browse/trunk/volatility/plugins/registry/lsadump.py#101

I changed it to outfd.write(d + u"\n") and also put the line break at the end 
of the yielded unicode string and then just did outfd.write(d). They both still 
cause an exception. 

This is odd since type(d) is obviously unicode and outfd.write(u"testing") 
alone has no issues.

Original comment by michael.hale@gmail.com on 14 Nov 2011 at 4:47

GoogleCodeExporter commented 9 years ago

Mike, the secret to understanding unicode in python is as follows:

- A unicode object is an abstract object which can only exist inside the 
program. It is not real and has no physical representation.

- When you need to send or receive a unicode object from your program you must 
represent it in a concrete way (i.e. in bytes). This is called encoding or 
decoding. You must explicitly encode or decode (because the default is an 
implicit ascii conversion, which will break frequently).

So in this case the unicode object lives inside the bowels of Volatility until 
it hits the line outfd.write(d + "\n"). At this point it must be encoded to 
bytes (you cannot write a unicode object to a physical thing like the terminal 
or the network etc).

So you must encode it on output (and decode it on input):
outfd.write(d.encode("utf8") + "\n"). We typically wrap these calls with 
helpers:

outfd.write(SmartStr(d+"\n"))

Where SmartStr guarantees that the output will be a string encoded to utf8 if 
necessary. These calls must only ever exist on interfaces from the program 
(i.e. when data is leaving the program it must be encoded, and when it's arriving 
into the program it must be decoded; all data internally must be unicode 
objects). This makes it easy to see the external interfaces for your code.

Note that outfd.write(d + u"\n") is the same as outfd.write(d + "\n") because 
if d is a unicode object, the "\n" will automatically be converted to unicode 
anyway.

Original comment by scude...@gmail.com on 14 Nov 2011 at 5:08
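
A rough idea of what a SmartStr-style helper could look like, written for Python 3. This is only a sketch under assumptions: smart_str is a hypothetical name, it is not an existing Volatility function, and the BytesIO object merely stands in for a real output such as a file or pipe.

```python
import io


def smart_str(value, encoding="utf-8"):
    """Hypothetical SmartStr-like helper: always return encoded bytes."""
    if isinstance(value, bytes):
        return value  # already encoded, pass through untouched
    return str(value).encode(encoding)  # text (or any other object) gets encoded


outfd = io.BytesIO()  # stands in for a file, pipe or socket
outfd.write(smart_str("Администратор:500\n"))
outfd.write(smart_str(b"raw bytes\n"))
written = outfd.getvalue()
```

Keeping the encode call inside one helper means the places where data leaves the program are easy to find, which is the point made in the comment above.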

GoogleCodeExporter commented 9 years ago

Got it, thanks for teaching me something new ;-)

The outfd.write(d.encode("utf8") + "\n") indeed does work. 

But regarding the SmartStr(), when you say "We typically wrap these calls with 
helpers" do you mean in other projects? I haven't seen it done in Volatility so 
just wondering if I missed it and if there's already a SmartStr()-like function 
defined somewhere. 

Also I suppose we need to do a more thorough analysis of what other plugins may 
need this type of helper function (like filescan, mutantscan, etc).

Original comment by michael.hale@gmail.com on 15 Nov 2011 at 3:49

GoogleCodeExporter commented 9 years ago

Saw this on twitter and may be a good reference (scudette probably knows it 
already, but for others):

http://farmdev.com/talks/unicode/

Original comment by michael.hale@gmail.com on 18 Jan 2012 at 3:15

GoogleCodeExporter commented 9 years ago

Original comment by mike.auty@gmail.com on 10 Mar 2012 at 11:38

Added labels: Task-Unicode

GoogleCodeExporter commented 9 years ago

Ok, so this is mostly being caused by the registry code relying on vm.reads and 
structs for parsing, rather than using the Object model.  As such, here's an 
interim patch for review (it's not a good solution, but it should reduce the 
errors and allow the plugins to complete), and I've updated the summary to 
reflect that the registry code needs modernizing...

Original comment by mike.auty@gmail.com on 22 Mar 2012 at 12:46

Changed title: Registry code needs converting to the new object model to reduce unicode errors

Attachments:

volatility-registry-unicode.patch

GoogleCodeExporter commented 9 years ago

Hey Mike, 

Thanks for the patch, there are a few related things that need fixing before I 
can fully test it. One of the small fixes I applied in r1571. 

Here's the other one (hashdump):

Volatile Systems Volatility Framework 2.1_alpha
Traceback (most recent call last):
  File "vol.py", line 173, in <module>
    main()
  File "vol.py", line 164, in main
    command.execute()
  File "/Users/mhl/volatility-hashdump/volatility/commands.py", line 101, in execute
    func(outfd, data)
  File "/Users/mhl/volatility-hashdump/volatility/plugins/registry/lsadump.py", line 98, in render_text
    for d in data:
  File "/Users/mhl/volatility-hashdump/volatility/win32/hashdump.py", line 314, in dump_hashes
    lmhash, nthash = get_user_hashes(user, hbootkey)
TypeError: 'NoneType' object is not iterable

So the problem is hashdump.get_user_hashes() returns None or a tuple. The 
calling function expects a tuple:

lmhash, nthash = get_user_hashes(user, hbootkey)

So I would fix this myself, but I'm not sure the best way to do it. Should we 
return None, None or call get_user_hashes in a try/except block while catching 
TypeError? 

Here are a few discussions - what's the recommended way for volatility code? 

http://stackoverflow.com/questions/3448701/function-returning-a-tuple-or-none-how-to-call-that-function-nicely

http://stackoverflow.com/questions/1274875/returning-none-or-a-tuple-and-unpacking

Original comment by michael.hale@gmail.com on 23 Mar 2012 at 6:21

GoogleCodeExporter commented 9 years ago

Hi Mike,
   I think the function should either return an empty tuple or raise
an exception, depending on what can be done about it. I am leaning
towards just returning an empty tuple.

Michael.

Original comment by scude...@gmail.com on 23 Mar 2012 at 10:17
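
One detail worth checking before picking a return value is how each candidate behaves when the caller unpacks it into two names. The sketch below (try_unpack is a hypothetical helper, used only for illustration) tries the three values mentioned so far in this thread.

```python
def try_unpack(result):
    """Attempt `lmhash, nthash = result` and report what happens."""
    try:
        lmhash, nthash = result
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return (lmhash, nthash)


outcomes = {
    "None": try_unpack(None),                  # TypeError, as in the traceback
    "empty tuple": try_unpack(()),             # ValueError: nothing to unpack
    "(None, None)": try_unpack((None, None)),  # unpacks, caller must check values
}
```

So an empty tuple only avoids the exception if the caller tests the result before unpacking it. Of the three, (None, None) is the only one that can be unpacked directly.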

GoogleCodeExporter commented 9 years ago

Well, we're going to have to do the check somewhere, options are:

a, b = func()
if not a or not b:
  blowup()
dostuff()

ret = func()
if not ret:
  blowup()
a, b = ret
dostuff()

try:
  a, b = func()
  dostuff()
except SpecificException:
  blowup()

It's a style choice, I'd probably go for the second or first option, but even 
the third option would be fine as long as you're catching a very specific 
exception.  Your choice...

Original comment by mike.auty@gmail.com on 24 Mar 2012 at 12:32
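
As an illustration, the second option above could look like this inside a generator. This is a Python 3 sketch and not the fix that was committed: get_user_hashes here is a simplified stand-in with a different signature from the real function, and the user records are made-up dictionaries.

```python
def get_user_hashes(user):
    """Hypothetical stand-in: return (lmhash, nthash), or None on failure."""
    return user.get("hashes")


def dump_hashes(users):
    for user in users:
        ret = get_user_hashes(user)
        if not ret:
            continue  # or raise / report, depending on what the plugin needs
        lmhash, nthash = ret  # safe: ret is known to be a 2-tuple here
        yield "{0}:{1}:{2}".format(user["name"], lmhash.hex(), nthash.hex())


users = [
    {"name": "alice", "hashes": (b"\xaa\xd3", b"\x31\xd6")},
    {"name": "broken"},  # no hashes could be recovered for this account
]
lines = list(dump_hashes(users))
```

The check happens once, before unpacking, so the same code works whether the function signals failure with None or with an empty tuple.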

GoogleCodeExporter commented 9 years ago

Hey guys, in r1587 I fixed the issues discussed in comments 9-11. The patch from 
Ikelos in comment 8 seems fine with my testing so far. 

So there are no more UnicodeEncodeErrors, which is great. Regarding the Russian 
user name in hashdump output, now you're going to see:

?????:501:<HASH HERE>

So we eliminated the Unicode errors by converting the non-ASCII characters 
to '?', but what if someone really wants/needs to see Администратор? What 
options do they have?

Original comment by michael.hale@gmail.com on 2 Apr 2012 at 4:23
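
To make the trade-off in the question above concrete, here is a small Python 3 sketch comparing three ways of turning the name into bytes. It uses only the standard codec error handlers and is not what r1587 does internally.

```python
name = "Администратор"

lossy = name.encode("ascii", "replace")             # every character becomes '?'
escaped = name.encode("ascii", "backslashreplace")  # ASCII-safe \uXXXX escapes
utf8 = name.encode("utf-8")                         # real characters, needs a UTF-8 sink

# The escaped form is still pure ASCII but can be turned back into the name.
recovered = escaped.decode("unicode_escape")
```

The '?' form cannot be reversed, while the escaped form can, as long as the original text contains no literal backslashes that would be mistaken for escapes. Writing UTF-8 keeps the name readable but only helps if the terminal or file on the other end expects UTF-8.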

GoogleCodeExporter commented 9 years ago

Please note some of this discussion was picked up in the comments of r1587:

http://code.google.com/p/volatility/source/detail?r=1587

Original comment by michael.hale@gmail.com on 4 Apr 2012 at 1:56

GoogleCodeExporter commented 9 years ago

As one of the enhancements when updating the registry code, we should change 
these lines in printkey.py:

if tp == 'REG_BINARY':
    dat = "\n" + "\n".join(["{0:#010x}  {1:<48}  {2}".format(o, h, ''.join(c)) for o, h, c in utils.Hexdump(dat)])
if tp in ['REG_SZ', 'REG_EXPAND_SZ', 'REG_LINK']:
    dat = dat.encode("ascii", 'backslashreplace')
if tp == 'REG_MULTI_SZ':

We should use:

if tp == 'REG_BINARY':
    ...
elif tp in [....]:
    ...
elif tp == 'REG_MULTI_SZ':

Original comment by michael.hale@gmail.com on 4 Apr 2012 at 1:57
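
A runnable Python 3 sketch of the if/elif shape suggested above. Everything in it is an assumption made for illustration: format_value is a hypothetical function, the REG_BINARY branch uses bytes.hex() as a stand-in for the utils.Hexdump rendering, and the REG_MULTI_SZ branch guesses that the data is a list of strings because the thread does not show that branch.

```python
def format_value(tp, dat):
    """Hypothetical formatter: exactly one branch runs per value type."""
    if tp == 'REG_BINARY':
        # Stand-in for the utils.Hexdump-based rendering in printkey.py
        return dat.hex()
    elif tp in ['REG_SZ', 'REG_EXPAND_SZ', 'REG_LINK']:
        # Same error handler as the thread's code, decoded back to text
        return dat.encode("ascii", 'backslashreplace').decode("ascii")
    elif tp == 'REG_MULTI_SZ':
        # Assumption: dat is a list of strings (branch not shown in the thread)
        return ", ".join(dat)
    return str(dat)  # any other type falls through unchanged
```

Because the types are mutually exclusive, the elif chain makes it explicit that a value rewritten by one branch is never inspected again by the next.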

GoogleCodeExporter commented 9 years ago

Gleeda do you want to try and get your registry API stuff in before 2.1? If so, 
this might be a good time since there will be at least one good round of 
registry cleanup happening soon.

Original comment by michael.hale@gmail.com on 4 Apr 2012 at 1:58

GoogleCodeExporter commented 9 years ago

Yes, I would :-)  Should I create another issue for that?

Original comment by jamie.l...@gmail.com on 4 Apr 2012 at 2:08

GoogleCodeExporter commented 9 years ago

You might note that I am currently undergoing a rewrite of the registry support 
in my branch so it might make sense to look there first.

Also, I am not sure at all that unicode support is really addressed at the 
moment in trunk. I think it's going to be hard to properly fix it until we have 
some way to enforce an output encoding. For example, in the new framework we 
wrap stdout in a class which ensures utf8 encoding (otherwise it will raise 
when stdout is a pipe).

Original comment by scude...@gmail.com on 4 Apr 2012 at 2:52
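
The idea of enforcing an output encoding can be sketched with the standard library alone. The example below is written for Python 3 and is not the wrapper class from the new framework: utf8_writer is a hypothetical name, and a BytesIO object stands in for stdout redirected to a pipe.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always UTF-8 encoded."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


pipe = io.BytesIO()  # stands in for stdout redirected to a pipe
out = utf8_writer(pipe)
out.write("Администратор:500\n")
out.flush()  # push the encoded bytes through to the underlying stream
written = pipe.getvalue()
```

In a real program the binary stream would be something like sys.stdout.buffer, so the encoding no longer depends on the locale or on whether the output is a terminal.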

GoogleCodeExporter commented 9 years ago

Original comment by michael.hale@gmail.com on 1 Feb 2013 at 4:38

Added labels: Milestone-3.0.x
Removed labels: Task-Unicode

GoogleCodeExporter commented 9 years ago

Original comment by michael.hale@gmail.com on 9 Apr 2013 at 7:34

Changed title: Registry code needs converting to the new object model

GoogleCodeExporter commented 9 years ago

Issue 220 has been merged into this issue.

Original comment by michael.hale@gmail.com on 9 Apr 2013 at 7:36

GoogleCodeExporter commented 9 years ago

Issue 476 has been merged into this issue.

Original comment by jamie.l...@gmail.com on 7 Mar 2014 at 4:29

GoogleCodeExporter commented 9 years ago

Issue 425 has been merged into this issue.

Original comment by jamie.l...@gmail.com on 7 Mar 2014 at 4:33

GoogleCodeExporter commented 9 years ago

Issue 358 has been merged into this issue.

Original comment by jamie.l...@gmail.com on 7 Mar 2014 at 4:34

GoogleCodeExporter commented 9 years ago

Issue 92 has been merged into this issue.

Original comment by jamie.l...@gmail.com on 7 Mar 2014 at 4:41

GoogleCodeExporter commented 9 years ago

Original comment by jamie.l...@gmail.com on 7 Mar 2014 at 7:29

GoogleCodeExporter commented 9 years ago

Original comment by mike.auty@gmail.com on 18 Feb 2015 at 6:52

Changed state: Duplicate

stephanelpaul / volatility

Registry code needs converting to the new object model #168