Closed GoogleCodeExporter closed 9 years ago
I can confirm this. Here's some more information.
The error message is a little misleading:
File "/home/tb/work/forensics/volatility-2.0/volatility/win32/hashdump.py",
line 319, in dump_hashes
lmhash.encode('hex'), nthash.encode('hex'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12:
ordinal not in range(128)
The lmhash.encode('hex') and nthash.encode('hex') lines are just fine. It's
really crashing on the first part of that statement, which isn't shown:
yield "{0}:{1}:{2}:{3}:::".format(get_user_name(user), int(str(user.Name), 16),
lmhash.encode('hex'), nthash.encode('hex'))
The value returned by get_user_name(user) is what contains the characters
causing the UnicodeEncodeError. Just FYI these values can be printed fine, but
not formatted. For example, I modified my copy of:
http://code.google.com/p/volatility/source/browse/trunk/volatility/win32/hashdump.py#318
And instead of the above yield statement, I made it do this instead:
the_rest = ":{0}:{1}:{2}:::".format(int(str(user.Name), 16),
lmhash.encode('hex'), nthash.encode('hex'))
print get_user_name(user), the_rest
Now when I run the plugin on a Russian memory dump, I see:
Администратор
:500:aad3b435b51404eeaad3b435b51404ee:31d6cfe0d16ae931b73c59d7e0c089c0:::
Hopefully that shows up correctly here in Google Code. For other people reading
this issue, the value on the far left is "Administrator" in Russian.
Ikelos, Scudette, do you have any thoughts on how to format user names in these
other languages properly?
Original comment by michael.hale@gmail.com
on 14 Nov 2011 at 3:37
Welcome to the hell that is unicode :-)
What we are seeing here is a difference in behaviour between the python 2
string interpolation mechanism and the new 3.0 (really 2.6+) format mechanism.
The behaviour is different when interpolating mixed unicode and string objects
together. For example:
a="%s %s" % (u"Администратор", "hello")
print type(a)
<type 'unicode'>
The single unicode parameter is causing the format string to be coerced into
unicode type.
On the other hand this:
a="{0} {1}".format(u"Администратор", "hello")
raises <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode
characters in position 0-12: ordinal not in range(128)
So the new behaviour forces the first parameter down into string, rather than
the format string into unicode. This is a better choice because it causes
ambiguous code like above to break early and more frequently (so it can be
fixed).
The solution is to explicitly make the format string a unicode object - e.g.
a=u"{0} {1}".format(u"Администратор", "hello")
Of course this will break if the second arg is a byte string that cannot be
decoded as ascii, so we need to be careful.
In the above this fix should work:
yield u"{0}:{1}:{2}:{3}:::".format(get_user_name(user), int(str(user.Name), 16),
lmhash.encode('hex'), nthash.encode('hex'))
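For anyone who wants to poke at this outside Volatility, here is a minimal sketch of the implicit step that fails. It only assumes that the byte-string template pushes the unicode argument through an ASCII encode, as described above; the helper name implicit_ascii and the sample hash values are made up for illustration.

```python
# -*- coding: utf-8 -*-
name = u"Администратор"


def implicit_ascii(text):
    """Mimic the implicit ASCII encode applied to a unicode argument (illustrative only)."""
    return text.encode("ascii")


try:
    implicit_ascii(name)
    failed = False
    error_span = None
except UnicodeEncodeError as exc:
    failed = True
    # exc.start is the first bad index, exc.end is one past the last bad index.
    error_span = (exc.start, exc.end)

# With a unicode template nothing has to be squeezed into ASCII.
line = u"{0}:{1}:{2}:{3}:::".format(name, 500, u"aad3", u"31d6")
```

The captured span covers all 13 characters of the name, which matches the "position 0-12" in the traceback.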
Original comment by scude...@gmail.com
on 14 Nov 2011 at 4:26
Thanks for the explanation Scudette!
Hmm when I try yield u"" that works but there's a UnicodeEncodeError when the
yielded unicode string is passed to outfd.write here:
outfd.write(d + "\n")
http://code.google.com/p/volatility/source/browse/trunk/volatility/plugins/registry/lsadump.py#101
I changed it to outfd.write(d + u"\n") and also put the line break at the end
of the yielded unicode string and then just did outfd.write(d). They both still
cause an exception.
This is odd since type(d) is obviously unicode and outfd.write(u"testing")
alone has no issues.
Original comment by michael.hale@gmail.com
on 14 Nov 2011 at 4:47
Mike, the secret to understanding unicode in python is as follows:
- A unicode object is an abstract object which can only exist inside the
program. It is not real and has no physical representation.
- When you need to send or receive a unicode object from your program you must
represent it in a concrete way (i.e. in bytes). This is called encoding or
decoding. You must explicitly encode or decode (because the default is an
implicit ascii conversion - which will break frequently).
So in this case the unicode object lives inside the bowels of volatility until
it hits the line outfd.write(d + "\n"). At this point it must be encoded to
bytes (you cannot write a unicode object to a physical thing like the terminal
or the network etc).
So you must encode it on output (and decode it on input):
outfd.write(d.encode("utf8") + "\n"). We typically wrap these calls with
helpers:
outfd.write(SmartStr(d+"\n"))
Where SmartStr guarantees that the output will be a string encoded to utf8 if
necessary. These calls must only ever exist on interfaces from the program
(i.e. when data is leaving the program it must be encoded - when it's arriving
into the program it must be decoded - all data internally must be unicode
objects). This makes it easy to see the external interfaces for your code.
Note that outfd.write(d + u"\n") is the same as outfd.write(d + "\n") because
if d is a unicode object, the "\n" will automatically be converted to unicode
anyway.
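A rough sketch of what such a helper could look like, in case it is useful. The name smart_str and its behaviour are assumptions for illustration, not an existing Volatility function, and io.BytesIO stands in for a byte-oriented outfd.

```python
# -*- coding: utf-8 -*-
import io

text_type = type(u"")  # unicode on Python 2, str on Python 3


def smart_str(value, encoding="utf-8"):
    """Hypothetical helper: always return bytes, encoding text only when needed."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, bytes):
        return value  # already encoded, pass through untouched
    return text_type(value).encode(encoding)


outfd = io.BytesIO()  # stand-in for a byte-oriented output stream
d = u"Администратор:500:aad3:31d6:::"
outfd.write(smart_str(d + u"\n"))
written = outfd.getvalue()
```

Keeping the encode inside one helper means the only places that mention an encoding are the points where data leaves the program.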
Original comment by scude...@gmail.com
on 14 Nov 2011 at 5:08
Got it, thanks for teaching me something new ;-)
The outfd.write(d.encode("utf8") + "\n") indeed does work.
But regarding the SmartStr(), when you say "We typically wrap these calls with
helpers" do you mean in other projects? I haven't seen it done in Volatility so
just wondering if I missed it and if there's already a SmartStr()-like function
defined somewhere.
Also I suppose we need to do a more thorough analysis of what other plugins may
need this type of helper function (like filescan, mutantscan, etc).
Original comment by michael.hale@gmail.com
on 15 Nov 2011 at 3:49
Saw this on twitter and may be a good reference (scudette probably knows it
already, but for others):
http://farmdev.com/talks/unicode/
Original comment by michael.hale@gmail.com
on 18 Jan 2012 at 3:15
Original comment by mike.auty@gmail.com
on 10 Mar 2012 at 11:38
Ok, so this is mostly being caused by the registry code relying on vm.reads and
structs for parsing, rather than using the Object model. As such, here's an
interim patch for review (it's not a good solution, but it should reduce the
errors and allow the plugins to complete), and I've updated the summary to
reflect that the registry code needs modernizing...
Original comment by mike.auty@gmail.com
on 22 Mar 2012 at 12:46
Attachments:
Hey Mike,
Thanks for the patch, there are a few related things that need fixing before I
can fully test it. One of the small fixes I applied in r1571.
Here's the other one (hashdump):
Volatile Systems Volatility Framework 2.1_alpha
Traceback (most recent call last):
File "vol.py", line 173, in <module>
main()
File "vol.py", line 164, in main
command.execute()
File "/Users/mhl/volatility-hashdump/volatility/commands.py", line 101, in execute
func(outfd, data)
File "/Users/mhl/volatility-hashdump/volatility/plugins/registry/lsadump.py", line 98, in render_text
for d in data:
File "/Users/mhl/volatility-hashdump/volatility/win32/hashdump.py", line 314, in dump_hashes
lmhash, nthash = get_user_hashes(user, hbootkey)
TypeError: 'NoneType' object is not iterable
So the problem is hashdump.get_user_hashes() returns None or a tuple. The
calling function expects a tuple:
lmhash, nthash = get_user_hashes(user, hbootkey)
So I would fix this myself, but I'm not sure the best way to do it. Should we
return None, None or call get_user_hashes in a try/except block while catching
TypeError?
Here are a few discussions - what's the recommended way for volatility code?
http://stackoverflow.com/questions/3448701/function-returning-a-tuple-or-none-how-to-call-that-function-nicely
http://stackoverflow.com/questions/1274875/returning-none-or-a-tuple-and-unpacking
Original comment by michael.hale@gmail.com
on 23 Mar 2012 at 6:21
Hi Mike,
I think the function should either return an empty tuple or raise
an exception, depending on what can be done about it. I am leaning
towards just returning an empty tuple.
Michael.
Original comment by scude...@gmail.com
on 23 Mar 2012 at 10:17
Well, we're going to have to do the check somewhere, options are:
a, b = func()
if not a or not b:
    blowup()
dostuff()

ret = func()
if not ret:
    blowup()
a, b = ret
dostuff()

try:
    a, b = func()
    dostuff()
except SpecificException:
    blowup()
It's a style choice, I'd probably go for the second or first option, but even
the third option would be fine as long as you're catching a very specific
exception. Your choice...
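For reference, a small runnable sketch of the check-before-unpacking style. find_hashes and dump_line are hypothetical stand-ins (not the real get_user_hashes or dump_hashes), and the table contents are invented.

```python
# -*- coding: utf-8 -*-
def find_hashes(user, table):
    """Hypothetical stand-in: return (lmhash, nthash), or None if the user has no entry."""
    return table.get(user)


def dump_line(user, table):
    ret = find_hashes(user, table)
    if not ret:
        return None  # nothing to unpack; the caller decides how to report it
    lmhash, nthash = ret
    return u"{0}:{1}:{2}".format(user, lmhash, nthash)


table = {u"alice": (u"aad3", u"31d6")}
ok_line = dump_line(u"alice", table)
missing = dump_line(u"bob", table)

# Unpacking the None directly reproduces the kind of TypeError seen in the traceback.
try:
    lmhash, nthash = find_hashes(u"bob", table)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc
```

Note that unpacking directly, as in a, b = func(), only works if the function always returns a pair, for example (None, None) instead of a bare None.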
Original comment by mike.auty@gmail.com
on 24 Mar 2012 at 12:32
Hey guys, in r1587 I fixed the issues discussed in comments 9-11. The patch from
Ikelos in comment 8 seems fine with my testing so far.
So there are no more UnicodeEncodeErrors, which is great. Regarding the Russian
user name in hashdump output, now you're going to see:
?????:501:<HASH HERE>
So we eliminated the unicode encode errors by converting the non-ASCII chars
to '?', but what if someone really wants/needs to see
Администратор? What options do they have?
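To make the trade-off concrete, here is a quick sketch comparing three standard codec choices on the same name. It is only an illustration of the error handlers; which one the plugin should use is still an open question.

```python
# -*- coding: utf-8 -*-
name = u"Администратор"

# Lossy: every non-ASCII character becomes '?', so the name cannot be recovered.
replaced = name.encode("ascii", "replace")

# ASCII-safe but reversible: characters are written as \uXXXX escapes.
escaped = name.encode("ascii", "backslashreplace")

# Lossless: needs a terminal or file that is read back as UTF-8.
utf8 = name.encode("utf-8")
```

The escaped form is ugly but can be turned back into the original name later, which the '?' form cannot.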
Original comment by michael.hale@gmail.com
on 2 Apr 2012 at 4:23
Please note some of this discussion was picked up in the comments of r1587:
http://code.google.com/p/volatility/source/detail?r=1587
Original comment by michael.hale@gmail.com
on 4 Apr 2012 at 1:56
As one of the enhancements when updating the registry code, we should change
these lines in printkey.py:
if tp == 'REG_BINARY':
    dat = "\n" + "\n".join(["{0:#010x} {1:<48} {2}".format(o, h, ''.join(c)) for o, h, c in utils.Hexdump(dat)])
if tp in ['REG_SZ', 'REG_EXPAND_SZ', 'REG_LINK']:
    dat = dat.encode("ascii", 'backslashreplace')
if tp == 'REG_MULTI_SZ':
Since the value types are mutually exclusive, we should use:
if tp == 'REG_BINARY':
    ...
elif tp in [....]:
    ...
elif tp == 'REG_MULTI_SZ':
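A self-contained sketch of that shape, for illustration only. render_value is a hypothetical stand-in, binascii.hexlify replaces the utils.Hexdump formatting, and the REG_MULTI_SZ handling is an assumption since the original branch body is not shown here.

```python
# -*- coding: utf-8 -*-
import binascii


def render_value(tp, dat):
    """Hypothetical stand-in for the printkey formatting step (not Volatility code)."""
    if tp == 'REG_BINARY':
        # Raw bytes: show them as hex text.
        dat = binascii.hexlify(dat).decode("ascii")
    elif tp in ('REG_SZ', 'REG_EXPAND_SZ', 'REG_LINK'):
        # Text: escape anything outside ASCII instead of raising.
        dat = dat.encode("ascii", "backslashreplace").decode("ascii")
    elif tp == 'REG_MULTI_SZ':
        # NUL-separated strings: split and drop the empty terminators.
        dat = [s for s in dat.split(u"\x00") if s]
    return dat
```

With the elif chain at most one branch runs, and unknown types fall through unchanged.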
Original comment by michael.hale@gmail.com
on 4 Apr 2012 at 1:57
Gleeda do you want to try and get your registry API stuff in before 2.1? If so,
this might be a good time since there will be at least one good round of
registry cleanup happening soon.
Original comment by michael.hale@gmail.com
on 4 Apr 2012 at 1:58
Yes, I would :-) Should I create another issue for that?
Original comment by jamie.l...@gmail.com
on 4 Apr 2012 at 2:08
You might note that I am currently undergoing a rewrite of the registry support
in my branch so it might make sense to look there first.
Also, I am not sure at all that unicode support is really addressed at the
moment in trunk. I think it's going to be hard to properly fix it until we have
some way to enforce an output encoding. For example in the new framework we
wrap stdout in a class which ensures utf8 encoding (otherwise it will raise
when we have stdout being a pipe).
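As a rough idea of what such a wrapper might look like (a sketch under assumptions, not the class from the branch): Utf8Writer is a made-up name, and io.BytesIO stands in for a piped, byte-oriented stdout.

```python
# -*- coding: utf-8 -*-
import io


class Utf8Writer(object):
    """Hypothetical wrapper: accepts text and writes encoded bytes to a binary stream."""

    def __init__(self, stream, encoding="utf-8"):
        self._stream = stream
        self._encoding = encoding

    def write(self, data):
        if not isinstance(data, bytes):
            data = data.encode(self._encoding)  # encode only at the boundary
        self._stream.write(data)

    def flush(self):
        self._stream.flush()


pipe = io.BytesIO()  # stand-in for stdout redirected to a pipe
out = Utf8Writer(pipe)
out.write(u"Администратор:500:\n")
out.flush()
```

The standard library's codecs.getwriter("utf-8") builds a similar encoding wrapper around a byte stream, so a hand-written class is only needed if extra behaviour is wanted.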
Original comment by scude...@gmail.com
on 4 Apr 2012 at 2:52
Original comment by michael.hale@gmail.com
on 1 Feb 2013 at 4:38
Original comment by michael.hale@gmail.com
on 9 Apr 2013 at 7:34
Issue 220 has been merged into this issue.
Original comment by michael.hale@gmail.com
on 9 Apr 2013 at 7:36
Issue 476 has been merged into this issue.
Original comment by jamie.l...@gmail.com
on 7 Mar 2014 at 4:29
Issue 425 has been merged into this issue.
Original comment by jamie.l...@gmail.com
on 7 Mar 2014 at 4:33
Issue 358 has been merged into this issue.
Original comment by jamie.l...@gmail.com
on 7 Mar 2014 at 4:34
Issue 92 has been merged into this issue.
Original comment by jamie.l...@gmail.com
on 7 Mar 2014 at 4:41
Original comment by jamie.l...@gmail.com
on 7 Mar 2014 at 7:29
Original comment by mike.auty@gmail.com
on 18 Feb 2015 at 6:52
Original issue reported on code.google.com by
olbioua%...@gtempaccount.com
on 14 Nov 2011 at 2:46