filescan plugin has an error with unicode handling

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. run the filescan plugin's parse_string method on a Unicode object whose 
Length is larger than 256 bytes.

What is the expected output? What do you see instead?

We should see the full file name being returned. Instead we get a 256 byte 
truncated filename.

What version of the product are you using? On what operating system?

Current head of the Volatility trunk (specifically, version 1067).

Please provide any additional information below.

Problem can be fixed as follows:

At line 75 of filescan.py, we have the following code:

        return repr(string[:255].decode("utf16", "ignore").encode("utf8", "xmlcharrefreplace"))

Replacing the 255 with string_length should resolve this issue.

Original issue reported on code.google.com by carl.pulley on 9 Oct 2011 at 12:39

GoogleCodeExporter commented 8 years ago

Hmm good point. A file length should be limited to MAX_PATH characters (which 
is actually 260 and not 255 anyway, so that should be fixed). The other point 
you bring up is that many other plugins inherit from FileScan and use 
parse_string, and those other objects may not be limited to MAX_PATH.

Original comment by michael.hale@gmail.com on 9 Oct 2011 at 2:48

GoogleCodeExporter commented 8 years ago

Not in front of my computer right now. The parse string function is removed
in my branch in favour of Unicode string. This is possible by extending v()
to support a different as.

This might remove the need for special helper methods.

Original comment by scude...@gmail.com on 9 Oct 2011 at 8:10

GoogleCodeExporter commented 8 years ago

Ah yeah, that is an option. We should tie this in with Issue #87 also (where 
scudette originally described his replacement for parse_string and 
OBJECT_HEADER stuff). 

Only thing we'd have to check is make sure everything that uses parse_string 
through inheritance is really a UNICODE_STRING and not just an LPWSTR.

Original comment by michael.hale@gmail.com on 10 Oct 2011 at 3:08

GoogleCodeExporter commented 8 years ago

v() seems like a convenient solution, rather than a good one.  It's our problem 
that we instantiated an object that contains a pointer using a physical address 
location, and the object API is not the right place to compensate for that.

For most objects, asking for their value and handing in a different AS makes no 
sense (particularly without changing the offset).  And for most users having a 
parameter which is never used makes no sense.  Given that, and how it's nearly 
impossible to document well, it's not a good API to support.  This function is 
*only* useful for specific scenarios where the value of an object depends on 
dereferencing another.

I would much rather a way for objects to know which AS they'd been instantiated 
in, and whether a subobject dereference would work.  Since that would be quite 
complex, I think it would be better in the short term to have the helper 
function (formerly known as parse_string) corrected, and added to the 
UnicodeString object as an additional function, which will allow getting the 
value from the object when it's been defined in a physical address space.  It's 
currently called UnicodeString.virtual_v(virtual=vm), but I'm looking for a 
better name so please suggest some.

This would then make it clear to the user/developer what was going on, 
centralize and properly organize/locate the helper code and keep the rest of 
the API unharmed.

I've worked up an initial patch to remove parse_string support and would really 
like a code review of it.  It ditches FileScan.kernel_address_space, since 
carrying it around in the plugin seemed worse than passing the translated 
objects through to calculate.  It does still make use of _OBJECT_HEADER.kas, 
and it doesn't do anything concerning _OBJECT_TYPE.  It'll take me a little 
longer to port that from scudette's work.  It does change the calculate output 
from some plugins, but all of the inherited plugins overrode the calculate 
function.

Lastly, I'd also appreciate knowing how best to fix up the code taking into 
account MAX_PATH.  If someone knows which of the uses of parse_string may be 
LWPWSTRs instead of UNICODE_STRINGs please let me know...

Original comment by mike.auty@gmail.com on 16 Oct 2011 at 8:42

Attachments:

vol21-parse_string-removal.patch

GoogleCodeExporter commented 8 years ago

I disagree with this. We do need a wider discussion if we are to get a
better more flexible API.

Indeed the whole point of the API is about convenience. You can always
re-instantiate Object() over the top of different offsets as it is
currently done in the helper functions - but that makes it tedious and
error prone (as is evident in the different instances of the same
helper functions peppered around the place).  The point of the API is
to make common things seemless and I dare say that switching address
spaces is a very common operation in memory forensics.

It is often not our choice which address space we start in - e.g. file
scanners may start with physical and want to switch to virtual. There
is no "fault" as such - it just depends on the analysis technique.

I disagree. The v() function mean, evaluate this object in some
context - an AS is critical to the context. In many objects v()
returns a BaseObject instance, for pointers it mean dereference etc.
It makes sense to provide the address space in which it should be
evaluated right there. May I also remind you that way back in 1.2 and
1.3 days we instantiated every object with an AS parameter - so every
time we dereferenced a struct member or a pointer we provided an AS
and did switch from one to the other all the time.

That is the point of a default parameter. It is sometimes used, but often not.

I disagree. The meaning of v() is object dependent anyway for all
objects. For example for pointers it means to dereference the pointer
and returns an instance of the thing it was pointing to - for integers
it actually returns an integer (i.e. not a BaseObject instance at
all). The current API seems inconsistent enough - its still reasonably
documented and pretty obvious what we should get.

Indeed, the address space simply specifies the environment which
should be used to dereference it. Since an integer has the same value
regardless of address space it is not affected - however a
UNICODE_STRING does depend on the address space etc.

Objects do know which AS they are instantiated in. I am not sure what
you mean by subobject.

Would you also add additional methods to every other Object ? For example:

- A pointer would need to be derefered in a different AS - pointers
should get virtual_v().
- A linked list would most definitely need a virtual_v().
- A UNICODE_STRING would need a virtual_v().

The only thing which does not need a virtual_v() are the native types
(an integer has the same value in all AS's).

Would a call to virtual_v()  with no args be exactly equivalent to a
call to the current v()? and if so can we just alias one to the other
so we dont need to re-implement the same thing multiple times
increasing the maintenance costs?

I am pretty sure that if a developer wants to switch AS's they have a
reason to and understand why. If they dont understand what its for,
they are unlikely to provide this optional value. Lets give our
developers a bit more credit.

I am against adding new ad hoc methods to extend the API in
inconsistent ways. I would rather the API be extended in a logical
coherent manner - yet with backwards compatibility as much as
possible. I see the addition of a new method which is mostly the same
as another method but with a small tweak as an awkward API. This would
be ok in a language like C which does not provide optional parameters,
but in python this is just ugly.

Original comment by scude...@gmail.com on 16 Oct 2011 at 9:43

GoogleCodeExporter commented 8 years ago

I'm sensing that you disagree with this...  5:)

How did you get "evaluate this object in some context" from the single letter 
v()?  Even the docstring doesn't mention context.  The point of proxying the 
objects was so that people didn't even need to call v(), and as it currently 
stands v() is most commonly used to get the actual python object rather than a 
proxied one (and it's a shame that the v() function is needed at all).  There 
are currently no circumstances where v is context dependent, the object itself 
is supposed to maintain its context, and that's what I was trying to get across 
but seem to have done a bad job of communicating.

If we want to allow objects to be instantiated in ASes that they were not 
originally created in, then an object should either hold all the state about 
which ASes it can/should be interpreted in, or none of them.  Hanging onto one, 
that may be the wrong one, and then not even knowing which one it is, is not a 
good idea.

Currently we've got effectively two different types of object floating around, 
those that are instantiated in VAS, and those in PAS.  The PAS ones are broken, 
and should not allow any member to be dereferenced, but currently do.  It would 
be best if the object were aware about all the possible interpretations it 
could have (offset/vm for PAS, offset/vm for VAS and possibly even offset/vm 
for process space).  That way when it's asked for a member, it can figure out 
whether it has enough information to provide it or not.  There would be a 
specific way of telling an object which spaces it can exist in, and where it 
lives in each space.

If we don't think the objects should be carrying their own state around, but 
merely be templates to apply to an AS, then we should go back to explicitly 
providing the AS every time we ask for a value.  What we've got at the moment 
is a middle ground where we assumed we'd only instantiate valid objects in the 
AS they were created in, and that's worked so far, but is now causing this 
discussion.

The current API shouldn't be inconsistent, and if it were, that's not a good 
reason to continue on in the same way, it's a reason to fix it.  If you check 
the code, Pointer.v() returns the integer value of the offset it's pointing at, 
and I believe it has done since we first wrote that code.  
Pointer.dereference() returns the dereferenced object (and makes use of .v() to 
get the target offset).  As such this seems consistent to me, and indicates 
that we provide special functions to certain objects when they're required, 
such as pointers.

You're suggesting that there are three cases where we need support for 
switching ASes: 

- Pointers I'm relatively happy to add a parameter to dereference_as (and in 
fact, this is in the patch I provided), because it makes sense that a pointer 
in one space may point to an object in another, and it makes sense to add it to 
that function (although you might question why there's a dereference and a 
dereference_as, given your argument about default parameters).
- UnicodeString's v() is just another convenience shortcut so that people don't 
have to dereference the buffer themselves.  If they're working across two 
different ASes, then they should have to dereference it themselves, or use a 
second function to do so.  Overloading v() is just lazy, and in my opinion 
changes the meaning of v(), which seemed quite clear to me before this.
- If linked lists need the ability as well, then it should be evaluated and 
added as necessary.  I think three types across all the types that are normally 
accessed (every different variety of NativeType and every different variety of 
CType save for UNICODE_STRING) is a tiny amount, and not a good enough reason 
to add a dummy parameter to every type in existence.  It's skirting around the 
problem, rather than designing a good way of handling it.

At the moment a call to virtual_v with no arguments would be equivalent to 
.v(), however we can require that an AS always be provided to virtual_v() if 
you'd prefer to distinguish this specialist helper function from the standard 
v() function.  If you want to alias them for that class then you will find 
people running into "differing numbers of argument" errors, and I've already 
explained why I'm against adding it to every single type in existence for the 
sake of three specialist cases.

If a developer wants to switch ASes, they currently have to overlay the object. 
 With either of our solutions they call an additional function and hand in the 
AS, so on that front there's no difference between us.

Having the ability to add ad-hoc methods to objects when they required it is 
exactly why object classes exist, and precisely why every object doesn't have a 
"get_object_type" function, or a "get_process_address_space" function.  These 
are specialist functions not required by all, and therefore best defined in the 
objects that need them.

However, since it's clear we differ on this matter, it's probably best to get 
some external input.  Anyone else, comments, suggestions, etc please?

Original comment by mike.auty@gmail.com on 16 Oct 2011 at 10:50

GoogleCodeExporter commented 8 years ago

The parse_string function has now been removed in favour of making use of the 
UNICODE_STRING implementation.  As such, the length issue on this bug is fixed, 
marking as such.

Original comment by mike.auty@gmail.com on 22 Jan 2012 at 6:35

Changed state: Fixed

goodnam / volatility

filescan plugin has an error with unicode handling #151