py_get_numbered_type: badly-formed utf-8 being stored within a type can result in a UnicodeDecodeError on Py3.

arizvisa commented 11 months ago

So, I encountered a strange issue with the local types api. I'm not doing anything too special other than using the tinfo_t class which seems to require serialization/deserialization with some parts of it. The tinfo_t.serialize method facilitates this by returning a tuple containing the components that can then be used with tinfo_t.deserialize and specifically the ida_typeinf.get_numbered_type functions.

If I'm recalling correctly, during the Py2<->Py3 transition, the tinfo_t.serialize method seemed to have changed which required you to specially handle parts of the tuple that is returned. Specifically, the "cmt" fields were being returned as strings (str) whereas all of the APIs that use them expect them to be bytes. I've been working around this via a simple type-check and then encoding it when a str is received.

However, one of the types residing in an old database that I have seems to have gotten a non-utf8-encoded string assigned as one of these fields. This causes an issue with the ida_typeinf.get_numbered_type function in IDAPython as it immediately raises an instance of the UnicodeDecodeError exception due to being unable to properly decode its fields.

I'm not sure of the original cause (and it's likely user-error), but it's an easy thing to fix within a database. Still, I noticed that the ida_typeinf.get_numbered_type has become completely worthless as a result of this. It's significance is increased as ida_typeinf.get_numbered_type is the only way to get a local type via its ordinal number. So, if somebody shoots themselves in the foot (on purpose or by accident) and stashes a non-utf8 string as one of the type's fields, the type will have to be completely removed and re-created in order to access via this IDAPython API.

Anyways, the ida_typeinf.get_numbered_type function is bound to the following code. This function, py_get_numbered_type, does an implicit string conversion to UTF-8 as a result of the line at 603 where "ssi" is used as the format. According to the docs at https://docs.python.org/3/c-api/arg.html#strings-and-buffers, bytes should be using "y" as its format. Unfortunately, there's also a Py2-compatibility issue here since the "y" format does not exist in Py2's flavor of Py_BuildValue which is worth consideration.

https://github.com/idapython/src/blob/e1c108a7df4b5d80d14d8b0c14ae73b924bff6f4/pywraps/py_typeinf.hpp#L600-L611

For the sake of completion, here's what the components of that local type actually look like. That "\xdd" byte at the beginning of the fourth element of the following tuple is causing the UTF-8 decoding issue.

(b'\r1\x07\x07\x07\x07\x07\x07', b'\nv_flags_0\rv_subIndex_4\x15v_previousSubIndex_8\x12v_currentlyUsed_c\x0ev_subIndex_10\x0fv_subOffset_14', None, b"\xdd\x01[value&00000100] means that it uses gv_wmmTableSpace_1b3d0\n[value&00000010] makes the top 8 bits of this value a multiplier\n[value&FF000000] seems to be the chunk size of the arena?\n[value&00000001] means that it's inuse\x01\x01\x011[note] this index seems to be the external index", 0x0)

I get that you guys have a support@, but since this is actually related to the swig bindings and is not really critical (since you can just delete and re-create the type entirely). Figured I'd post it here.

aundro commented 11 months ago

Thanks a lot for the thorough analysis, and detailed explanation! Indeed, ATM py_get_numbered_type will attempt building a str - which, as you mentioned, is conceptually wrong. That was obviously not a problem in Py2, since both bytes and str were the same thing. But now it is a problem.

However, I'm worried that changing that now, might break other scripts that expect that the returned tuple contains strings, not bytes. Therefore, I wonder if the safest fix wouldn't be to allow Unicode replacement characters wherever there is a decoding issue.

Thoughts?

EDIT: I spoke too soon. It's only the comments that are retrieved as strings. I don't think that is conceptually wrong (do you think it is?) However, not being able to retrieve the entire type because of that, is not acceptable (and I'm looking into a fix)

aundro commented 11 months ago

…but then returning the comment as a string, means that set_numbered_type won't work anymore since it expects a bytes object. Hmm. I guess it means we must change it to a bytes object as you suggested.

arizvisa commented 11 months ago

Hey Arnaud,

Yeah, I also believe that maybe changing to bytes would be the best thing. But, I'm not sure if there's other places in IDAPython where the "cmt" and "fieldcmts" fields from tinfo_t gets decoded by IDAPython. However, I think as long as we're _always_ passing bytes as input to the tinfo_t (via tinfo_t.deserialize, set_numbered_type, or perhaps another similar API) things should be okay.

In my own libraries, I always make sure to wrap those two APIs (tinfo_t.deserialize and set_numbered_type) with str.encode('utf-8') so that they're always submitted as bytes. This way when I retrieve them withtinfo_t.serialize or get_numbered_type they can be stored in a variable as either str or bytes before being encoded to bytes when using it. I've been using it like this for a while and haven't personally encountered any issues... Hopefully this is the right call.

aundro commented 11 months ago

Agreed. I'll also fix get_named_type btw, which suffers from a similar affliction.

aundro commented 11 months ago

@arizvisa fixed with https://github.com/idapython/src/commit/6142005952c489e684ad3aa8870b55b73baac90a

(Please contact us on support if you want a new build!)

idapython / src

py_get_numbered_type: badly-formed utf-8 being stored within a type can result in a UnicodeDecodeError on Py3. #57