Closed arizvisa closed 11 months ago
Thanks a lot for the thorough analysis, and detailed explanation!
Indeed, ATM `py_get_numbered_type` will attempt building a `str` - which, as you mentioned, is conceptually wrong. That was obviously not a problem in Py2, since both `bytes` and `str` were the same thing. But now it is a problem.
However, I'm worried that changing that now might break other scripts that expect the returned tuple to contain strings, not bytes. Therefore, I wonder if the safest fix wouldn't be to allow Unicode replacement characters wherever there is a decoding issue.
Thoughts?
EDIT: I spoke too soon. It's only the comments that are retrieved as strings. I don't think that is conceptually wrong (do you think it is?). However, not being able to retrieve the entire type because of that is not acceptable (and I'm looking into a fix).
…but then returning the comment as a string means that `set_numbered_type` won't work anymore, since it expects a `bytes` object. Hmm. I guess it means we must change it to a `bytes` object, as you suggested.
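
To make the trade-off with replacement characters concrete, here is a minimal sketch in plain Python (no IDA required). The byte string is a made-up stand-in for a comment field, not data from the actual database. It suggests that decoding with replacement would avoid the exception but would not survive a round trip back to `bytes`:

```python
# Hypothetical comment bytes: a lone 0xDD is not valid UTF-8.
raw_cmt = b"\xddlegacy comment"

# Decoding with errors="replace" never raises; the bad byte becomes U+FFFD.
text = raw_cmt.decode("utf-8", errors="replace")

# Re-encoding produces the UTF-8 form of U+FFFD, not the original 0xDD byte.
round_tripped = text.encode("utf-8")

print(repr(text))
print(round_tripped == raw_cmt)
```

So a type read this way and written back would silently end up with different field bytes, which is one reason returning `bytes` looks like the safer option.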
Hey Arnaud,
Yeah, I also believe that changing to `bytes` would probably be the best thing. But I'm not sure if there are other places in IDAPython where the "cmt" and "fieldcmts" fields from `tinfo_t` get decoded. However, I think as long as we're _always_ passing `bytes` as input to the `tinfo_t` (via `tinfo_t.deserialize`, `set_numbered_type`, or perhaps another similar API) things should be okay.
In my own libraries, I always make sure to wrap the inputs to those two APIs (`tinfo_t.deserialize` and `set_numbered_type`) with `str.encode('utf-8')` so that they're always submitted as `bytes`. This way, when I retrieve them with `tinfo_t.serialize` or `get_numbered_type`, they can be stored in a variable as either `str` or `bytes` before being encoded to `bytes` when they're used. I've been using it like this for a while and haven't personally encountered any issues... Hopefully this is the right call.
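
For illustration, the kind of wrapper described above could look roughly like this. The names `ensure_bytes` and `normalize_components` are hypothetical helpers, not part of IDAPython, and the sample tuple is invented rather than taken from a real type:

```python
def ensure_bytes(value, encoding="utf-8"):
    """Encode str values to bytes; return anything else unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def normalize_components(components):
    """Return a tuple in which every str component has been encoded to bytes."""
    return tuple(ensure_bytes(item) for item in components)


# Invented sample mixing bytes, str, None and an int.
sample = (b"\x0d", "field_name", None, 3)
print(normalize_components(sample))
```

The normalized tuple is what would then be handed to `tinfo_t.deserialize` or `set_numbered_type`. Non-string components pass through untouched, so the helper does not need to know the exact layout of the tuple.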
Agreed. I'll also fix `get_named_type` btw, which suffers from a similar affliction.
@arizvisa fixed with https://github.com/idapython/src/commit/6142005952c489e684ad3aa8870b55b73baac90a
(Please contact us on support if you want a new build!)
So, I encountered a strange issue with the local types API. I'm not doing anything too special other than using the `tinfo_t` class, which seems to require serialization/deserialization for some parts of it. The `tinfo_t.serialize` method facilitates this by returning a tuple containing the components that can then be used with `tinfo_t.deserialize` and specifically the `ida_typeinf.get_numbered_type` functions.

If I'm recalling correctly, during the Py2<->Py3 transition, the `tinfo_t.serialize` method seemed to have changed, which required you to specially handle parts of the tuple that is returned. Specifically, the "cmt" fields were being returned as strings (`str`) whereas all of the APIs that use them expect them to be `bytes`. I've been working around this via a simple type-check and then encoding it when a `str` is received.

However, one of the types residing in an old database that I have seems to have gotten a non-utf8-encoded string assigned as one of these fields. This causes an issue with the `ida_typeinf.get_numbered_type` function in IDAPython, as it immediately raises an instance of the `UnicodeDecodeError` exception due to being unable to properly decode its fields.

I'm not sure of the original cause (and it's likely user-error), but it's an easy thing to fix within a database. Still, I noticed that `ida_typeinf.get_numbered_type` becomes completely worthless for such a type as a result of this. Its significance is increased as `ida_typeinf.get_numbered_type` is the only way to get a local type via its ordinal number. So, if somebody shoots themselves in the foot (on purpose or by accident) and stashes a non-utf8 string as one of the type's fields, the type will have to be completely removed and re-created in order to access it via this IDAPython API.

Anyways, the `ida_typeinf.get_numbered_type` function is bound to the following code. This function, `py_get_numbered_type`, does an implicit UTF-8 string conversion as a result of the line at 603 where "ssi" is used as the format. According to the docs at https://docs.python.org/3/c-api/arg.html#strings-and-buffers, `bytes` should be using "y" as its format. Unfortunately, there's also a Py2-compatibility issue here, since the "y" format does not exist in Py2's flavor of `Py_BuildValue`, which is worth consideration.

https://github.com/idapython/src/blob/e1c108a7df4b5d80d14d8b0c14ae73b924bff6f4/pywraps/py_typeinf.hpp#L600-L611
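
As a rough Python-level analogy (not the actual C code), the difference between the two format characters can be sketched like this. The "s" format has to decode the C string as UTF-8 to build a `str`, whereas "y" builds a `bytes` object without decoding. The field value below is invented for the example:

```python
# Made-up stand-in for a field that starts with a non-UTF-8 byte.
raw_field = b"\xddexample"

# Analogy for the "s" format: strict UTF-8 decoding, which fails on this input.
try:
    raw_field.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    failure = exc  # keep a reference; `exc` is cleared after the except block

# Analogy for the "y" format: the bytes are handed back as they are.
as_bytes = bytes(raw_field)

print(decoded_ok)
print(as_bytes == raw_field)
```

With a single undecodable field, the strict path fails for the whole tuple, which matches the symptom of not being able to retrieve the type at all.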
For the sake of completeness, here's what the components of that local type actually look like. That "`\xdd`" byte at the beginning of the fourth element of the following tuple is causing the UTF-8 decoding issue.

I get that you guys have a support@, but since this is actually related to the swig bindings and is not really critical (since you can just delete and re-create the type entirely), I figured I'd post it here.