riverloopsec / hashashin

Hashashin: A Fuzzy Matching Tool for Binary Ninja
MIT License

make a binary ninja plugin #10

Open psifertex opened 1 year ago

psifertex commented 1 year ago

Are y'all interested in making this available as a Binary Ninja plugin? As I understand it, you can specify a requirements.txt that pulls and installs the library as a pip dependency right from the repo. That would make it more discoverable in the plugin manager.

tst-rmspeers commented 1 year ago

Yes, I think we are. Since this is a library, though, would it be OK to include it in your plugin manager? Or would you prefer that we instead add packaged tools that use this library?

If you're good with listing the library, I expect it would still be useful to people who build their own tooling on it after a plugin install. We just want to make sure the description is clear about what it is and isn't.

CCing @jprokos26 for awareness

psifertex commented 1 year ago

I could go either way. I'm mostly focused on raising awareness, so just having it in there helps. A simple UI wouldn't hurt, though, and I don't mind submitting a PR with a simple one if that would help.

jprokos26 commented 1 year ago

I've made a pretty simple plugin here: 79f075f

It requires hashashin to be installed for the Python interpreter that Binary Ninja uses (its python.interpreter setting) and offers two commands. Hashashin Feature Extraction embeds the extracted function feature map as a comment. Hashashin Signature Generation computes the features for every function to build the full Binary Signature and stores the resulting signature object in the session data, where it can be accessed with bs = bv.session_data.BinarySignature.

@psifertex Three main questions to make this a useful plugin:

  1. A major problem with the current state of the plugin is that signature generation freezes Binary Ninja completely until the calculation finishes (~15-20 seconds for busybox). Do you have any tips on how to run this analysis in the background? I'm thinking along the lines of forking and updating the view after the plugin returns, but I'm not sure whether binja allows this.
  2. The UI for this is non-existent at this point. How would you recommend displaying this information so that it is most useful to the user's workflow? For example, should I look into creating a second window whose sole purpose is to display the feature map and/or generated signature, or does it make more sense to store this information in tags/comments? Any other ideas for displaying this information would be a great help.
  3. The DB query side of this platform uses a pinned extraction version, so I cannot use the same DB across the plethora of versions (the extraction engine could change between versions). What assumptions can I make about minor version differences in Binja? I do not use HLIL at all during extraction but rely pretty heavily on MLIL. Can I expect functions to be lifted to the same disassembly and MLIL across minor versions?
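As an illustration of the pinning concern in question 3, here is a minimal sketch of keying stored signatures by the versions that produced them, so that results from different extractor or Binary Ninja builds are never mixed. `ExtractionKey` and `SignatureStore` are hypothetical names invented for this example and are not part of hashashin; the version strings would have to come from wherever the tooling records them (for the Binary Ninja side, possibly the string returned by `binaryninja.core_version()`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionKey:
    """Hypothetical key: the versions a signature was extracted with."""
    extractor_version: str  # version of the feature extraction code
    binja_version: str      # Binary Ninja build used for lifting


class SignatureStore:
    """Toy in-memory store that partitions signatures by ExtractionKey."""

    def __init__(self):
        self._rows = {}

    def put(self, key, binary_name, signature):
        # Signatures extracted under different versions live in separate buckets.
        self._rows.setdefault(key, {})[binary_name] = signature

    def query(self, key):
        # Only signatures produced with exactly the same versions are returned.
        return dict(self._rows.get(key, {}))
```

With this layout, a query made under a new Binary Ninja build simply finds nothing until the corpus is re-extracted, rather than silently comparing features computed from different IL.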

The feature map currently looks like this and is stored as a comment at the top of the function:

{'cyclomatic_complexity': 1,
  'num_instructions': 13,
  'num_strings': 1,
  'max_string_length': 14,
  'vertex_histogram': [1, 2, 0],
  'edge_histogram': [2, 0, 0, 0],
  'instruction_histogram':
 0|3|0|0|0|0|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0,
  'dominator_signature': 0x1a,
  'constants': [1, 3, 510, 55952, 80308, 80312, 553101, 637184],
  'strings': ['stack overflow']
}

Note that dominator_signature can be a very large number (largest in busybox is on the order of 5.4E+185) and constants can be quite long as well. Internally we wrap these values when pushing to disk but the object passed back to binja has the full length.
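The thread does not say how these values are wrapped on disk, so the following is only a sketch of two generic options for a Python integer that exceeds fixed-width storage: a lossless round trip through a variable-length byte string, and a lossy wrap to 64 bits. The function names are made up for this example and are not hashashin's.

```python
def int_to_blob(value: int) -> bytes:
    """Losslessly encode a non-negative int as big-endian bytes."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def blob_to_int(blob: bytes) -> int:
    """Inverse of int_to_blob."""
    return int.from_bytes(blob, "big")


def wrap_u64(value: int) -> int:
    """Lossy alternative: keep only the low 64 bits of the value."""
    return value & 0xFFFFFFFFFFFFFFFF
```

A value on the order of 5.4E+185 needs roughly 617 bits, so it fits in a 78-byte blob but loses almost all of its information if wrapped to 64 bits.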

psifertex commented 1 year ago

1) Yes, you just need to make a BackgroundTaskThread. Here's an example.

2) In terms of annotations, it depends on the goals. I would look at the BD Viewer plugin as an example of how you can present matched data and offer to, for example, port symbols or type information. In fact, the BSI project some other folks there have been working on has a pretty robust UI for similar workflows and might be worth a look. I don't believe the current implementation is available under an open source license, but I wouldn't mind lobbying for that if it helps. 😉

3) No, unfortunately you CANNOT rely at all on MLIL being stable across versions. In fact, you can't even rely on it being stable within the same version! Given new type information or other changes, such as functions being added or removed, analysis can easily change such that MLIL is not constant. Sometimes, depending on analysis races, changes can occur even without any of the above! This usually happens when analysis depends on the order in which other functions are analyzed, and while we try to stamp out sources of non-determinism like this, we cannot guarantee they do not exist.

For this reason we generally recommend either pinning on specific features or dynamically computing specific IL offsets on demand.

The feature map itself looks fine, just so long as it has the ability to handle when the ground underneath it shifts somewhat. 😬
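One way to read "handle when the ground shifts" is to compare feature maps with a similarity score instead of exact equality. The sketch below is an assumption for illustration, not hashashin's actual matching logic: it averages cosine similarity over the two graph histograms with Jaccard overlap over constants and strings, using the key names from the feature map shown earlier in the thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("histograms must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Two all-zero histograms count as identical; one-sided zero as disjoint.
        return 1.0 if norm_a == norm_b else 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def feature_similarity(f1, f2):
    """Unweighted mean of per-feature similarities, in [0, 1]."""
    scores = [
        cosine_similarity(f1["vertex_histogram"], f2["vertex_histogram"]),
        cosine_similarity(f1["edge_histogram"], f2["edge_histogram"]),
        jaccard(f1["constants"], f2["constants"]),
        jaccard(f1["strings"], f2["strings"]),
    ]
    return sum(scores) / len(scores)
```

A small change in lifting, such as one extra basic block or a constant that is no longer recovered, then lowers the score slightly instead of turning a match into a miss. The equal weighting and any acceptance threshold are arbitrary choices here and would need tuning against real data.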

Let me know if I missed anything!
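For point 1 in the reply above, a minimal sketch of moving the slow signature computation into a `BackgroundTaskThread` might look like the following. `SignatureTask` and the `compute_signature` callable are hypothetical names for this example; only `BackgroundTaskThread` comes from the Binary Ninja API. The `except ImportError` branch is a stand-in added purely so the sketch can be exercised outside Binary Ninja, and it is not how the real class is implemented.

```python
import threading

try:
    from binaryninja.plugin import BackgroundTaskThread  # real Binary Ninja class
except ImportError:
    # Stand-in for running this sketch without Binary Ninja installed.
    class BackgroundTaskThread(threading.Thread):
        def __init__(self, initial_progress_text="", can_cancel=False):
            super().__init__()
            self.progress = initial_progress_text
            self.cancelled = False


class SignatureTask(BackgroundTaskThread):
    """Run a slow signature computation without blocking the UI."""

    def __init__(self, bv, compute_signature):
        # The progress text is shown to the user while the task runs.
        super().__init__("Hashashin: generating signature...", True)
        self.bv = bv
        self.compute_signature = compute_signature  # hypothetical callable
        self.result = None

    def run(self):
        # Executes on the background thread after start() is called.
        self.result = self.compute_signature(self.bv)
        self.progress = "Hashashin: signature ready"
```

A plugin command would create `SignatureTask(bv, some_function)` and call `start()` on it, then return immediately. The result is kept on the task object here; storing it in the session data, as the plugin described earlier does, is an alternative.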