cmd/pi-strings-go: add Go version of pi-strings

lapsang-boys / pippi

A modular, extensible and collaborative reverse engineering ecosystem

https://pippi.re

BSD 2-Clause "Simplified" License


cmd/pi-strings-go: add Go version of pi-strings #18

Closed mewmew closed 5 years ago

mewmew commented 5 years ago

The intention with pi-strings-go is not to replace the Rust version of pi-strings but to provide multiple implementations of the same Protobuf API, such that different implementations may be evaluated and compared for performance, feature sets, etc.

Once we have a large enough set of non-trivial components, it will be fun to compare Rust and Go in terms of performance for more intricate reversing sessions, where the garbage collector may kick into high gear.

~Note: this PR depends on #17 (and #20 for test cases).~

mewmew commented 5 years ago

Rebased and ready for review.

Do we have a better way to handle dependent PRs?

karlek commented 5 years ago

Rebase after https://github.com/lapsang-boys/pippi/pull/17 and then I'll review

mewmew commented 5 years ago

> Rebase after #17 and then I'll review

done

Edit: officially done as per caffd2e943491c48e5dc911ca1589aec0a794da8

mewmew commented 5 years ago

This PR is now functionally complete.

However, its performance is really poor after switching to x/text/encoding. We should check whether there are any easy wins to gain back this performance.

Edit: to make this explicit, the performance for anything but toy inputs is poor to the point of being unusable.

mewmew commented 5 years ago

Now this PR is feature complete and also performant enough to be used for larger programs.

The pi-strings-go command currently implements support for the following string encodings:

- UTF-8
- UTF-16 (little and big endian, with and without BOM)
- UTF-32 (little and big endian, with and without BOM)
- And Big Hero 5 :)
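To illustrate what scanning for one of these encodings involves, here is a minimal sketch of UTF-16 little-endian string extraction using only the Go standard library. It is not the code from this PR: the function name `extractUTF16LE`, the `minLen` parameter, the even-offset alignment, and the choice to end a run at surrogates are all assumptions made for illustration, and BOM handling is left out.

```go
package main

import (
	"encoding/binary"
	"unicode"
	"unicode/utf16"
)

// extractUTF16LE is a hypothetical helper (not part of pi-strings-go).
// It scans data two bytes at a time from offset 0, decodes each pair as a
// little-endian UTF-16 code unit, and returns every run of at least minLen
// printable, non-surrogate code units as a Go string.
//
// Simplifications (assumptions of this sketch): strings are assumed to start
// at even offsets, surrogate pairs terminate a run, and a BOM is not treated
// specially.
func extractUTF16LE(data []byte, minLen int) []string {
	var out []string
	var run []uint16

	// flush emits the current run if it is long enough, then resets it.
	flush := func() {
		if len(run) >= minLen {
			out = append(out, string(utf16.Decode(run)))
		}
		run = run[:0]
	}

	for i := 0; i+1 < len(data); i += 2 {
		u := binary.LittleEndian.Uint16(data[i:])
		r := rune(u)
		if !utf16.IsSurrogate(r) && unicode.IsPrint(r) {
			run = append(run, u)
		} else {
			flush()
		}
	}
	flush()
	return out
}
```

Calling `extractUTF16LE(data, 4)` on the bytes of "hello" encoded as UTF-16LE, followed by a NUL code unit and a shorter run, would yield only "hello". Because many arbitrary 16-bit values count as printable, a scanner like this tends to report binary data as text, which is where some form of filtering or ranking becomes useful.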

@karlek, review at your own leisure :)

karlek commented 5 years ago

It would be interesting to add a probability of being a string based on where the string was found. If it was found inside an image, it probably isn't a string (unless it's inside a malware program, then anything can be anything).

mewmew commented 5 years ago

> Would be interesting to add a probability of being a string based on where the string was found. If it was found inside an image, it probably isn't a string (unless it's malware, then it can be anything)

Definitely! I think we can do a lot with probability, e.g. for longer strings, histogram of characters, n-grams and then filter/hide strings that are most likely binary data, and highlight the "real" strings.

Edit: this could also be based on language detection. If Pippi finds a lot of German words in a given binary, it could assume German and update the character frequencies of the histograms accordingly.
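As a rough sketch of the character-histogram idea, the following Go code scores a candidate string by the fraction of its runes that are letters, digits, or spaces, and sorts candidates so the most text-like come first. The names `textScore` and `rankStrings` and the scoring rule itself are assumptions for illustration only; n-grams or per-language frequency tables would be a natural refinement.

```go
package main

import (
	"sort"
	"unicode"
)

// textScore is a hypothetical heuristic (not part of Pippi). It returns the
// fraction of runes in s that are letters, digits, or spaces, a value in
// [0, 1]. The empty string scores 0.
func textScore(s string) float64 {
	total, textLike := 0, 0
	for _, r := range s {
		total++
		if unicode.IsLetter(r) || unicode.IsDigit(r) || r == ' ' {
			textLike++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(textLike) / float64(total)
}

// rankStrings returns a copy of candidates sorted by descending textScore,
// so likely "real" strings come before likely binary data. The sort is
// stable, so candidates with equal scores keep their original order.
func rankStrings(candidates []string) []string {
	ranked := append([]string(nil), candidates...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return textScore(ranked[i]) > textScore(ranked[j])
	})
	return ranked
}
```

With this scoring, a candidate such as "open file" would rank ahead of "@#$%". A threshold on the score could be used to hide unlikely strings instead of only reordering them.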

karlek commented 5 years ago

> Would be interesting to add a probability of being a string based on where the string was found. If it was found inside an image, it probably isn't a string (unless it's malware, then it can be anything)

> Definitely! I think we can do a lot with probability, e.g. for longer strings, histogram of characters, n-grams and then filter/hide strings that are most likely binary data, and highlight the "real" strings.

Exactly! That would improve the user experience by sorting the results more cleverly. I'd love that!

mewmew commented 5 years ago

Updated #19 based on this PR.