Closed toxyl closed 2 years ago
Perceptual hashes, in the sense of the typical image algorithms, can't be used directly. But a hash function where similar inputs yield similar outputs (a locality-sensitive hash) should be doable for text. You could also look into inverted indexes, which are often used for full-text search.
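To make "similar inputs yield similar outputs" concrete, here is a minimal SimHash sketch in Python. It is one possible technique, not something oSSH currently does: the function names, the 4-character shingle size, and the 64-bit width are all assumptions picked for illustration.

```python
import hashlib


def simhash(text, bits=64, n=4):
    """Return a SimHash fingerprint of `text` (hypothetical helper).

    Each character n-gram votes +1 or -1 on every bit position, so texts
    that share most of their n-grams tend to end up with fingerprints
    that differ in only a few bits.
    """
    counts = [0] * bits
    grams = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two payloads would then be compared with `hamming(simhash(p1), simhash(p2))`: a small distance suggests near-duplicates, while unrelated texts tend to land around half the bit width. Any cut-off would have to be tuned on real captures.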
What is the current goal behind the identity hashes? Why do we want to know if attacks are similar?
Currently the goal is collection, but I already see that my instances collect a lot that is very similar. That makes analysis somewhat annoying, because you keep viewing files with roughly the same content and can't tell before you open them. Long-term, I could imagine the collections being a useful data source to help in defense. I've seen payloads similar to what I see on oSSH, but in HTTP requests, so one could, e.g., hash each parameter value and compare it with the hashes of known payloads to filter them out. With the current SHA1 hashing approach one can detect exact matches, but a slight variation (like a different password in the HiveOS payload) would easily slip through. Basically, think EOKM here ;)
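As a rough illustration of the HTTP idea, the sketch below compares each query parameter value against known payloads using Jaccard similarity over character shingles instead of exact SHA1 matches. Everything in it is an assumption for illustration: `suspicious_params`, the shingle size of 4, and the 0.7 threshold are hypothetical and not part of oSSH.

```python
from urllib.parse import parse_qsl, urlsplit


def shingles(text, n=4):
    """Set of overlapping character n-grams (the whole text if it is short)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suspicious_params(url, known_payloads, threshold=0.7):
    """Return (name, score) for query parameters that resemble a known payload.

    `threshold` is an arbitrary starting point, not a tuned value.
    """
    known = [shingles(p) for p in known_payloads]
    hits = []
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        value_shingles = shingles(value)
        best = max((jaccard(value_shingles, k) for k in known), default=0.0)
        if best >= threshold:
            hits.append((name, best))
    return hits
```

A parameter whose value is a known payload with only the password changed shares almost all of its shingles with that payload, so it scores close to 1.0, whereas an exact hash comparison would miss it. Comparing against every known payload is linear in the size of the collection, which is where an index such as an LSH structure would come in.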
Right, so it is mostly about grouping things together. Perhaps something in here can help: https://en.wikipedia.org/wiki/Content_similarity_detection
Starting with the wiki article I ended up with LSH Forest as an interesting candidate. I'll test it one of these days.
I've captured, e.g., a bunch of payloads that attack HiveOS where the only difference between payloads is the password used, but with SHA1 hashes I can't easily identify them without scanning each payload that I have. Would it be possible to use perceptual hashes instead to identify payloads? That might help to categorize payloads and detect relations between them.