ghostwords / chameleon-crawler

Browser automation for Chameleon.
Mozilla Public License 2.0

Store fingerprinting scripts #7

Open cooperq opened 9 years ago

cooperq commented 9 years ago

It would be really great to be able to store a copy of all the scripts identified as fingerprinting scripts. That way we could see if any scripts are commonly being used by different attackers. This could also help us come up with heuristics if people are using similar tactics across the board.

ghostwords commented 9 years ago

Sorry for the late reply. Could you elaborate on "if any scripts are commonly being used by different attackers" a bit? Do you see us parsing script contents somehow?

cooperq commented 9 years ago

I mean, just a SHA sum would do the trick. I think it's also worth reverse engineering any popular scripts to think about how we can build heuristics to detect them.

ghostwords commented 9 years ago

Absolutely!

Hashing: Ah, cool, that would help us in cases where the same script goes by different filenames or is used by different domains. Perhaps we could also strip comments/whitespace when hashing to allow for trivial differences.
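
A minimal sketch of the comment/whitespace-stripping hash idea, for illustration only. The function names `normalize_script` and `script_fingerprint` are hypothetical, SHA-256 is an assumed choice of "sha sum", and the regex-based comment removal is deliberately naive: it does not parse JavaScript, so a `//` inside a string literal (such as a URL) is treated as the start of a comment.

```python
import hashlib
import re

# Naive patterns (assumption: good enough for grouping, not for parsing).
# They do not understand string or regex literals in JavaScript.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")
_WHITESPACE = re.compile(r"\s+")


def normalize_script(source: str) -> str:
    """Drop comments, then drop all whitespace, before hashing."""
    source = _BLOCK_COMMENT.sub("", source)
    source = _LINE_COMMENT.sub("", source)
    return _WHITESPACE.sub("", source)


def script_fingerprint(source: str) -> str:
    """Return the SHA-256 hex digest of the normalized script text."""
    normalized = normalize_script(source)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "/* v1 */\nvar x = 1; // set x\n"
    b = "var x=1;"
    # Both normalize to the same text, so the digests match.
    print(script_fingerprint(a) == script_fingerprint(b))
```

Removing all whitespace, rather than collapsing it to a single space, means `x = 1` and `x=1` hash identically; the cost is that the normalized text is no longer valid JavaScript, which is fine when it is only used as hash input.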

cooperq commented 9 years ago

I think stripping comments and whitespace is a great idea. This at least lets us discover if there are standard FP scripts floating around, which I suspect there are. Many people were using the same script for canvas-based FP.

gunesacar commented 9 years ago

In addition to detecting common scripts, this could be very useful for post-crawl analysis. While going through the crawl results, we had many cases where suspicious scripts had been changed, taken offline, or were simply missing from the pages after they had been found to be present.

Also, I think simhash and MOSS can be very useful for finding near-duplicate scripts. In addition to differing in comments and whitespace, scripts may include unique identifiers, timestamps or different endpoint URLs. As long as the scripts have very similar content, simhash would give the same or a nearly identical digest (one that differs in only a few bits), and MOSS would give a very high similarity score.
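
To make the simhash suggestion concrete, here is a rough sketch of a 64-bit simhash over token shingles, with Hamming distance as the similarity measure. Everything in it is an assumption chosen for illustration: the tokenizer regex, the 3-token shingle width, SHA-256 as the per-feature hash, and the function names. It is not taken from any particular simhash library, and MOSS is not covered here.

```python
import hashlib
import re
from collections import Counter


def _features(source: str, width: int = 3) -> Counter:
    """Count overlapping token shingles (width is an assumed parameter)."""
    tokens = re.findall(r"\w+|[^\w\s]", source)
    if len(tokens) < width:
        return Counter([" ".join(tokens)])
    return Counter(
        " ".join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    )


def simhash(source: str, bits: int = 64) -> int:
    """Each feature votes +weight/-weight per bit; keep the sign per bit."""
    totals = [0] * bits
    for feature, weight in _features(source).items():
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two digests differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    s1 = "var id = 'abc123'; send('https://a.example/fp', canvas.toDataURL());"
    s2 = "var id = 'zzz999'; send('https://b.example/fp', canvas.toDataURL());"
    # A small distance suggests near-duplicate content.
    print(hamming_distance(simhash(s1), simhash(s2)))
```

Scripts that differ only in an identifier, timestamp or endpoint URL share most of their shingles, so their digests tend to differ in few bits; a distance threshold for "near-duplicate" would have to be tuned on real crawl data.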

cooperq commented 9 years ago

Great ideas, @gunesacar!

ghostwords commented 9 years ago

Being able to access response bodies through the WebRequest API in Chrome will make this much easier to implement.