mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Browser attribute fingerprinting analysis [WIP] #78

Closed: 14Richa closed this 4 years ago

14Richa commented 5 years ago
  1. Jupyter notebook doing the analysis
  2. Notes files to keep a list of threads and questions to follow
14Richa commented 5 years ago

Analysis in Pandas is the main file. BrowserAttributeFingerprinting.md contains my notes from the literature survey. Overview And Notes.md contains an overview and my understanding of the problem and dataset.

14Richa commented 5 years ago

Hey Sarah, thanks for your suggestions; I am adding changes as I go. I thought a few points were not clear in my analysis, so I have added explanations for them here.

* Excellent introduction and write-ups

Thanks!

  • It is not sufficient to run this on just the sample file, please run on the 10% sample - now you have honed your analysis, this should be straightforward - only read in the columns you need and it should go pretty quickly. Done.
  • You use df_plugins['script_url'].value_counts() to determine "which script is being used the most" - think about what you're actually counting here and how / why it might be biased. I am not actually counting which script is being used the most; I am interested in knowing which scripts call navigator.plugins and navigator.mimeTypes the most. Therefore I am doing this on an already reduced dataset (df_plugins, which contains only those rows with calls to the above-mentioned symbols). The hypothesis is that this would highlight the scripts which are abusing these symbols to gather information about multiple plugins etc. (See the sketch after this list.)
  • You end up stumped on the question that metrika js appears to be only looking at the flash plugin based on res_df['symbol'].value_counts(), but you've restricted your data to only be about plugins and mimeTypes - is that what you wanted? Yes, my reasoning is something like this: shortlist the scripts which query information on plugins, find the scripts which use this query a lot, and then see which plugins these scripts query. metrika.js is the top user of navigator.plugins and navigator.mimeTypes, but it only queries about flash players and not other kinds of plugins. That is why I am stuck.
  • You chose not to use 'Cwm fjordbank glyphs....' as a heuristic for finding incidences of fingerprintjs because it is a pangram. What are other uses for pangrams in javascript, and how would they show up in a dataset like this? Can you show me a script that is using 'Cwm fjordbank ....' but is not fingerprintjs or a slight modification of it? Yes, what I am trying to do here is to find all instances of fingerprint.js or fingerprint2.js. There are other scripts as well which use 'Cwm fjordbank glyphs....', but I do not want to focus on them for the current analysis. I just want to look at what fingerprint.js/fingerprint2.js is doing. As I said in the notebook: "Here I am interested in looking at all calls of fingerprint2.js. I want to understand what arguments and values are associated with calls to fingerprint2.js. Can I infer a pattern with such calls and filter the calls to fingerprint2.js without explicitly looking for it?" I agree that we can find more scripts with the pangram, but I am interested only in fingerprint.js and fingerprint2.js.
  • You say "almost all calls are made around same time and it is querying a bunch of attributes to produce a hash" - what is the distribution of timestamps and how "close" are the timestamps you're referring to relative to the general distribution of timestamps. Interesting point, let me think more on this that how can I see the general distribution of timestamps. Do you know any visualization tools for this? I want something of a clustering but in time-space.
  • You say "I want to test if I just filter on rare symbols can I catch fingerprint.js calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts. As expected sessionStorage is pretty common followed by ShockWaveLength. The count reduces a lot for FingerPrint, doNotTrack and FuturesplashSuffixes." I don't feel you've made your case well, if at all. A statistical justification would certainly be possible. But more simply than that show me a bar chart (or something) with the average population prevalence compared to the fingerprint prevalence. (Then my follow-up question will be how does that compare to hs-analytics or akam.) I am littel confused here. My idea was simply that less common symbols would be called by fingerprinting scripts (assuming fingerprinting scripts are very less in number compared to clean scripts). I think plotting a graph of calls to FuturesplashSuffixes in general population vs reduced dataset (containing only fingerprinting scripts) can help checking this point.
  • "Some of these like cloudfront.net are CDNs and can be overlooked." Why can they be overlooked? These are content delivery networks, host to many files. Can't directly be blamed for serving fingerprinting files
  • Good work on the metrika detection. Thanks!
  • I don't see why you need both the "domains" and the "base_url" - as you work with dask you'll want to keep this processing to a minimum - pick one - probably doesn't matter too much which for now. Sure, will do.
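
As a concrete sketch of the filtering and counting above (the parquet path is a placeholder, and counting distinct pages rather than raw rows is one way to address the bias Sarah raised):

```python
import dask.dataframe as dd

# Placeholder path; only read the columns needed for this question.
df = dd.read_parquet('path/to/10pct_sample.parquet',
                     columns=['script_url', 'symbol', 'location'])

# Keep only rows where a script touched the plugin/mimeType enumeration APIs.
df_plugins = df[df.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]

# Raw row counts: biased toward scripts that poll the same symbols
# repeatedly, or that are embedded on many pages.
print(df_plugins['script_url'].value_counts().head(20))

# Less biased alternative: count distinct pages per script instead of raw calls.
per_page = (df_plugins[['script_url', 'location']]
            .drop_duplicates()
            .script_url.value_counts())
print(per_page.head(20))
```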
birdsarah commented 5 years ago

(I'm replying one at a time as I'm on my phone)

You use df_plugins['script_url'].value_counts() to determine "which script is being used the most" - think about what you're actually counting here and how / why it might be biased.

I am not actually counting which script is being used the most

Your words said "the most," so that's what I read. Definitely focus on being specific.

I am interested in knowing which scripts call navigator.plugins and navigator.mimeTypes the most. Therefore I am doing this on an already reduced dataset (df_plugins, which contains only those rows with calls to the above-mentioned symbols). The hypothesis is that this would highlight the scripts which are abusing these symbols to gather information about multiple plugins etc.

What I want you to think about is this: you are using the number of rows to make inferences. What does that mean in this dataset? What do the rows represent? And what can you infer if that's what you choose to count?

birdsarah commented 5 years ago

You chose not to use 'Cwm fjordbank glyphs....' as a heuristic for finding incidences of fingerprintjs because it is a pangram. What are other uses for pangrams in javascript, and how would they show up in a dataset like this? Can you show me a script that is using 'Cwm fjordbank ....' but is not fingerprintjs or a slight modification of it?

Yes, what I am trying to do here is to find all instances of fingerprint.js or fingerprint2.js. There are other scripts as well which use 'Cwm fjordbank glyphs....', but I do not want to focus on them for the current analysis.

It's not a huge deal for your analysis, but to be clear about the point I'm trying to make: the goal was to find instances of this library. Which is more likely: that a developer has kept the name fingerprint.js, or that they have kept the methodology of using "Cwm fjordbank ...."? In fairness, I never provided evidence that the "Cwm fjordbank...." lookup is superior, but similarly you haven't demonstrated that all instances of scripts named "fingerprint.js" are the correct library.

I don't particularly want you to change anything but to think critically about the choices you are making.

birdsarah commented 5 years ago

Do you know any visualization tools for this? I want something like clustering, but in time-space.

Nothing springs to mind; histogram-type things should be good enough for thinking about distributions.
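
For example, something as simple as this (a minimal sketch; the `time_stamp` column name and pulling the filtered frame into pandas are assumptions about your setup):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes call times live in a 'time_stamp' column and that the filtered
# frame is small enough to compute() into pandas.
times = pd.to_datetime(df_plugins['time_stamp'].compute())

# Bucket calls per minute; tight clusters show up as narrow spikes.
times.dt.floor('min').value_counts().sort_index().plot()
plt.xlabel('time of call')
plt.ylabel('calls per minute')
plt.show()
```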

birdsarah commented 5 years ago

You say "I want to test if I just filter on rare symbols can I catch fingerprint.js calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts. As expected sessionStorage is pretty common followed by ShockWaveLength. The count reduces a lot for FingerPrint, doNotTrack and FuturesplashSuffixes." I don't feel you've made your case well, if at all. A statistical justification would certainly be possible. But more simply than that show me a bar chart (or something) with the average population prevalence compared to the fingerprint prevalence. (Then my follow-up question will be how does that compare to hs-analytics or akam.)

I am a little confused here. My idea was simply that less common symbols would be called by fingerprinting scripts (assuming fingerprinting scripts are far fewer in number than clean scripts).

There are quite a few assumptions in these ideas and you haven't made a case for any of them. Let's unpack them:

(1) "less common symbols would be called by fingerprinting scripts" - I don't believe that to be true, but you certainly could present evidence to make that point and that would be interesting to see

(2) "assuming fingerprinting scripts are very less in number compared to clean scripts" (a) first going back to my earlier point make sure you're clear on what you're counting and whether it helps you find you the information you want (b) I really don't see how the commonness of fingerprinting scripts relates to the the relative frequency of symbol calls by those scripts (c) how are you going to separate clean scripts from fingerprinting scripts to answer this question. Is everything that is not a fingerprinting script "clean"? What about all the scripts we haven't identified yet.

birdsarah commented 5 years ago

These are content delivery networks hosting many files; they can't directly be blamed for serving fingerprinting files.

You need to make this justification in your writing, not to me.

Just as a thought experiment: If you are going to take the position that CDNs cannot be "blamed" for fingerprinting scripts then all fingerprinters would just move their content to a CDN. What should we do in that case if we want to stop fingerprinting?

birdsarah commented 5 years ago

I think my comments are written more negatively than I intend, because I don't intend them negatively at all. There's a LOT to dig into here and you're well on your way.

In particular, I have deleted "You are missing my point." - that is not helpful language to use on my part and I apologize.

14Richa commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

birdsarah commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

Please remove obsolete files. If doing this with git isn't familiar to you, don't hesitate to ask.

14Richa commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

Please remove obsolete files. If doing this with git isn't familiar to you, don't hesitate to ask.

I was wondering if it should be removed totally. Isn't it a good idea to have the analysis in pandas as well, for someone to use in case they have memory/system constraints? Though I agree that the two notebooks will go out of sync very soon and it will be a hassle to keep updating both of them.

birdsarah commented 5 years ago

Isn't it a good idea to have the analysis in pandas as well, for someone to use in case they have memory/system constraints?

I would say no. The hello_world.ipynb already shows loading data in dask vs pandas, and there's no case for duplicate analysis. Analysis of one file with pandas isn't meaningful; it was just a useful stepping stone for you getting to where you are. Also, you're not totally removing it: it will always be in the commit history.
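
For reference, the swap really is minimal (a sketch; the parquet path is a placeholder):

```python
import pandas as pd
import dask.dataframe as dd

PATH = 'path/to/sample.parquet'  # placeholder

# pandas: loads everything into memory at once -- fine for a single file.
df_pd = pd.read_parquet(PATH, columns=['script_url', 'symbol'])

# dask: lazy and partitioned -- scales to the 10% sample and beyond.
df_dd = dd.read_parquet(PATH, columns=['script_url', 'symbol'])
print(df_dd.script_url.nunique().compute())
```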

14Richa commented 5 years ago

Notes as I go:

* Don't leave the printout of hundreds or thousands of rows in your notebook; it hinders comprehension. You will definitely look at this content while exploring, but clean it up before review.

* `len(df.script_url.unique())` -> `df.script_url.nunique()`

* `df['location_domain'] = df.location.apply(extract_domain)` -> `df['location_domain'] = df.location.apply(extract_domain, meta='O')` ('O' is object which is all we have available for strings)

* ` df[df.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]` nice

* "These days some browsers don't return an array of plugins directly, except the most common plugins such as Shockwave flash, Java, etc." citation please

Addressed the above points; a minimal sketch of the `meta='O'` point is below.
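
For reference (the body of `extract_domain` here is an illustrative stand-in for the notebook's helper; the `meta='O'` usage is exactly as suggested above):

```python
from urllib.parse import urlparse

def extract_domain(url):
    # Guard against missing values, which appear as floats (NaN).
    if not isinstance(url, str):
        return ''
    return urlparse(url).netloc

# Without meta, dask infers the output dtype by running the function on
# dummy data; 'O' (object) is the only dtype pandas offers for strings.
df['location_domain'] = df.location.apply(extract_domain, meta='O')
```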

  • "That is all queries to window.navigator.plugins[Shockwave Flash].description resulted in Shockwave Flash 28.0 r0. This is strange." Why is it strange? This data was collected in a crawl. That is identical machines were setup to crawl the web and their profiles were reset between every visit to a website. "There seems to be a bias in the dataset." Agreed. "Strange but on a brighter side difficult to fingerprint :)" Unfortunately you can't make this inference because this is not a population sample of the variation of plugins. Agreed. Thanks for pointing out the flaw in the reasoning.

  • "Memory usage for df_plugins is less. I can take all of this in pd dataframe and use pivots to analyze." Good thinking. Dask does have a pivot option. But converting to pandas when you can definitely makes things nicer.

  • I was about to write: "I don't think you needed a pivot table. I think a groupby would have got you there: df_plugins_pd.groupby(['location', 'script_url', 'symbol']).count()" but that is wrong. I see what you've done, and I see that you were getting the number of unique symbols. Perhaps at some point we can brainstorm how to make this a bit cleaner and more obvious.

  • In your analysis 2 you find 0 hs-analytics. Earlier you noted that hs-analytics is a fingerprinting script; what do you think is going on? Addressed the issue: hs-analytics is a fingerprinting script but doesn't use plugin information. This also gives me an idea to include other symbols on top of plugin information when flagging scripts.

  • Avoid hardcoding numbers like "There are 166862 unique script_urls in the dataset. From this we have identified 790 (725+53+12) unique URLs which definitely host fingerprinting scripts and another 888 potential urls worth checking out." You could rewrite this as a code cell: f'There are {len(unique_scripts):,} unique script_urls in the dataset. From this we have identified {sum(n_scripts)} unique URLs which definitely host fingerprinting scripts and another {n_new} potential urls worth checking out.' While it might seem counter to other things I'm arguing for, the oddness of duplicated text is outweighed by the robustness of not transcribing numbers, and the re-usability for running this against a future dataset. Addressed.

  • "So this script always asks for same 10 symbols which we can see below." Only because you've restricted your starting point to df_plugins which is the subset of scripts that calls plugins. Maybe that's what you're interested in but this statement is misleading.

  • "metrika/watch.s can be used for browser plugin fingerprinting." This is true, but I don't think you've really shown it. There is much more evidence in the dataset for you to make this claim much more convincing. Why not just look at all the symbols metrika is getting? Included more analysis around the symbols metrika is getting. This goes back to the point mentioned for hs-analytics, I should check more symbols which can be used for browser fingerprinting.

  • "I have found that the above string is a panagram and can be used in other fingprinting scripts" - I clearly still haven't explained this well enough. Let me try again. fingerprintjs2 is not just a browser attribute fingerprinting script. It also does canvas fingerprinting. It's characteristic canvas fingerprinting feature is the call to "Cwm fjordbank....". This enables you find as many of the fingerprintjs2 / fingerprintjs2-like scripts and then examine them for the browser attribute fingerprinting within them Aah, now I get what you meant here. Working on it.

  • "Therefore I have looked for "fingerprint" in the URL column of the dataset." As I already mentioned, you haven't provided evidence that this is actually capturing the fingerprintjs2 library and not just other scripts with the word fingerprint in the url which may or may not be what you want I agree to the point of catching false scripts here, though I feel that likelihood is less. Should check though.

  • The reason I care about this is because I believe you're cutting yourself off from data: df[df.script_url.str.contains('fingerprint', case=False)].script_url.nunique().compute() returns 78 scripts, while df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz')].script_url.nunique().compute() returns 505 scripts. The intersection between the lists is 36. If you look at the remaining 42, I can see a number that fall into this category - although I am happy to concede there are many that do look, on quick inspection, like what we were looking for but were not picked up by "Cwm fjordbank....". That said, of the 505 scripts detected by "Cwm fjordbank", 499 have plugin calls. If you only wanted to take the 499 that have plugin calls as an indicator of browser attribute fingerprinting, that seems fairly reasonable. You'll have missed a few that the "fingerprint" approach got, but you're still up 400 examples with, I believe, fewer false positives. I see your point clearly now, thanks for giving examples. (See the sketch after this list.)

  • What you've got in the current analysis seems to me like an example of how you can have your data tell you what you're expecting it to. You haven't asked questions against your initial belief - e.g. how many "Cwm fjordbank" scripts are reading plugins - which you've already articulated as a hallmark of browser attribute fingerprinting. Right, working on this.

  • Well done for exploring, and then ruling out in the interests of time, the timestamp work - you were on the right track with your thinking here and I definitely think it could be explored in the future.

  • "Is there an automatic way to download the linked javascript file from script_url and parse it to look for keywords like "murmurhash", "hashset", "fingerprint"?" Not without much pain :D. But it is partially possible. That said, future crawls will collect that data at the time of crawl and include it in the dataset.

  • I would be interested to see not just the list of symbols but the value_counts, or perhaps the more comparable normalized value counts, per script. You could plot these to see if they all look similar - a somewhat tricky problem but potentially illuminating.

  • Your questions at the end are starting to get into this.
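
Here is a sketch of the "fingerprint" vs pangram comparison (the pangram string is exactly the truncation quoted above; na=False is added to guard against rows with missing arguments):

```python
# Scripts detected by name vs by the characteristic pangram canvas call.
by_name = set(df[df.script_url.str.contains('fingerprint', case=False)]
              .script_url.unique().compute())
by_pangram = set(df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz',
                                               na=False)]
                 .script_url.unique().compute())
print(len(by_name), len(by_pangram), len(by_name & by_pangram))

# Of the pangram-detected scripts, how many also enumerate plugins?
mask = (df.script_url.isin(list(by_pangram)) &
        df.symbol.str.contains('navigator.plugins|navigator.mimeTypes'))
print(df[mask].script_url.nunique().compute())
```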

Overall: A great improvement. Well done. Above are a lot of points. I think the most important of them is not the specifics but the principles in our ongoing back and forth about the validity of "fingerprint" vs "Cwm fjordbank". Moving forward, this could go on forever, but it shouldn't! I would like to think through a concrete set of refinements that will get this to a mergeable analysis contribution. I'm afraid I don't have this for you today as I have a lot of PRs to review, but let's touch base maybe next week. If you haven't heard from me, please ping me back on this PR.

Thanks for your review. I have addressed a few points and am working on the remaining ones (the pangram point onwards). Analyses 1, 2 and 3 have been updated; I am working on Analyses 4 and 5. I have updated the PR with the latest changes, feel free to take a look.

aliamcami commented 4 years ago

Hi @14Richa, are you still planning to submit the remaining requested changes?

aliamcami commented 4 years ago

Closing this PR due to lack of activity, please feel free to reopen.