mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Browser attribute fingerprinting analysis [WIP] #78

Closed: 14Richa closed this 4 years ago

14Richa commented 5 years ago
  1. Jupyter notebook doing the analysis
  2. Notes files to keep a list of threads and questions to follow
14Richa commented 5 years ago

Analysis in Pandas is the main file. BrowserAttributeFingerprinting.md contains my notes from the literature survey. Overview And Notes.md contains an overview and my understanding of the problem and dataset.

14Richa commented 5 years ago

Hey Sarah, thanks for your suggestions; I am adding changes as I go. I thought a few points were not clear in my analysis, so I have added explanations for them here.

* Excellent introduction and write-ups

Thanks!

  • It is not sufficient to run this on just the sample file, please run on the 10% sample - now you have honed your analysis, this should be straightforward - only read in the columns you need and it should go pretty quickly. Done.
  • You use df_plugins['script_url'].value_counts() to determine "which script is being used the most" - think about what you're actually counting here and how / why it might be biased. I am not actually counting which script is being used the most; I am interested in knowing which scripts call navigator.plugins and navigator.mimeTypes the most. Therefore I am doing this on an already reduced dataset (df_plugins, which contains only those rows with calls to the above-mentioned symbols). The hypothesis is that this would highlight the scripts which are abusing these symbols to gather information about multiple plugins etc. (See the sketch after this list.)
  • You end up stumped on the question that metrika js appears to be only looking at the flash plugin based on res_df['symbol'].value_counts(), but you've restricted your data to only be about plugins and mimeTypes - is that what you wanted? Yes, my reasoning is something like this: shortlist the scripts which query information on plugins, find the scripts which use this query a lot, and then see which plugins these scripts query. metrika.js is the top user of navigator.plugins and navigator.mimeTypes, but it only queries about flash players and not other kinds of plugins. That is why I am stuck.
  • You chose not to use 'Cwm fjordbank glyphs....' as a heuristic for finding incidences of fingerprintjs because it is a pangram. What are other uses for pangrams in javascript, and how would they show up in a dataset like this? Can you show me a script that is using 'Cwm fjordbank ....' but is not fingerprintjs or a slight modification of it? Yes, what I am trying to do here is to find all instances of fingerprint.js or fingerprint2.js. There are other scripts as well which use 'Cwm fjordbank glyphs....', but I do not want to focus on them for the current analysis. I just want to look at what fingerprint.js/fingerprint2.js is doing. As I said in the notebook: "Here I am interested in looking at all calls of fingerprint2.js. I want to understand what arguments and values are associated with calls to fingerprint2.js. Can I infer a pattern with such calls and filter the calls to fingerprint2.js without explicitly looking for it?" I agree that we can find more scripts with the pangram, but I am interested only in fingerprint.js and fingerprint2.js.
  • You say "almost all calls are made around same time and it is querying a bunch of attributes to produce a hash" - what is the distribution of timestamps and how "close" are the timestamps you're referring to relative to the general distribution of timestamps. Interesting point, let me think more on this that how can I see the general distribution of timestamps. Do you know any visualization tools for this? I want something of a clustering but in time-space.
  • You say "I want to test if I just filter on rare symbols can I catch fingerprint.js calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts. As expected sessionStorage is pretty common followed by ShockWaveLength. The count reduces a lot for FingerPrint, doNotTrack and FuturesplashSuffixes." I don't feel you've made your case well, if at all. A statistical justification would certainly be possible. But more simply than that show me a bar chart (or something) with the average population prevalence compared to the fingerprint prevalence. (Then my follow-up question will be how does that compare to hs-analytics or akam.) I am littel confused here. My idea was simply that less common symbols would be called by fingerprinting scripts (assuming fingerprinting scripts are very less in number compared to clean scripts). I think plotting a graph of calls to FuturesplashSuffixes in general population vs reduced dataset (containing only fingerprinting scripts) can help checking this point.
  • "Some of these like cloudfront.net are CDNs and can be overlooked." Why can they be overlooked? These are content delivery networks, host to many files. Can't directly be blamed for serving fingerprinting files
  • Good work on the metrika detection. Thanks!
  • I don't see why you need both the "domains" and the "base_url" - as you work with dask you'll want to keep this processing to a minimum - pick one - probably doesn't matter too much which for now. Sure, will do.
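
As a concrete sketch of the filtering and counting above (the parquet path is a placeholder, and counting distinct pages rather than raw rows is one way to address the bias Sarah raised):

```python
import dask.dataframe as dd

# Placeholder path; only read the columns needed for this question.
df = dd.read_parquet('path/to/10pct_sample.parquet',
                     columns=['script_url', 'symbol', 'location'])

# Keep only rows where a script touched the plugin/mimeType enumeration APIs.
df_plugins = df[df.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]

# Raw row counts: biased toward scripts that poll the same symbols
# repeatedly, or that are embedded on many pages.
print(df_plugins['script_url'].value_counts().head(20))

# Less biased alternative: count distinct pages per script instead of raw calls.
per_page = (df_plugins[['script_url', 'location']]
            .drop_duplicates()
            .script_url.value_counts())
print(per_page.head(20))
```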
birdsarah commented 5 years ago

(I'm replying one at a time as I'm on my phone)

You use df_plugins['script_url'].value_counts() to determine "which script is being used the most" - think about what you're actually counting here and how / why it might be biased.

I am not actually counting which script is being used the most

Your words said "the most," so that's what I read. Definitely focus on being specific.

I am interested in knowing which scripts call navigator.plugins and navigator.mimeTypes the most. Therefore I am doing this on an already reduced dataset (df_plugins, which contains only those rows with calls to the above-mentioned symbols). The hypothesis is that this would highlight the scripts which are abusing these symbols to gather information about multiple plugins etc.

What I want you to think about is this: you are using the number of rows to make inferences. What does that mean in this dataset? What do the rows represent? And what can you infer if that's what you choose to count?

birdsarah commented 5 years ago

You chose not to use 'Cwm fjordbank glyphs....' as a heuristic for finding incidences of fingerprintjs because it is a pangram. What are other uses for pangrams in javascript, and how would they show up in a dataset like this? Can you show me a script that is using 'Cwm fjordbank ....' but is not fingerprintjs or a slight modification of it?

Yes, what I am trying to do here is to find all instances of fingerprint.js or fingerprint2.js. There are other scripts as well which use 'Cwm fjordbank glyphs....', but I do not want to focus on them for the current analysis.

It's not a huge deal for your analysis, but to be clear about the point I'm trying to make: the goal was to find instances of this library. Which is more likely: that a developer has kept the name fingerprint.js, or that they have kept the methodology of using "Cwm fjordbank ...."? In fairness, I never provided evidence that the "Cwm fjordbank...." lookup is superior, but similarly you haven't demonstrated that all instances of scripts named "fingerprint.js" are the correct library.

I don't particularly want you to change anything but to think critically about the choices you are making.

birdsarah commented 5 years ago

Do you know any visualization tools for this? I want something like clustering, but in time-space.

Nothing springs to mind; histogram-type things should be good enough for thinking about distributions.
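
For example, something as simple as this (a minimal sketch; the `time_stamp` column name and pulling the filtered frame into pandas are assumptions about your setup):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes call times live in a 'time_stamp' column and that the filtered
# frame is small enough to compute() into pandas.
times = pd.to_datetime(df_plugins['time_stamp'].compute())

# Bucket calls per minute; tight clusters show up as narrow spikes.
times.dt.floor('min').value_counts().sort_index().plot()
plt.xlabel('time of call')
plt.ylabel('calls per minute')
plt.show()
```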

birdsarah commented 5 years ago

You say "I want to test if I just filter on rare symbols can I catch fingerprint.js calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts. As expected sessionStorage is pretty common followed by ShockWaveLength. The count reduces a lot for FingerPrint, doNotTrack and FuturesplashSuffixes." I don't feel you've made your case well, if at all. A statistical justification would certainly be possible. But more simply than that show me a bar chart (or something) with the average population prevalence compared to the fingerprint prevalence. (Then my follow-up question will be how does that compare to hs-analytics or akam.)

I am a little confused here. My idea was simply that less common symbols would be called by fingerprinting scripts (assuming fingerprinting scripts are far fewer in number than clean scripts).

There are quite a few assumptions in these ideas and you haven't made a case for any of them. Let's unpack them:

(1) "less common symbols would be called by fingerprinting scripts" - I don't believe that to be true, but you certainly could present evidence to make that point and that would be interesting to see

(2) "assuming fingerprinting scripts are very less in number compared to clean scripts" (a) first going back to my earlier point make sure you're clear on what you're counting and whether it helps you find you the information you want (b) I really don't see how the commonness of fingerprinting scripts relates to the the relative frequency of symbol calls by those scripts (c) how are you going to separate clean scripts from fingerprinting scripts to answer this question. Is everything that is not a fingerprinting script "clean"? What about all the scripts we haven't identified yet.

birdsarah commented 5 years ago

These are content delivery networks hosting many files; they can't directly be blamed for serving fingerprinting files.

You need to make this justification in your writing, not to me.

Just as a thought experiment: If you are going to take the position that CDNs cannot be "blamed" for fingerprinting scripts then all fingerprinters would just move their content to a CDN. What should we do in that case if we want to stop fingerprinting?

birdsarah commented 5 years ago

I think my comments are written more negatively than I intend, because I don't intend them negatively at all. There's a LOT to dig into here and you're well on your way.

In particular, I have deleted "You are missing my point." - that is not helpful language to use on my part and I apologize.

14Richa commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

birdsarah commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

Please remove obsolete files. If doing this with git isn't familiar to you, don't hesitate to ask.

14Richa commented 5 years ago

Added a new file --- Analysis in dask. It contains the analysis on the 10% dataset using dask. Please ignore Analysis in pandas; that is an old file.

Please remove obsolete files. If doing this with git isn't familiar to you, don't hesitate to ask.

I was wondering if it should be removed totally. Isn't it a good idea to have the analysis in pandas as well, for someone to use in case they have memory/system constraints? Though I agree that the two notebooks will go out of sync very soon and it will be a hassle to keep updating both of them.

birdsarah commented 5 years ago

Isn't it a good idea to have the analysis in pandas as well, for someone to use in case they have memory/system constraints?

I would say no. The hello_world.ipynb already shows loading data in dask vs pandas, and there's no case for duplicate analysis. Analysis of one file with pandas isn't meaningful; it was just a useful stepping stone for you getting to where you are. Also, you're not totally removing it: it will always be in the commit history.
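
For reference, the swap really is minimal (a sketch; the parquet path is a placeholder):

```python
import pandas as pd
import dask.dataframe as dd

PATH = 'path/to/sample.parquet'  # placeholder

# pandas: loads everything into memory at once -- fine for a single file.
df_pd = pd.read_parquet(PATH, columns=['script_url', 'symbol'])

# dask: lazy and partitioned -- scales to the 10% sample and beyond.
df_dd = dd.read_parquet(PATH, columns=['script_url', 'symbol'])
print(df_dd.script_url.nunique().compute())
```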

14Richa commented 5 years ago

Notes as I go:

* Don't leave the printout of hundreds or thousands of rows in your notebook; it hinders comprehension. You will definitely look at this content while exploring, but clean it up before review.

* `len(df.script_url.unique())` -> `df.script_url.nunique()`

* `df['location_domain'] = df.location.apply(extract_domain)` -> `df['location_domain'] = df.location.apply(extract_domain, meta='O')` ('O' is object which is all we have available for strings)

* ` df[df.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]` nice

* "These days some browsers don't return an array of plugins directly, except the most common plugins such as Shockwave flash, Java, etc." citation please

Addressed the above points; a minimal sketch of the `meta='O'` point is below.
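
For reference (the body of `extract_domain` here is an illustrative stand-in for the notebook's helper; the `meta='O'` usage is exactly as suggested above):

```python
from urllib.parse import urlparse

def extract_domain(url):
    # Guard against missing values, which appear as floats (NaN).
    if not isinstance(url, str):
        return ''
    return urlparse(url).netloc

# Without meta, dask infers the output dtype by running the function on
# dummy data; 'O' (object) is the only dtype pandas offers for strings.
df['location_domain'] = df.location.apply(extract_domain, meta='O')
```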

  • "That is all queries to window.navigator.plugins[Shockwave Flash].description resulted in Shockwave Flash 28.0 r0. This is strange." Why is it strange? This data was collected in a crawl. That is identical machines were setup to crawl the web and their profiles were reset between every visit to a website. "There seems to be a bias in the dataset." Agreed. "Strange but on a brighter side difficult to fingerprint :)" Unfortunately you can't make this inference because this is not a population sample of the variation of plugins. Agreed. Thanks for pointing out the flaw in the reasoning.

  • "Memory usage for df_plugins is less. I can take all of this in pd dataframe and use pivots to analyze." Good thinking. Dask does have a pivot option. But converting to pandas when you can definitely makes things nicer.

  • I was about to write: "I don't think you needed a pivot table. I think a groupby would have got you there: df_plugins_pd.groupby(['location', 'script_url', 'symbol']).count()" but that is wrong. I see what you've done, and I see that you were getting the number of unique symbols. Perhaps at some point we can brainstorm how to make this a bit cleaner and more obvious.

  • In your analysis 2 you find 0 hs-analytics. Earlier you noted that hs-analytics is a fingerprinting script; what do you think is going on? Addressed the issue: hs-analytics is a fingerprinting script but doesn't use plugin information. This also gives me an idea to include other symbols on top of plugin information when flagging scripts.

  • Avoid hardcoding numbers like "There are 166862 unique script_urls in the dataset. From this we have identified 790 (725+53+12) unique URLs which definitely host fingerprinting scripts and another 888 potential urls worth checking out." You could rewrite this as a code cell: f'There are {len(unique_scripts):,} unique script_urls in the dataset. From this we have identified {sum(n_scripts)} unique URLs which definitely host fingerprinting scripts and another {n_new} potential urls worth checking out.' While it might seem counter to other things I'm arguing for, the oddness of duplicated text is outweighed by the robustness of not transcribing numbers, and the re-usability for running this against a future dataset. Addressed.

  • "So this script always asks for same 10 symbols which we can see below." Only because you've restricted your starting point to df_plugins which is the subset of scripts that calls plugins. Maybe that's what you're interested in but this statement is misleading.

  • "metrika/watch.s can be used for browser plugin fingerprinting." This is true, but I don't think you've really shown it. There is much more evidence in the dataset for you to make this claim much more convincing. Why not just look at all the symbols metrika is getting? Included more analysis around the symbols metrika is getting. This goes back to the point mentioned for hs-analytics, I should check more symbols which can be used for browser fingerprinting.

  • "I have found that the above string is a panagram and can be used in other fingprinting scripts" - I clearly still haven't explained this well enough. Let me try again. fingerprintjs2 is not just a browser attribute fingerprinting script. It also does canvas fingerprinting. It's characteristic canvas fingerprinting feature is the call to "Cwm fjordbank....". This enables you find as many of the fingerprintjs2 / fingerprintjs2-like scripts and then examine them for the browser attribute fingerprinting within them Aah, now I get what you meant here. Working on it.

  • "Therefore I have looked for "fingerprint" in the URL column of the dataset." As I already mentioned, you haven't provided evidence that this is actually capturing the fingerprintjs2 library and not just other scripts with the word fingerprint in the url which may or may not be what you want I agree to the point of catching false scripts here, though I feel that likelihood is less. Should check though.

  • The reason I care about this is because I believe you're cutting yourself off from data: df[df.script_url.str.contains('fingerprint', case=False)].script_url.nunique().compute() returns 78 scripts, while df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz')].script_url.nunique().compute() returns 505 scripts. The intersection between the lists is 36. If you look at the remaining 42, I can see a number that fall into this category - although I am happy to concede there are many that do look, on quick inspection, like what we were looking for but were not picked up by "Cwm fjordbank....". That said, of the 505 scripts detected by "Cwm fjordbank", 499 have plugin calls. If you only wanted to take the 499 that have plugin calls as an indicator of browser attribute fingerprinting, that seems fairly reasonable. You'll have missed a few that the "fingerprint" approach got, but you're still up 400 examples with, I believe, fewer false positives. I see your point clearly now, thanks for giving examples. (See the sketch after this list.)

  • What you've got in the current analysis seems to me like an example of how you can have your data tell you what you're expecting it to. You haven't asked questions against your initial belief - e.g. how many "Cwm fjordbank" scripts are reading plugins - which you've already articulated as a hallmark of browser attribute fingerprinting. Right, working on this.

  • Well done for exploring, and then ruling out in the interests of time, the timestamp work - you were on the right track with your thinking here and I definitely think it could be explored in the future.

  • "Is there an automatic way to download the linked javascript file from script_url and parse it to look for keywords like "murmurhash", "hashset", "fingerprint"?" Not without much pain :D. But it is partially possible. That said, future crawls will collect that data at the time of crawl and include it in the dataset.

  • I would be interested to see not just the list of symbols but the value_counts, or perhaps the more comparable normalized value counts, per script. You could plot these to see if they all look similar - a somewhat tricky problem but potentially illuminating.

  • Your questions at the end are starting to get into this.
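
Here is a sketch of the "fingerprint" vs pangram comparison (the pangram string is exactly the truncation quoted above; na=False is added to guard against rows with missing arguments):

```python
# Scripts detected by name vs by the characteristic pangram canvas call.
by_name = set(df[df.script_url.str.contains('fingerprint', case=False)]
              .script_url.unique().compute())
by_pangram = set(df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz',
                                               na=False)]
                 .script_url.unique().compute())
print(len(by_name), len(by_pangram), len(by_name & by_pangram))

# Of the pangram-detected scripts, how many also enumerate plugins?
mask = (df.script_url.isin(list(by_pangram)) &
        df.symbol.str.contains('navigator.plugins|navigator.mimeTypes'))
print(df[mask].script_url.nunique().compute())
```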

Overall: A great improvement. Well done. Above are a lot of points. I think the most important of them is not the specifics but the principles in our ongoing back and forth about the validity of "fingerprint" vs "Cwm fjordbank". Moving forward, this could go on forever, but it shouldn't! I would like to think through a concrete set of refinements that will get this to a mergeable analysis contribution. I'm afraid I don't have this for you today as I have a lot of PRs to review, but let's touch base maybe next week. If you haven't heard from me, please ping me back on this PR.

Thanks for your review. I have addressed a few points and am working on the remaining ones (the pangram point onwards). Analyses 1, 2 and 3 have been updated; I am working on Analyses 4 and 5. I have updated the PR with the latest changes, feel free to take a look.

aliamcami commented 4 years ago

Hi @14Richa, are you still planning to submit the remaining requested changes?

aliamcami commented 4 years ago

Closing this PR due to lack of activity, please feel free to reopen.