Open birdsarah opened 5 years ago
I uploaded a notebook with basic examples for finding each of the scripts here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_34_setup_and_dask_tips.ipynb
I also gave additional information in the chat. Pasting here too:
hs-analytics:
akam:
fingerprintjs2:
Hi I am interested in this. I want to work on it.
Hi @birdsarah, I was looking at this issue, and the notebook you uploaded already seems to perform all three tasks mentioned in the issue. Could you please explain in more detail what further changes are needed in the notebook to solve this issue?
Hi @birdsarah , I am applying for outreachy.
Do you think it's a good idea to detect canvas fingerprinting? I am thinking along the lines of detecting unnecessary canvas elements, but I am not entirely sure how to detect which elements are not needed.
Generally, canvas fingerprinting is done by calling the toDataURL() method. I am assuming there is no real reason for genuine scripts to get the canvas image in data URL format. Do you have any suggestions for me?
@srujana121 @muskankhedia I will try and answer both your questions together. @srujana121 there is no need to develop a technique for detecting fingerprinting. This has already been developed and examples are in the literature. See "Online Tracking: A 1-million-site Measurement and Analysis " and "The Web's Sixth Sense" on the reading list: https://github.com/mozilla/overscripted/wiki/Reading-List-(WIP)
In particular, the code for detecting four types of fingerprinting we're interested in (canvas, font, audio, and webrtc) is available here: https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py
@willougr has done the work of applying these heuristics to our dataset and will be submitting his code shortly. Some of the results of his work are here: https://github.com/mozilla/overscripted/blob/master/analyses/2018_12_willoughr__fingerprinting_prevalence.txt
This issue is about developing code like that shown at https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py but finding a set of rules that detect browser attribute fingerprinting, that is, the type of fingerprinting that compiles together a series of browser attributes. Again, the reading list articles describe this type of fingerprinting in more detail.
@muskankhedia, the notebook supplied does not solve this issue; it provides the code to filter some relevant scripts out of the whole dataset. The hard work is then developing a "heuristic" that picks out these scripts and others like them. By "heuristic" I mean a rule-set encoded in code that selects for specific scripts and not others.
For example, in the case of canvas fingerprinting, the heuristic in extract_features.py looks for scripts that call toDataURL but do not call save, restore, or addEventListener (along with some other things).
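To make the idea of a rule-set concrete, here is a minimal sketch of the canvas rule described above. It is not the code from extract_features.py, which checks additional conditions. The input format (pairs of script URL and symbol), the exact symbol strings, and the example URLs are all assumptions for illustration.

```python
from collections import defaultdict

# Assumed symbol names, written in the style of recorded JavaScript calls.
EXPORT_CALL = "HTMLCanvasElement.toDataURL"
EXCLUDING_CALLS = {
    "CanvasRenderingContext2D.save",
    "CanvasRenderingContext2D.restore",
    "HTMLCanvasElement.addEventListener",
}


def canvas_fingerprinting_candidates(calls):
    """Return script URLs that call toDataURL but none of the excluding calls.

    `calls` is an iterable of (script_url, symbol) pairs (hypothetical format).
    """
    symbols_by_script = defaultdict(set)
    for script_url, symbol in calls:
        symbols_by_script[script_url].add(symbol)

    return sorted(
        url
        for url, symbols in symbols_by_script.items()
        if EXPORT_CALL in symbols and not (symbols & EXCLUDING_CALLS)
    )


# Made-up example data: only the first script satisfies the rule.
calls = [
    ("https://a.example/fp.js", "HTMLCanvasElement.toDataURL"),
    ("https://a.example/fp.js", "CanvasRenderingContext2D.fillText"),
    ("https://b.example/draw.js", "HTMLCanvasElement.toDataURL"),
    ("https://b.example/draw.js", "CanvasRenderingContext2D.save"),
    ("https://c.example/ui.js", "window.navigator.userAgent"),
]
print(canvas_fingerprinting_candidates(calls))
```

A heuristic for browser attribute fingerprinting would need its own set of rules, but it could plausibly have the same shape: group calls by script, then select on which symbols are present or absent.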
Hi @birdsarah,
I have some questions about this: do we have to make a list of the scripts used for browser attribute fingerprinting and search for each of them individually in a loop, or do we have to create a function that automatically searches for such scripts based on some parameters?
@muskankhedia have you reviewed "https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py"? Which of the papers in the reading list have you reviewed? What did you learn from them?
Please reformat your question: "In ___ the authors do ___. When I tried to do ___, I was stuck by ___. As a result I have the following question ___."
@birdsarah
In "https://sensor-js.xyz/webs-sixth-sense-ccs18.pdf" the authors find the trackers by clustering the scripts that use sensor information. This is what I have understood so far. Can you tell me if this is what I have to do?
Hi @srujana121 .... I'm having a little trouble answering this. So I'm going to say up front that there are no right answers here. That's the hard part about data exploration and research. There's no hidden hint in the rest of what I write here about what I think is a "best" direction. The following is just notes and observations, not direction.
What you posted is not what you have to do, but it is an approach. There are multiple ways of approaching the problem of building a heuristic.
You could keep investigating other approaches, and document their differences, strengths, and weaknesses. Or you could pursue this approach.
If you pursue the approach you outlined I would be surprised if you were able to finish an undertaking like that in a couple of weeks. But that doesn't mean you shouldn't start. But given that it's a big job think about the interim outputs. Think about documenting your background research, your methodology, and how you will measure success. This preparation and thinking work alone can be a solid contribution. In addition, thinking through questions like how you will measure success will likely help you hone your methodology. If you're moving quickly, then post that preparation document early as a PR, get feedback and start a conversation about moving your analysis along.
The work from @willougr has been posted: https://github.com/mozilla/overscripted/tree/master/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense - it has a small bug in it so if you're trying to run it yourself you may need to fix up the variable names for the data file path - but other than that it's good. This applies the heuristics used for detecting audio, canvas, font, and webrtc by the Sixth Sense paper to the OverScripted dataset.
@Victory17 I missed your message before. Permission is not required to work on issues. Just dive right in.
Adding for clarity: Browser attribute fingerprinting is a kind of browser fingerprinting in which a set of browser-specific attributes is collected and used to uniquely identify a browser. E.g., it could be a hash generated using a known algorithm that concatenates attributes like screen size, resolution, fonts, etc. into a string and hashes that string. This hash will most likely be unique to the browser from which it was generated. Relevant paper.
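As a rough illustration of the concatenate-and-hash idea, here is a minimal sketch. The attribute names, the separator, and the choice of SHA-256 are assumptions for the example; real fingerprinting scripts run in the browser and may combine attributes quite differently.

```python
import hashlib


def attribute_fingerprint(attributes):
    """Hash a dict of browser attributes into a single identifier.

    Keys are sorted so the same attributes always produce the same string,
    regardless of the order in which they were collected.
    """
    joined = "|".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical attribute values for two browsers that differ only in screen size.
browser_a = {
    "screen": "1920x1080",
    "color_depth": 24,
    "fonts": "Arial,Helvetica",
    "user_agent": "ExampleBrowser/1.0",
}
browser_b = dict(browser_a, screen="1366x768")

print(attribute_fingerprint(browser_a))
print(attribute_fingerprint(browser_b))
```

Changing a single attribute yields a different hash, which is why collecting many attributes tends to separate browsers from one another.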
I think this is a fantastic topic, as I work in the realm of GDPR in the UK and Europe and privacy laws here in the US. Just brainstorming, but might it be a good idea to look into fingerprinting in browsers or countries where internet privacy is quite strictly regulated? Although the GDPR stays quite clear of some technology, I think this might give us a good way to establish similarities in scripts that pull the necessary data, to track the major changes in fingerprinting scripts, and also to look at what is really considered true fingerprinting that identifies an individual. I will keep looking for other angles to create what is needed for this.
@14Richa bit confused by your last comment - is it aimed at me or tikwiza? always good to use an @ someone.
@birdsarah oops, Apologies. I added it for general discussion use case. File summarizes some threads I started chasing.
There are some scripts that we can pick out by name that are doing browser attribute fingerprinting:
hs-analytics in the script_url
/akam/ in the script_url
Can we build a heuristic for browser attribute fingerprinting that pulls out these scripts?
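One way to use the scripts that can be picked out by name is as a check on any candidate heuristic: what fraction of them does it pull out? The sketch below assumes plain lists of script URL strings; the function names and example URLs are hypothetical, and the two patterns are the ones named above.

```python
# Substrings of script_url that identify the scripts named in this issue.
KNOWN_PATTERNS = ("hs-analytics", "/akam/")


def is_known_fingerprinter(script_url):
    """True if the URL contains one of the known name patterns."""
    return any(pattern in script_url for pattern in KNOWN_PATTERNS)


def recall_on_known(flagged_urls, all_urls):
    """Fraction of name-matched scripts that a candidate heuristic flagged.

    Returns None when no URL in `all_urls` matches a known pattern.
    """
    known = {url for url in all_urls if is_known_fingerprinter(url)}
    if not known:
        return None
    return len(known & set(flagged_urls)) / len(known)


# Made-up URLs: two match by name, and the heuristic found one of them.
all_urls = [
    "https://js.example/hs-analytics/1.js",
    "https://site.example/akam/10/abc",
    "https://cdn.example/jquery.js",
]
flagged = ["https://js.example/hs-analytics/1.js", "https://cdn.example/jquery.js"]
print(recall_on_known(flagged, all_urls))
```

This only measures whether the known scripts are found. It says nothing about whether the other scripts a heuristic flags are really fingerprinting, so it is at best one part of measuring success.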