> The first question is: can we as humans make a determination, for example, whether a location is shared user data or the website owner's address? If yes, we can go forward. If no, there is nothing that we can do. In other words, is a model ever interpretable?
From observations while developing, I think it is a yes. There may be some ambiguity in rare cases, but the average case is clear.
As an update: I'm working on getting annotation started. I chose to use doccano from the list above. You host the app locally with Docker. I believe it is well set up for collaboration.
I put a bunch of the HTTP contexts (100 characters before and after a piece of data, as found in the data-and-scripts repo) into a big text file and annotated them like this:
Using these labels:
I will push the doccano output JSON to the data-and-scripts repo.
It looks like we would need to host the Docker container on AWS or Azure for all of us to collaborate in the same environment. At our scale, it might be easier to just use the same labels (you can import them with a config file I will also push) and then split up the files that we want to annotate.
I'm not sure about the pros/cons of annotating many different data-contexts in the same file (like I did in the example) vs separating out each data-context into its own file. We can discuss at the next meeting.
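For reference, here is a minimal sketch of how such ±100-character contexts could be pulled out of a stored HTTP body; the file names, example values, and the `findings` list are made up for illustration and do not reflect the actual data-and-scripts layout.

```python
# Sketch only: cut out +/-100 characters of context around each piece of
# data found in an HTTP body, so the snippets can be annotated in doccano.
# File names and example values are hypothetical.

CONTEXT_CHARS = 100

def extract_contexts(http_body: str, findings: list[str]) -> list[str]:
    """Return one context snippet per occurrence of each found value."""
    contexts = []
    for value in findings:
        start = http_body.find(value)
        while start != -1:
            begin = max(0, start - CONTEXT_CHARS)
            end = min(len(http_body), start + len(value) + CONTEXT_CHARS)
            contexts.append(http_body[begin:end])
            start = http_body.find(value, start + 1)
    return contexts

if __name__ == "__main__":
    with open("response_body.txt", encoding="utf-8", errors="replace") as f:
        body = f.read()
    snippets = extract_contexts(body, ["40.7128, -74.0060", "New York"])
    # One snippet per line in a big text file, ready for annotation.
    with open("annotation_input.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(s.replace("\n", " ") for s in snippets))
```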
So the `sequenceLabel` folder in the data-and-scripts repo has the content that I produced here.
I am closing this issue. Nothing really concrete to do here. Feel free to reopen, anyone.
Today we had a nice discussion on NLP/ML techniques to use in our privacy practice analysis.
1. Background
We have done most of the engineering work on Privacy Pioneer. We went broad. Now we need to refine our analysis techniques. We go deep. We are transitioning from engineering into research. The question is: can we improve over current state-of-the-art analysis techniques for location collection, fingerprinting, etc.? If it is possible, we should do it. If it is not possible, we can do the following:
If it works, we can also:
2. Framing privacy practice analysis as a sequence tagging (or named entity recognition) problem
@harkous remarked that it is not sufficient to find, say, a location in a website's traffic. The location could just be the address of the website owner; it does not need to be related at all to the user of the site. So, we would not only need to understand that a location is being spelled out, but also the context of this location. Is it the user's location that was just retrieved from some API call, or is it just the website owner's address? The answer cannot be known without the context.
Addressing this sequence tagging problem seems like a natural fit for ML/NLP techniques. Can we learn the context surrounding a location and make the right call?
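To make that framing concrete, here is a toy illustration of how one context could be turned into BIO-style token tags for a sequence labeling model; the label names (`USER_LOCATION`, `OWNER_ADDRESS`) and the example strings are placeholders, not our actual annotation scheme.

```python
# Toy illustration of the sequence tagging framing: the same kind of surface
# string (a location) gets a different tag depending on its context.
# Label names and examples are placeholders, not the final annotation scheme.

def bio_tags(tokens: list[str], span: tuple[int, int], label: str) -> list[str]:
    """Tag tokens[span[0]:span[1]] with B-/I-<label>, everything else with O."""
    tags = ["O"] * len(tokens)
    start, end = span
    for i in range(start, end):
        tags[i] = ("B-" if i == start else "I-") + label
    return tags

# Context suggesting the value came from a geolocation API call -> user data.
tokens_user = "api response : lat 40.71 lon -74.00 via navigator.geolocation".split()
print(bio_tags(tokens_user, (3, 7), "USER_LOCATION"))

# Context suggesting the value is a postal address in a page footer -> owner data.
tokens_owner = "Contact us : 123 Main St , Springfield , IL".split()
print(bio_tags(tokens_owner, (3, 10), "OWNER_ADDRESS"))
```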
3. Challenges
The first question is: can we as humans make a determination, for example, whether a location is shared user data or the website owner's address? If yes, we can go forward. If no, there is nothing that we can do. In other words, is a model ever interpretable?
We can further think of our privacy practice analysis in terms of precision and recall. Ideally, we want to improve both, but maybe we will naturally end up focusing on one or the other. This is a question of tuning our ML models.
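For concreteness, a tiny worked example of that trade-off; the counts are entirely made up.

```python
# Hypothetical counts for one analysis run:
# tp = contexts correctly flagged as user location collection,
# fp = contexts flagged but actually, e.g., the owner's address,
# fn = real collection events we missed.
tp, fp, fn = 80, 40, 10

precision = tp / (tp + fp)  # 80 / 120 = 0.67: of what we flag, how much is right
recall = tp / (tp + fn)     # 80 / 90  = 0.89: of what is there, how much we catch
print(f"precision={precision:.2f} recall={recall:.2f}")

# A looser rule stage raises recall (fewer fn) but usually adds fp and lowers
# precision; a stricter model does the opposite. That is the tuning question.
```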
We can use a lot of interesting techniques, but are they really better than what we currently have? What is our baseline at the moment anyways?
Using off-the-shelf ML/NLP models will likely not work, as there is probably no existing model for our problem. There are models for identifying nouns in English text or sentences with negative sentiment. However, there is likely no model for identifying privacy practices in code.
I have had success with a combination of rule-based and ML/NLP techniques: first apply rules, possibly to identify any text that is remotely related to location (so we have high recall), then use ML/NLP to refine and weed out any false positives to get high precision.
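A minimal sketch of that two-stage idea, assuming a regex rule stage and a stand-in classifier; the patterns and the `is_user_location` stub are placeholders, not a tested implementation of our pipeline.

```python
import re

# Stage 1 (rules, tuned for recall): flag anything that even vaguely looks
# location-related. The patterns below are illustrative only.
CANDIDATE_PATTERNS = [
    re.compile(r"-?\d{1,3}\.\d{2,}\s*,\s*-?\d{1,3}\.\d{2,}"),   # "40.71, -74.00"
    re.compile(r"\b(lat|latitude|lon|longitude|zip|postal)\b", re.I),
]

def rule_candidates(context: str) -> bool:
    return any(p.search(context) for p in CANDIDATE_PATTERNS)

# Stage 2 (ML, tuned for precision): a classifier over the surrounding context
# decides whether the flagged text is really the *user's* location. This is a
# stub; in practice it would be a trained model.
def is_user_location(context: str) -> bool:
    return "geolocation" in context.lower()

def analyze(contexts: list[str]) -> list[str]:
    flagged = [c for c in contexts if rule_candidates(c)]   # high recall
    return [c for c in flagged if is_user_location(c)]      # high precision

if __name__ == "__main__":
    samples = [
        'navigator.geolocation result: {"lat": 40.71, "lon": -74.00}',
        "Our office: 123 Main St (lat 39.78, lon -89.65)",  # flagged by rules, filtered by the model
        "unrelated response body",
    ]
    print(analyze(samples))
```

The point is just the division of labor: the rules cast a wide net, and the learned component makes the fine-grained call.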
Can we put a machine learning model into our extension, or would it be computationally too intensive?
4. How to approach the problem?
Opportunistically!!! It does not matter so much which practice exactly we analyze or which techniques we use. We should just play around and see what interesting things come up.
5. What concretely to do
6. Links
Here are all the links that @harkous provided during our call today: