privacy-tech-lab / privacy-pioneer

Privacy browser extension for analyzing web traffic of visited websites
https://www.privacytechlab.org/

Explore ML/NLP techniques for analyzing our privacy practices #348

Closed SebastianZimmeck closed 2 years ago

SebastianZimmeck commented 2 years ago

Today we had a nice discussion on NLP/ML techniques to use in our privacy practice analysis.

1. Background

We have done most of the engineering work on Privacy Pioneer. We went broad. Now we need to refine our analysis techniques and go deep. We are transitioning from engineering into research. The question is: can we improve over current state-of-the-art analysis techniques for location collection, fingerprinting, etc.? If it is possible, we should do it. If it is not possible, we can do the following:

If it works, we can also:

2. Framing privacy practice analysis as a sequence tagging (or named entity recognition) problem

@harkous remarked that it is not sufficient to find, say, a location in a website's web traffic. The location could be just the address of the website owner and need not be related to the user of the site at all. So, we would not only need to recognize that a location is being spelled out but also understand the context of this location. Is it the user's location that was just retrieved from some API call, or is it just the website owner's address? The answer cannot be known without the context.

Addressing this as a sequence tagging problem seems a natural fit for ML/NLP techniques. Can we learn the context surrounding a location and make the right call?
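As a toy illustration of the sequence tagging framing (all tokens, labels, and the URL-parameter string below are invented for illustration, not Privacy Pioneer output):

```python
# Hypothetical tokens from an HTTP request context, with BIO-style labels
# distinguishing a user's coordinates from the site owner's street address.
tokens = ["lat", "=", "41.55", "&", "lon", "=", "-72.65", "&", "office", "=",
          "45", "Wyllys", "Ave"]
labels = ["O", "O", "B-USER-LOC", "O", "O", "O", "B-USER-LOC", "O", "O", "O",
          "B-OWNER-LOC", "I-OWNER-LOC", "I-OWNER-LOC"]

def spans(tokens, labels):
    """Collapse BIO labels into (entity_type, entity_text) pairs."""
    out, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                out.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)
        else:
            if current:
                out.append(current)
            current = None
    if current:
        out.append(current)
    return [(t, " ".join(ts)) for t, ts in out]

print(spans(tokens, labels))
# [('USER-LOC', '41.55'), ('USER-LOC', '-72.65'), ('OWNER-LOC', '45 Wyllys Ave')]
```

The point of the framing: a plain location detector would flag all three values equally; a sequence tagger can learn from surrounding tokens that only the first two are the user's data.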

3. Challenges

The first question is: can we as humans make a determination, for example, whether a location is shared user data or the website owner's address? If yes, we can go forward. If no, there is nothing that we can do. In other words, is a model ever interpretable?

We can further think of our privacy practice analysis in terms of precision and recall. Ideally, we want to improve both, but maybe we will naturally come to focus on one or the other. This is a question of tuning our ML models.
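For reference, the two metrics over hypothetical counts (the numbers below are made up, not measured on Privacy Pioneer):

```python
def precision_recall(tp, fp, fn):
    """Precision: of everything we flagged, how much was right.
    Recall: of everything we should have flagged, how much we caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# E.g., 80 correct location detections, 20 owner addresses misflagged as
# user data, 10 user locations missed entirely:
p, r = precision_recall(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2))  # 0.8 0.89
```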

We can use a lot of interesting techniques, but are they really better than what we currently have? What is our baseline at the moment anyway?

Using off-the-shelf ML/NLP models will likely not work as there is probably no model for our problem. Maybe there are models for identifying nouns in an English text or sentences with negative sentiment. However, there is likely no model for identifying privacy practices in code.

I have had success with a combination of rule-based and ML/NLP techniques. First, apply rules, possibly to identify any text that is remotely related to location (so we have a high recall), then use ML/NLP to refine and weed out any false positives to get a high precision.
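A minimal sketch of that two-stage idea. The regex, the keyword heuristic (which merely stands in for the eventual ML/NLP classifier), and the sample request string are all illustrative, not Privacy Pioneer's actual rules:

```python
import re

# Stage 1 (high recall): any decimal number that could be a coordinate.
COORD_RULE = re.compile(r"-?\d{1,3}\.\d{2,}")

def candidates(text):
    return [(m.start(), m.group()) for m in COORD_RULE.finditer(text)]

# Stage 2 (restore precision): placeholder for a learned model that keeps a
# candidate only if the preceding context mentions geolocation-style keys.
def looks_like_user_location(text, start):
    window = text[max(0, start - 12):start].lower()
    return any(k in window for k in ("lat", "lon", "geo", "coord"))

traffic = "v=2&lat=41.5556&lon=-72.6588&price=19.99"
hits = [c for c in candidates(traffic) if looks_like_user_location(traffic, c[0])]
print(hits)  # [(8, '41.5556'), (20, '-72.6588')] -- the price is weeded out
```

Stage 1 deliberately over-matches (it also flags the price `19.99`); stage 2 is where the learned context model would earn its keep.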

Can we put a machine learning model into our extension, or would it be computationally too intensive?

4. How to approach the problem?

Opportunistic!!! It does not matter so much which practice exactly we analyze or which techniques we use. We should just play around and see what interesting things come up.

5. What concretely to do

  1. @notowen333, @danielgoldelman, and @Lr-Brown will start by looking into JS/Python libraries that we may want to use and see whether it is possible at all to incorporate them into Privacy Pioneer. It does not matter if the model does not fit our problem. The question is whether it can be done at all.
  2. Then, we go from there. Try to think of some ideas for how the ML/NLP techniques could be used to improve the privacy practice analysis that you are familiar with (e.g., fingerprinting, pixel, ...).

6. Links

Here are all the links that @harkous provided during our call today:

notowen333 commented 2 years ago

> The first question is: can we as humans make a determination, for example, whether a location is shared user data or the website owner's address? If yes, we can go forward. If no, there is nothing that we can do. In other words, is a model ever interpretable?

From observations while developing, I think it is a yes. Maybe a small amount of ambiguity in the rare case, but the average case is clear.

As an update: I'm working on getting annotation started. I chose to use doccano from the list above. You host the app locally on Docker. I believe it is set up well for collaboration.

notowen333 commented 2 years ago

I put a bunch of the HTTP contexts (100 characters before and after a piece of data found in the data-and-scripts repo) into a big text file and annotated like this:

(Screenshot, 2021-11-17: example annotated HTTP contexts in doccano)

Using these labels:

(Screenshot, 2021-11-17: the label set used for annotation)

I will push the doccano output JSON to the data/scripts repo.

It looks like we would need to host the Docker container on AWS or Azure for us all to collaborate in the same environment. It might be easier at our scale to just use the same labels (you can import them with a config file that I will also push) and then split up the files that we want to annotate.

I'm not sure about the pros/cons of annotating many different data contexts in the same file (like I did in the example) vs. separating out each data context into its own file. We can discuss at the next meeting.

So the sequenceLabel folder in the data-and-scripts repo has the content that I produced here.
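For context, the extraction step (100 characters before and after each piece of found data) looks roughly like this; a simplified sketch with a made-up sample string, not the actual repo script:

```python
def contexts(text, needle, width=100):
    """Return a fixed-width context window around every occurrence of needle."""
    out, i = [], text.find(needle)
    while i != -1:
        out.append(text[max(0, i - width): i + len(needle) + width])
        i = text.find(needle, i + 1)
    return out

# Tiny demonstration with width=5 so the window is visible:
sample = "a" * 10 + "41.5556" + "b" * 10
print(contexts(sample, "41.5556", width=5))  # ['aaaaa41.5556bbbbb']
```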

SebastianZimmeck commented 2 years ago

I am closing this issue. Nothing really concrete to do here. Feel free to reopen, anyone.