SebastianZimmeck closed this issue 2 years ago
We decided to give it a go with a simple in-extension batch analysis mode. We will try that out first and see how it goes. @kalicki1, you are probably in the best position to take the lead on this one, possibly with some help from @OliverWang13.
This could be a good Pro feature if we decide to go the startup route with an open core business model, for example. We should have clarity on this before implementation. It could be done via a code check at the backend, or, if the batch analysis is not part of the extension codebase, by not open sourcing that code.
As @kalicki1 is currently exploring how pages can be reloaded as part of the analysis mode, that reload functionality will also inform how we proceed here.
Maybe we pick this up later.
Today we discussed additional things that we can do, especially a manual analysis of Do Not Sell metrics provided by larger websites.

Another point I can think of is to implement a batch analysis feature. At the moment, we are building our extension such that when a user is in analysis mode, they navigate to a website, the analysis is provided, they navigate to another website, the analysis is provided, and so on. We could implement some kind of batch functionality that takes a set of domains as input and outputs a CSV with compliance analysis results (or the analysis results are written to the analysis page and can be exported from there; that may make it easier to reuse our current functionality). That way we could scale our analysis. I imagine something on the order of 1K to 100K sites. This would be useful as both (1) an additional research aspect and (2) an additional artifact.
For (1) we would get a survey of how many sites implement Do Not Sell links and US Privacy Strings (and how many do not). We would also know how many sites are compliant. We would probably still only be able to contact a smaller fraction of sites because we are not automatically crawling for email addresses to contact them (and that also seems tricky).
I can see two ways of going about this:
1. Build the functionality into OptMeowt itself
This could be built directly into OptMeowt. Essentially, read in a domain list from an external file, use JS APIs for opening and closing tabs, and, as always, record the results (see the first sketch below).
2. Use an external script to drive a Selenium instance with OptMeowt installed (in our scripts and data repo)
The crawling functionality could also be external to OptMeowt. I played around with this option using a basic setup of Selenium for Firefox and installing Firefox extensions in Selenium. There are also some setup steps, such as changing the PATH variable (e.g., so that geckodriver can be found). Essentially, OptMeowt would stay as is, and we would provide an external script and setup instructions to do a crawl (see the second sketch below).
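For option 1, here is a minimal sketch of what the in-extension batch runner could look like. It is hypothetical throughout: `analyze(tabId)` stands in for our existing analysis entry point, the result fields are placeholders, and it assumes a Manifest V2 background page with the `tabs` and `downloads` permissions.

```js
// Hypothetical batch runner inside OptMeowt's background script.
// Assumes: MV2 background page, "tabs" and "downloads" permissions, and an
// existing analyze(tabId) that resolves with the usual analysis results.

async function runBatch(domains) {
  const results = [];
  for (const domain of domains) {
    const tab = await openTab(`https://${domain}`);
    await waitForLoad(tab.id);            // let the page finish loading
    results.push(await analyze(tab.id));  // reuse the analysis-mode logic
    await closeTab(tab.id);               // close before moving on
  }
  downloadCsv(results);
}

// Resolve once the tab reports status "complete".
function waitForLoad(tabId) {
  return new Promise((resolve) => {
    chrome.tabs.onUpdated.addListener(function listener(id, info) {
      if (id === tabId && info.status === "complete") {
        chrome.tabs.onUpdated.removeListener(listener);
        resolve();
      }
    });
  });
}

function openTab(url) {
  return new Promise((resolve) => chrome.tabs.create({ url, active: false }, resolve));
}

function closeTab(tabId) {
  return new Promise((resolve) => chrome.tabs.remove(tabId, resolve));
}

// Serialize results and hand them to the downloads API as a CSV.
// The result fields below are assumptions about what analyze() returns.
function downloadCsv(results) {
  const rows = results.map((r) => `${r.domain},${r.usPrivacyString},${r.doNotSellLink}`);
  const csv = "domain,us_privacy,do_not_sell_link\n" + rows.join("\n");
  const url = URL.createObjectURL(new Blob([csv], { type: "text/csv" }));
  chrome.downloads.download({ url, filename: "batch-analysis.csv" });
}
```

A real run at 1K to 100K sites would also need a timeout for pages that never reach `complete` and some throttling, but the overall shape stays the same.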
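For option 2, a sketch of the external driver using the Node `selenium-webdriver` bindings (the Python bindings would work just as well). File names and the result-reading step are assumptions; in particular, the `executeScript` line is only a stand-in for however we end up reading the analysis results back out of the extension.

```js
// Hypothetical external crawl driver: Firefox + OptMeowt via selenium-webdriver.
// Assumes geckodriver is on the PATH and optmeowt.xpi is a packaged build.
const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");
const fs = require("fs");

async function crawl(domains) {
  const options = new firefox.Options();
  options.addExtensions("optmeowt.xpi"); // assumption: a packaged .xpi of OptMeowt
  const driver = await new Builder()
    .forBrowser("firefox")
    .setFirefoxOptions(options)
    .build();
  const rows = ["domain,result"];
  try {
    for (const domain of domains) {
      await driver.get(`https://${domain}`);
      await driver.sleep(5000); // crude wait for the extension to finish analyzing
      // Stand-in: assumes the extension exposes its results somewhere the
      // page (or its analysis page) can read them from.
      const result = await driver.executeScript("return document.title;");
      rows.push(`${domain},${JSON.stringify(result)}`);
    }
  } finally {
    await driver.quit();
  }
  fs.writeFileSync("batch-analysis.csv", rows.join("\n"));
}

crawl(fs.readFileSync("domains.txt", "utf8").split("\n").filter(Boolean))
  .catch(console.error);
```

One caveat: installing an unsigned .xpi may require a Firefox build that allows it, which would be part of the setup instructions mentioned above.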
So, should we do that? If so, how?