privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Explore crawler for crawling on a large scale #18

Closed Jocelyn0830 closed 2 years ago

Jocelyn0830 commented 2 years ago

We can explore the possibility of doing cloud crawling.

SebastianZimmeck commented 2 years ago

1. Cloud Platforms

With 27K+ sites that we want to crawl, possibly even repeatedly, it is worthwhile to look into which cloud providers we could use. In principle, there are three:

  1. Google Cloud Platform (GCP)
  2. Microsoft Azure (Azure)
  3. Amazon Web Services (AWS)

All of these can get pretty complicated, especially AWS. So, my sense is that GCP or Azure would be easier. For example, there is Azure Virtual Desktop. This is probably not the best fit for our use case, but I am mentioning it because of the Headless vs. Non-headless Mode point below. The service I know best is Firebase, which is also part of GCP, but it is probably not what we are looking for either.

Heroku may also be an option, but the issue there is that we do not immediately have an OS to run a browser on.

The basic question is: what do people use for crawling the web at scale?

2. Headless vs. Non-headless Mode

One important point that @Jocelyn0830 brought up today is that we may run into an issue with headless vs. non-headless mode. If we switch to a cloud architecture, can we still crawl in non-headless mode? If so, do we need to make any modifications, and which ones? If we cannot use non-headless mode, can we modify our crawler to work in headless mode?

3. Deploying our Crawler

A less important point that we can look into once we have figured out the previous two points is how to deploy. For example, Puppeteer mentions Docker.

SebastianZimmeck commented 2 years ago

The main point that @Jocelyn0830 mentioned in our call is that it may not be possible to install extensions in headless mode. So, is there a workaround for this? The other issue, using keyboard commands to trigger starting and stopping the analysis, seems less of a problem. @Jocelyn0830 will continue exploring ...

Jocelyn0830 commented 2 years ago

Headless vs. Non-headless Mode

I read over the documentation of different cloud platforms, and most of them only support headless automated browsers. It seems to me that the easiest way forward is to develop a crawler that works in headless mode.

Here are the problems we need to solve to make a headless crawler:

  1. We need to install our extension programmatically in the Firefox browser since a headless browser has no UI.
  2. Upon installation, our extension is set to protection mode by default, so we need to switch it to analysis mode. Two possible solutions are (1) creating keyboard shortcuts for switching modes (see the sketch below) or (2) developing a testing-only version of the extension that is set to analysis mode upon installation.
  3. We need to download the analysis CSV data programmatically. One possible solution is, again, to create keyboard shortcuts.

Our current crawler works in headful mode, so we can do all of the above manually. However, a headless crawler means that everything needs to be done programmatically.
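
For reference, here is a minimal sketch of what option (1) from item 2 could look like in the extension's background script. The command name is made up and would also need to be declared under "commands" in manifest.json, and setAnalysisMode() is a hypothetical helper standing in for however the extension actually changes modes:

```javascript
// Sketch of a keyboard command that switches the extension into analysis mode.
// Assumes a "switch-to-analysis-mode" command declared in manifest.json.
browser.commands.onCommand.addListener((command) => {
  if (command === "switch-to-analysis-mode") {
    setAnalysisMode(true); // hypothetical helper, not an existing function in the extension
  }
});
```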

Current Progress:

I got stuck on problem 1 described above using Puppeteer. I tried many approaches in Puppeteer, but they all failed, so I decided to explore other options. After experimenting with different webdrivers, Selenium seems to be the one that works for us: (1) Selenium supports JavaScript and provides functions similar to those used in our current crawler; (2) Selenium supports different versions of Firefox, while Puppeteer only supports Firefox Nightly; (3) Selenium provides an API that can programmatically install a Firefox add-on in .xpi format.
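
As a concrete illustration, here is a minimal sketch of a headless Firefox session with a temporarily installed add-on, assuming a recent selenium-webdriver for Node. The .xpi path and visited URL are placeholders, not the crawler's actual code:

```javascript
const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");

(async () => {
  // Run Firefox without a UI
  const options = new firefox.Options();
  options.addArguments("-headless");

  const driver = await new Builder()
    .forBrowser("firefox")
    .setFirefoxOptions(options)
    .build();

  try {
    // Install the extension from a local .xpi file; passing `true` installs it
    // as a temporary add-on, so it does not have to be signed.
    await driver.installAddon("/path/to/gpc-analysis-extension.xpi", true);

    // Visit a site as the crawler would
    await driver.get("https://example.com");
    // ... trigger the analysis and collect results here ...
  } finally {
    await driver.quit();
  }
})();
```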

Jocelyn0830 commented 2 years ago

Several problems I encountered:

Jocelyn0830 commented 2 years ago

I successfully implemented a headless crawler using Selenium.

SebastianZimmeck commented 2 years ago

As discussed today, @Jocelyn0830 will continue exploring the backend. This can be simple (e.g., Firebase) or local (just the Mac mini, as before).

We do not need much of a UI. We can just issue a command, the crawler starts crawling, and the data is saved. There does not need to be much, if any, display via the browser.

We may also need a different backend for storing the data, e.g., results stored in a database instead of a CSV file.
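
As one possibility, here is a minimal sketch of writing results to a local SQLite database from Node using the better-sqlite3 package; the table name, columns, and example values are placeholders, not the extension's actual output format:

```javascript
// Store analysis results in a local SQLite database instead of a CSV file.
const Database = require("better-sqlite3");
const db = new Database("crawl_results.db");

// Placeholder schema for illustration only
db.exec(`CREATE TABLE IF NOT EXISTS analysis (
  site_url TEXT,
  sent_gpc INTEGER,
  us_privacy_before TEXT,
  us_privacy_after TEXT,
  crawled_at TEXT DEFAULT CURRENT_TIMESTAMP
)`);

const insert = db.prepare(
  "INSERT INTO analysis (site_url, sent_gpc, us_privacy_before, us_privacy_after) VALUES (?, ?, ?, ?)"
);

// In the crawler, these values would come from the extension's analysis output.
insert.run("https://example.com", 1, "1YNN", "1YYN");
```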

@Jocelyn0830 will create a list of features and make a first call on which ones go into the crawler, which go into the extension, and which can be removed from one or the other.

Then, we need to think about organizing our codebase. We can use this repo or new ones.

And if you have any one-off expenses, let me know, @Jocelyn0830, and I will reimburse you. Once we settle on a definite solution, I will set up an account and credit card to handle the monies.