privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Develop a cloud crawler #20

Closed Jocelyn0830 closed 1 year ago

Jocelyn0830 commented 1 year ago

As we discussed, we are going to build a crawler that utilizes cloud computing.

SebastianZimmeck commented 1 year ago

@Jocelyn0830 is leading the effort and @sophieeng and @katehausladen will join as soon as they are available.

Jocelyn0830 commented 1 year ago

As discussed in the last meeting, it seems that Google Cloud Functions works best for us. In order to deploy the Selenium script to Google Cloud Run, we need to run Selenium in Docker using a Remote WebDriver and build a Docker image for our script. I successfully ran our script in Docker; however, since we are using a remote WebDriver, we are not able to export the CSV file to a local directory. Therefore, I am thinking about building an online database to store our exported data (and maybe also the input sites).
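For reference, here is a minimal sketch of what connecting to a remote Selenium instance running in Docker could look like with the `selenium-webdriver` Node bindings; the port and endpoint are assumptions based on the default Selenium standalone setup, not our final configuration.

```javascript
// Minimal sketch: drive a Firefox instance that runs inside a Docker
// container instead of on the local machine. Assumes a Selenium
// standalone-firefox container is listening on port 4444.
const { Builder } = require("selenium-webdriver");

(async () => {
  const driver = await new Builder()
    .forBrowser("firefox")
    .usingServer("http://localhost:4444/wd/hub") // remote WebDriver endpoint
    .build();
  try {
    await driver.get("https://example.com");
    console.log(await driver.getTitle());
  } finally {
    // The browser (and any file it writes) lives inside the container,
    // which is why a CSV written there never shows up locally.
    await driver.quit();
  }
})();
```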

Jocelyn0830 commented 1 year ago

Resources that might be useful for us:

- https://github.com/ccorcos/gcloud-functions-selenium-boilerplate
- https://www.roelpeters.be/how-to-deploy-a-scraping-script-and-selenium-in-google-cloud-run/
- https://towardsdatascience.com/scraping-the-web-with-selenium-on-google-cloud-composer-airflow-7f74c211d1a1

Jocelyn0830 commented 1 year ago

After some exploration, I found that Firebase does not support Firefox add-ons. LocalBase may be something we can use, since it advertises an offline database with the simplicity and power of Firebase and is built on the IndexedDB database (which our extension also uses). However, it is not ideal, because all the data is stored in the user's browser, and we would still need to somehow transfer that data to our local host since the browser is running in Docker.

I believe there is some way to download the CSV file from Docker to the local machine. I am still working on it.

SebastianZimmeck commented 1 year ago

> Firebase does not support Firefox add-ons.

That is not necessarily an issue. We need a database. The service we use for crawling can be separate from the service we use for the database.

SebastianZimmeck commented 1 year ago

As we discussed today, @Jocelyn0830 and @katehausladen will explore a bit more: (1) the architecture of our new crawler and database and (2) which services we can use to cloudify it.

A bit more detail:

(1) Architecture:

(a) Crawler: Our crawler (i.e., Selenium headless browser + Firefox extension) does not need to have its own database, so the IndexedDB can go out.

(b) Database: Our crawler can send all crawl data to a separate backend database, e.g., a SQL/MySQL/SQLite database. The extension contains JavaScript code for connecting to the database and sending data there via HTTP requests (e.g., POST requests); see the sketch below.
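To make (1)(b) concrete, here is a minimal sketch of how the extension could post a crawl result to a backend over HTTP; the endpoint URL and the field names are hypothetical placeholders, not a settled API.

```javascript
// Minimal sketch (hypothetical endpoint and fields): send one site's
// analysis result from the extension to a separate backend via POST.
async function sendAnalysisResult(result) {
  const response = await fetch("http://localhost:8080/analysis", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      domain: result.domain,          // site that was crawled
      gpcSignalSent: result.gpcSent,  // whether the GPC signal was sent
      uspString: result.uspString,    // e.g., the US Privacy string found
    }),
  });
  if (!response.ok) {
    throw new Error(`Backend rejected result: ${response.status}`);
  }
}
```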

(2) Cloudification:

(a) Crawler: @Jocelyn0830 already has some ideas for the crawler, e.g., deploying a Docker image on some service. For example, here is a tutorial mentioning Google Cloud Run.

(b) Database: For the database, there are different Google Cloud databases; for example, Cloud SQL for MySQL may be an option. @katehausladen knows MySQL and can help @Jocelyn0830 create a database. Before the cloudification, it may be worthwhile to create a local database (e.g., using XAMPP) to see how it goes; a minimal backend sketch follows below.
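As a rough illustration of (2)(b), this is a minimal sketch of a backend that accepts the extension's POST requests and writes them to a local MySQL database (e.g., one running under XAMPP); the table name, columns, and credentials are assumptions for illustration only.

```javascript
// Minimal sketch (assumed table, columns, and credentials): an Express
// endpoint that stores posted analysis results in a local MySQL database.
const express = require("express");
const mysql = require("mysql2/promise");

const app = express();
app.use(express.json());

const pool = mysql.createPool({
  host: "localhost",     // XAMPP's MySQL locally, Cloud SQL later
  user: "root",
  password: "",
  database: "gpc_crawl", // hypothetical database name
});

app.post("/analysis", async (req, res) => {
  const { domain, gpcSignalSent, uspString } = req.body;
  await pool.execute(
    "INSERT INTO analysis (domain, gpc_signal_sent, usp_string) VALUES (?, ?, ?)",
    [domain, gpcSignalSent, uspString]
  );
  res.sendStatus(201);
});

app.listen(8080, () => console.log("Backend listening on port 8080"));
```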

These are just suggestions. We may go a different route, use different tools and technologies, etc. The important point right now is to explore the possibilities. Then, we converge.

SebastianZimmeck commented 1 year ago

Glitch may also be worthwhile to check out for a cloud SQL database (for (2)(b) above). In particular, there is a little sample project with a SQLite database. It also includes a front end with two web pages that connect to the database; instead of those web pages we would use our browser extension. Glitch is similar to Heroku but free (Heroku just removed their free tier).

SebastianZimmeck commented 1 year ago

As discussed, @Jocelyn0830 and @katehausladen will scope out the tech stack. We are pretty much set on the frontend (Firefox extension in JS), but we need to decide what to use for the backend (e.g., PHP and MySQL, Python/Django, ...) and which cloud offering to use. The idea is to spend some time exploring, for example, with XAMPP or other technologies. Maybe there are also good tutorials.

Jocelyn0830 commented 1 year ago

For now, we are working on determining the tech stack for the cloud crawler. Concrete tasks will be documented in other issues.

katehausladen commented 1 year ago

https://towardsdatascience.com/build-a-scalable-web-crawler-with-selenium-and-pyhton-9c0c23e3ebe5 may be another useful resource. The GitHub repo for it is here: https://github.com/Postiii/twds-crawler.

This tutorial and the tutorials Jocelyn posted earlier (https://www.roelpeters.be/how-to-deploy-a-scraping-script-and-selenium-in-google-cloud-run/ and https://towardsdatascience.com/scraping-the-web-with-selenium-on-google-cloud-composer-airflow-7f74c211d1a1) all use Docker.

I will try to see whether there is an advantage to using Google Cloud Run, Google Kubernetes Engine, or Google Kubernetes Engine with Google Cloud Composer (Airflow), as that is the main difference between the tutorials.

katehausladen commented 1 year ago

In order to get more familiar with Docker, I ran the Selenium crawler locally using Docker. Since the steps haven't really been documented anywhere as far as I know, here are the instructions I followed. At the time of writing, there was no .xpi file in the repo; when there is one, it could be used rather than compressing the extension yourself.
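For context, the core of a local Docker setup along these lines would look roughly like the following; the exact image tag and crawler invocation are assumptions based on the standard Selenium images and our local-crawler.js script, not the exact commands used here.

```sh
# Start a Selenium standalone Firefox container and expose the remote
# WebDriver endpoint on port 4444 (shared memory bumped so Firefox does
# not crash on heavier pages).
docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-firefox:latest

# Run the crawler script, which connects to the remote WebDriver in the container.
node local-crawler.js
```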

It appeared that the crawl was successful, but I could not find the resulting CSV file. Jocelyn said she had the same issue. However, going forward this should not be a problem, as we will be using the database in Google Cloud rather than a CSV file. In today's meeting, Jocelyn said she has successfully used the Google Cloud database with the local-crawler.js file.

Now that I am a bit more familiar with Docker and the current Selenium crawler, I will try to follow one of the tutorials to put the Selenium crawler into the cloud.

SebastianZimmeck commented 1 year ago

Great write-up, @katehausladen! Could you add the Selenium Docker instructions to the readme (assuming that is how we are doing things)?

Jocelyn0830 commented 1 year ago

I modified the local-crawler code and the docker-crawler code to work with our new analysis extension. Kate provided instructions in the previous comment on how to run them locally on port 4444. I have been working on deploying the crawler script on Google Cloud Run. I went through some of the setup in Google Cloud Run, and I believe that we need to use the Google Cloud Container Registry. I will provide detailed instructions for @SebastianZimmeck once I have figured everything out.

I have done the following to set up the cloud crawler:

  1. Pull the standalone Firefox image from Docker Hub, configure it, and push it to our Google Cloud Container Registry.
  2. Write a Dockerfile for the crawler script's image. Test it locally and make sure it works.

Remaining work:

  1. Push the crawler image to the Google Cloud Container Registry.
  2. Run the image in the cloud (a rough sketch of the commands is below).
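For what it's worth, the remaining two steps might look roughly like this; the project ID, image name, region, and service name are placeholders, not our actual configuration.

```sh
# Allow docker to push to Google Container Registry with gcloud credentials.
gcloud auth configure-docker

# Tag the locally built crawler image and push it to our registry.
docker tag gpc-crawler gcr.io/PROJECT_ID/gpc-crawler
docker push gcr.io/PROJECT_ID/gpc-crawler

# Deploy the pushed image as a Cloud Run service.
gcloud run deploy gpc-crawler \
  --image gcr.io/PROJECT_ID/gpc-crawler \
  --region us-east1 \
  --platform managed
```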

SebastianZimmeck commented 1 year ago

Great work, @Jocelyn0830 and @katehausladen! Once you have figured out the whole setup, let's include it in the readme (to the extent it is not already in there):

  1. Development Setup
  2. Deployment Setup

Jocelyn0830 commented 1 year ago

We have set up the Google Cloud MySQL database and the Container Registry using the lab Google account. We pushed the Selenium standalone Firefox Nightly image to our registry and successfully deployed it. We also deployed the REST API. Now, when we run the crawler script locally, Selenium is actually running in a cloud Docker container: our extension is installed in a headless browser in the Docker Selenium grid, and the extension runs and sends data to the cloud REST API. Kate found that the analysis results of only some of the sites listed in the sites.csv file were successfully stored in the Cloud MySQL database. That means the crawler works, but we still need to check its performance. We may need to adjust some timeout settings in the crawler script and see if we can reach a higher success rate.
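If we go the timeout route, one knob that might help is the WebDriver timeout configuration in the crawler script; the values below are placeholders to illustrate the idea, not tuned settings.

```javascript
// Illustrative only (values are placeholders, not tuned settings):
// give slow sites more time before the crawler gives up on them.
async function applyCrawlTimeouts(driver) {
  await driver.manage().setTimeouts({
    pageLoad: 60000, // ms to wait for a page to finish loading
    script: 30000,   // ms to wait for injected/async scripts
    implicit: 5000,  // ms to wait when locating elements
  });
}
```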

SebastianZimmeck commented 1 year ago

Nice work, @Jocelyn0830 and @katehausladen!

Jocelyn0830 commented 1 year ago

At this stage, we have developed a cloud crawler. Closed.