privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Perform crawl #62

Open SebastianZimmeck opened 1 month ago

SebastianZimmeck commented 1 month ago

Once testing is done (#9), we will create a release (#23) and start the crawl as described in this issue.

1. Countries

The countries to crawl are:

  1. Australia
  2. Brazil
  3. Canada
  4. Germany
  5. India
  6. Singapore
  7. South Africa
  8. South Korea
  9. Spain
  10. United States

2. Sites

The top 525 sites for each country are listed in this repo.

In addition to each country's top 525 sites, we also crawl the United States top 525 list from each country as a general list.

This leads to 19 * 525 = 9,975 crawled sites in total: 10 country-specific lists plus the United States list crawled from each of the other 9 countries.
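
The count above can be double-checked with a small sketch. This is only an illustration: the names `COUNTRIES`, `SITES_PER_LIST`, `GENERAL_LIST`, and `crawl_plan` are hypothetical and not part of the crawler. It assumes that each country crawls its own list plus the United States list, and that the United States does not crawl its own list twice.

```python
# Hypothetical sketch: all names here are illustrative, not from the repo.
COUNTRIES = [
    "Australia", "Brazil", "Canada", "Germany", "India",
    "Singapore", "South Africa", "South Korea", "Spain", "United States",
]
SITES_PER_LIST = 525
GENERAL_LIST = "United States"  # assumed to serve as the general list


def crawl_plan(countries, general=GENERAL_LIST):
    """Return (crawl_location, site_list) pairs, one per crawl run."""
    plan = []
    for country in countries:
        # Every country crawls its own top-sites list.
        plan.append((country, country))
        # Every country except the general one also crawls the general list.
        if country != general:
            plan.append((country, general))
    return plan


plan = crawl_plan(COUNTRIES)
total_sites = len(plan) * SITES_PER_LIST
print(len(plan), total_sites)
```

Under these assumptions the plan contains 19 runs. Note that the total counts site visits rather than unique sites, since the United States list is visited from several locations.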

3. Google Cloud VM

@atlasharry identified the Google Cloud VMs for each country:

[Image: vms]

[Image: seoul]

@atlasharry will take the lead on the crawl and organize how others can help (however it makes sense).