privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Remove well-known functionality from crawler and document python script for getting .well-known #93

Closed by Mattm27 7 months ago

Mattm27 commented 7 months ago

We must remove the .well-known functionality that was added to the web crawler in Jan. 2024. Instead, we will use a Python script to collect the .well-known data, so that script needs testing and documentation as well.

katehausladen commented 7 months ago

I added the well-known Python script to the repo. To run it, you will also need redo-original-sites.csv and redo-sites.csv in the selenium-optmeowt-crawler folder. Since these files change with each crawl, I'm not putting them in the repo. They can be found in the Crawl_Data_Feb_2024 and Crawl_Data_Dec_2023 folders for the February and December crawls, respectively. This information, along with instructions on how to start the script, is in the Python script itself. I'll add a section to the readme repeating this information.
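For readers without the repo at hand, here is a minimal sketch of what collecting well-known data for a list of sites could look like. It is not the script from the repo. It assumes the GPC support resource lives at `/.well-known/gpc.json` (as the GPC specification describes) and that the first column of the CSV holds the site; the function names `well_known_url`, `parse_gpc_json`, `read_sites`, and `fetch_gpc` are hypothetical.

```python
import csv
import io
import json
import urllib.request
from urllib.parse import urlsplit


def well_known_url(site):
    """Build the GPC well-known URL for a site given as a domain or a full URL."""
    site = site.strip()
    if "://" not in site:
        site = "https://" + site  # assumption: default to HTTPS for bare domains
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/.well-known/gpc.json"


def parse_gpc_json(body):
    """Return the boolean `gpc` member, or None if the body is not valid GPC JSON."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("gpc"), bool):
        return data["gpc"]
    return None


def read_sites(csv_text):
    """Return the first column of every non-empty row (assumed CSV layout)."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in rows if row and row[0].strip()]


def fetch_gpc(site, timeout=10):
    """Fetch and parse a site's gpc.json; None on any network or decoding error."""
    try:
        with urllib.request.urlopen(well_known_url(site), timeout=timeout) as resp:
            return parse_gpc_json(resp.read().decode("utf-8", errors="replace"))
    except (OSError, ValueError):
        return None
```

A caller would read the CSV contents, pass them to `read_sites`, and call `fetch_gpc` for each site. Sites that return an HTML error page instead of JSON come back as `None` rather than raising.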

katehausladen commented 7 months ago

I updated the readme to include a section on running the well-known Python script, and I removed the well-known column and its description from the readme. I also added a few steps to the wiki explaining how the data should be saved so that it integrates with the colabs.

Updated architecture diagrams: web-crawler-architecture.pptx

Mattm27 commented 7 months ago

Note: once this functionality is stripped from the crawler, the Wiki needs to be updated so that the SQL command for creating the 'entries' database no longer includes a well-known column.
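As a rough illustration of the change to the SQL command, the sketch below creates an `entries` table without a well-known column and then lists the resulting columns. It uses Python's built-in `sqlite3` as a stand-in for the project's actual database, and the columns shown (`id`, `domain`, `sent_gpc`) are placeholders rather than the real schema.

```python
import sqlite3

# Placeholder schema: the real 'entries' columns are not reproduced here.
# The only point is that no well-known column is declared.
CREATE_ENTRIES = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    domain TEXT NOT NULL,
    sent_gpc INTEGER
)
"""


def create_entries_table(conn):
    """Create the table and return its column names for a quick sanity check."""
    conn.execute(CREATE_ENTRIES)
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    return [row[1] for row in conn.execute("PRAGMA table_info(entries)")]


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
columns = create_entries_table(conn)
```

Checking `columns` after creation is a cheap way to confirm that the documented command and the crawler's insert statements agree on the column list.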