privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Perform February Crawl #85

Closed: katehausladen closed this issue 5 months ago

katehausladen commented 5 months ago

I talked to Daniel, and this week is the best week for me to have the computer. I started the crawl last night.

katehausladen commented 5 months ago

The crawl is finished! GPP implementation nearly doubled since December!

katehausladen commented 5 months ago

I reopened the issue to merge the code used for this crawl. Since I crawled the whole crawl set with these changes, I went ahead and merged them. The changes were: (1) cap the debugging table entries at 4,000 characters, since that is what our table allows; (2) add another human-check regular expression; and (3) update the README to reflect the well-known changes.
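For readers following along, changes (1) and (2) can be pictured with a minimal sketch like the one below. It is illustrative only and not the merged code: the names `cap_entry`, `looks_like_human_check`, and `HUMAN_CHECK_PATTERNS` are hypothetical, and the regular expressions are made-up examples, since the actual expressions are not shown in this thread. Only the 4,000-character limit comes from the comment above.

```python
import re

# Limit taken from the comment above: the debugging table allows 4,000 characters.
MAX_ENTRY_CHARS = 4000

# Hypothetical patterns for illustration; the crawler's real human-check
# expressions are not listed in this thread.
HUMAN_CHECK_PATTERNS = [
    re.compile(r"verify (that )?you are (a )?human", re.IGNORECASE),
    re.compile(r"are you a robot", re.IGNORECASE),
]


def cap_entry(value: str, limit: int = MAX_ENTRY_CHARS) -> str:
    """Truncate a debugging entry so it fits within the table's column limit."""
    return value if len(value) <= limit else value[:limit]


def looks_like_human_check(page_text: str) -> bool:
    """Return True if any (assumed) human-check pattern appears in the page text."""
    return any(pattern.search(page_text) for pattern in HUMAN_CHECK_PATTERNS)
```

Truncating before the insert keeps an oversized error message from failing the database write, at the cost of losing the tail of very long messages.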

katehausladen commented 5 months ago

Here's the updated analysis flow / architecture PowerPoint: web-crawler-architecture.pptx