privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Check sites for .well-known/gpc.json #77

Closed SebastianZimmeck closed 8 months ago

SebastianZimmeck commented 9 months ago

It occurred to me that we should also check whether a site has the GPC support resource, i.e., .well-known/gpc.json.

Not sure if we have the data to check for the crawl that we already did and/or if we can implement additional functionality in the crawler.

katehausladen commented 9 months ago

It looks like our extension does have this functionality already.

In contentScript.js:

[Screenshot of the function, 2024-01-02 at 10:01 AM]

This function was being called during the crawl we already ran, but there was no listener in analysis.js (or any other file) listening for the message CONTENT_SCRIPT_WELLKNOWN. So, we were not storing that data in our last crawl. We'd just need to add a listener to analysis.js and create a column in the database to store that data.
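The database half of that change can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for the project's actual database, and the table layout, the `wellknown_data` column name, and the helper functions are all hypothetical, not taken from the repository.

```python
import json
import sqlite3


def make_demo_connection():
    """Create an in-memory stand-in for the entries table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, domain TEXT)")
    # The new column: holds the gpc.json contents as text, or NULL if none was found.
    conn.execute("ALTER TABLE entries ADD COLUMN wellknown_data TEXT")
    return conn


def store_wellknown(conn, domain, wellknown):
    """Insert one crawl entry; `wellknown` is a dict or None."""
    payload = None if wellknown is None else json.dumps(wellknown)
    conn.execute(
        "INSERT INTO entries (domain, wellknown_data) VALUES (?, ?)",
        (domain, payload),
    )
    conn.commit()


def load_wellknown(conn, domain):
    """Return the stored gpc.json data for a domain, or None if absent."""
    row = conn.execute(
        "SELECT wellknown_data FROM entries WHERE domain = ?", (domain,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return json.loads(row[0])
```

Storing the raw JSON as text keeps the column schema-agnostic, so unexpected fields in a site's gpc.json are preserved for later analysis.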

katehausladen commented 9 months ago

As discussed yesterday, we wanted to get .well-known/gpc.json data for our first crawl. I wrote a Python script to look for .well-known/gpc.json and ran it on our full set of sites. I used the full set with the redo sites replaced so that I could do it in a single run. I put the script and the results in the drive. Going forward, we can use the crawler to collect the data.
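For readers without access to the drive, a check of this kind might look roughly like the sketch below. It is not the script that was actually run: the function names, the timeout, the use of `https://`, and the decision to treat any non-200 response or unparsable body as "no well-known" are all assumptions. The fetcher is passed in as a parameter so the parsing logic can be exercised without network access.

```python
import json
from typing import Callable, Optional, Tuple

WELL_KNOWN_PATH = "/.well-known/gpc.json"


def fetch_with_requests(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL with requests.get and return (status code, body text)."""
    import requests  # third-party; imported lazily so the rest works offline

    response = requests.get(url, timeout=timeout)
    return response.status_code, response.text


def check_gpc_wellknown(
    site: str,
    fetch: Callable[[str], Tuple[int, str]] = fetch_with_requests,
) -> Optional[dict]:
    """Return the parsed gpc.json for a site, or None if missing or invalid."""
    url = "https://" + site.strip("/") + WELL_KNOWN_PATH
    try:
        status, body = fetch(url)
    except Exception:
        # Connection errors, timeouts, TLS failures: treat as "not found".
        return None
    if status != 200:
        return None
    try:
        data = json.loads(body)
    except ValueError:
        # Many sites answer with an HTML error page instead of JSON.
        return None
    return data if isinstance(data, dict) else None
```

Calling `check_gpc_wellknown("example.com")` would issue a real request; passing a stub as `fetch` makes it possible to spot-check the handling of 404s and HTML error pages.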

SebastianZimmeck commented 9 months ago

As discussed today, @Mattm27 will set up the crawler and look into adding the .well-known functionality with the help of @katehausladen.

katehausladen commented 8 months ago

@SebastianZimmeck, do we need a validation/test set for well-known? Or can we just assume that the Python requests.get function (in the case of the data I collected Jan 2) and the JavaScript fetch function (which will be used by the extension on subsequent crawls) will correctly return the JSON data if it exists?

SebastianZimmeck commented 8 months ago

I'd say we do not need a validation/test set for well-known (unless you are aware of any instances that were retrieved incorrectly, i.e., a site had a well-known resource but it was missed, or a site had no well-known resource but the crawler reported one).

katehausladen commented 8 months ago

Ok, I agree; I just wanted to make sure.

katehausladen commented 8 months ago

@Mattm27, could you update the wiki to include the new SQL command you used to create the entries database?

Mattm27 commented 8 months ago

> @Mattm27, could you update the wiki to include the new SQL command you used to create the entries database?

Sure thing. I will do that today.

Mattm27 commented 8 months ago

Closing this issue, as the wiki has been updated to include the new SQL command for the entries database.