Closed: SebastianZimmeck closed this issue 8 months ago
It looks like our extension does have this functionality already.
In contentScript.js:
This function was being called during the crawl we already ran, but there was no listener in analysis.js (or any other file) for the CONTENT_SCRIPT_WELLKNOWN message. So we were not storing that data in our last crawl. We would just need to add a listener to analysis.js and create a column in the database to store that data.
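A minimal sketch of what that missing listener in analysis.js could look like. The message name CONTENT_SCRIPT_WELLKNOWN comes from contentScript.js; the handler name, the message/payload shape, and the in-memory store are assumptions for illustration (the real version would write to the database column mentioned above):

```javascript
// Hypothetical handler, kept pure so the logic is testable outside the extension.
// Assumed message shape: { msg: "CONTENT_SCRIPT_WELLKNOWN", origin, data }.
function handleWellKnownMessage(message, store) {
  if (!message || message.msg !== "CONTENT_SCRIPT_WELLKNOWN") {
    return false; // not the message we care about
  }
  // Record the fetched .well-known/gpc.json payload, keyed by site origin.
  store[message.origin] = message.data;
  return true;
}

// Register the listener only when running inside the extension.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  const wellKnownStore = {};
  chrome.runtime.onMessage.addListener((message) => {
    handleWellKnownMessage(message, wellKnownStore);
  });
}
```

In the actual extension, the body of `handleWellKnownMessage` would insert into the new database column rather than a plain object.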
As discussed yesterday, we wanted to get .well-known/gpc.json data for our first crawl. I wrote a Python script to look for .well-known/gpc.json and ran it on our full set of sites. I used the full set with the redo sites replaced so that I could do it in one run. I put the script and the results in the drive. Going forward, we can use the crawler to collect the data.
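For the going-forward case, the crawler-side check could look roughly like this. This is a sketch, not the extension's actual code: the function names are hypothetical, and the only parts taken from the discussion are the `/.well-known/gpc.json` path and the use of the JavaScript `fetch` function:

```javascript
// Build the well-known URL from any page URL on the site,
// e.g. "https://example.com/some/page" -> "https://example.com/.well-known/gpc.json".
function wellKnownUrl(pageUrl) {
  return new URL("/.well-known/gpc.json", pageUrl).href;
}

// Return the parsed gpc.json for a site, or null if it is missing,
// unreachable, or not valid JSON (error handling here is an assumption).
async function fetchGpcJson(pageUrl) {
  try {
    const resp = await fetch(wellKnownUrl(pageUrl));
    if (!resp.ok) return null; // e.g. 404: no well-known resource
    return await resp.json();  // e.g. { "gpc": true, "lastUpdate": "..." }
  } catch (err) {
    return null; // network error or invalid JSON
  }
}
```

One design note: treating 404, network errors, and malformed JSON all as "no well-known data" keeps the database column simple, but those cases could also be recorded separately if we want to distinguish them later.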
As discussed today, @Mattm27 will set up the crawler and look into adding the .well-known functionality with the help of @katehausladen.
@SebastianZimmeck, do we need a validation/test set for well-known? Or can we just assume that the Python requests.get function (in the case of the data I collected Jan 2) and the JavaScript fetch function (which will be used by the extension on subsequent crawls) will correctly return the JSON data if it exists?
I'd say we do not need a validation/test set for well-known (unless you are aware of any instances that were retrieved incorrectly, i.e., there was a well-known resource but it was missed, or there was no well-known resource but the crawler returned one for a site).
Ok, I agree; I just wanted to make sure.
@Mattm27, could you update the wiki to include the new SQL command you used to create the entries database?
Sure thing. I will do that today.
Closing this issue as the wiki has been updated to include the new SQL command for the entries database.
It occurred to me that we should also check whether a site has the GPC support resource, such as
Not sure whether we have the data to check this for the crawl we already did and/or whether we can implement the additional functionality in the crawler.