privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Design protocol for determining crawl accuracy over time #136

Open SebastianZimmeck opened 2 months ago

SebastianZimmeck commented 2 months ago

@katehausladen provided an initial analysis accuracy evaluation as shown in our draft paper (section 3.5). Starting with the September crawl (#118), we should come up with a protocol for manually checking, for each crawl going forward, whether the crawl results are accurate on 100 randomly selected sites. As we are crawling over longer periods of time, we might otherwise see a drift in accuracy, for example, due to code changes or site changes, and should therefore keep an eye on it.

I am particularly concerned about the following:

A few comments:

The bottom line is that we need a protocol for checking the analysis accuracy of our different conditions (including sub-conditions) for every crawl so that we can keep track of accuracy over time. Since we need to do this for every crawl and it involves manual work, it should be manageable time-wise but also meaningful.
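
For concreteness, here is a minimal sketch of what such a per-condition check could look like, assuming the crawl results and the manual ground truth are exported as CSVs keyed by a `site` column (all column names here are placeholders, not our actual schema):

```python
import csv
from collections import defaultdict

# Placeholder condition columns; substitute whatever the crawl output actually uses.
CONDITIONS = ["usp_before_gpc", "usp_after_gpc", "gpp_before_gpc", "gpp_after_gpc"]

def load(path):
    # Assumes a CSV with one row per site and a "site" column as the key.
    with open(path, newline="") as f:
        return {row["site"]: row for row in csv.DictReader(f)}

def accuracy_per_condition(crawl_csv, manual_csv):
    crawl, manual = load(crawl_csv), load(manual_csv)
    counts = defaultdict(lambda: {"match": 0, "total": 0})
    for site, truth in manual.items():
        if site not in crawl:
            continue  # spot-checked site missing from the crawl output
        for cond in CONDITIONS:
            counts[cond]["total"] += 1
            if crawl[site].get(cond) == truth.get(cond):
                counts[cond]["match"] += 1
    return {c: v["match"] / v["total"] for c, v in counts.items() if v["total"]}

# One accuracy figure per condition per crawl, so drift shows up over time.
for cond, acc in accuracy_per_condition("crawl_results.csv", "manual_spotcheck.csv").items():
    print(f"{cond}: {acc:.2%}")
```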

@natelevinson10 will take the lead here and work with @franciscawijaya and @eakubilo before starting the next crawl.

natelevinson10 commented 1 month ago

I did a quick review of the manual data we collected a couple of weeks ago and targeted instances of a mismatch (marked in red), i.e., where our ground truth was different from the crawl data. I used several VPN locations (California, multiple Colorado locations, Virginia, and no VPN (CT)) and gave ample time to let all of the site content load.

I was not able to find a single instance of our manual data changing from what we had reported, except for bumble.com's USPapi_before being "1YNN" instead of the reported "1YYN", which I would chalk up to a manual error on our end. It seems that, for mismatches between crawl and manual data, the manual data is the more accurate of the two.
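
For string-valued fields like the USP API value, it also helps to diff position by position, since each character of the 4-character US Privacy string has its own meaning under the IAB spec (version, notice, opt-out of sale, LSPA). A quick sketch:

```python
# Positions of the 4-character IAB US Privacy (USP API) string.
FIELDS = ["specification version", "explicit notice", "opt-out of sale", "LSPA covered"]

def diff_usp(crawl_value, manual_value):
    # Report which positions disagree between the crawl and the manual check.
    return [
        f"{FIELDS[i]}: crawl={c!r} manual={m!r}"
        for i, (c, m) in enumerate(zip(crawl_value, manual_value))
        if c != m
    ]

# The bumble.com case above: crawl reported "1YYN", manual check found "1YNN".
print(diff_usp("1YYN", "1YNN"))  # -> ["opt-out of sale: crawl='Y' manual='N'"]
```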

As for the comments left by @SebastianZimmeck above, cookies do seem to load on every refresh; I have yet to find an instance where an OptAnon cookie does not load where it should. I plan to do some more testing over the next few days to be certain. As for our site sample skew, I believe it could be worth having a subset of websites we know to have GPP / OTGPPConsent data. One thought is to compile a list of websites we know to exhibit all of the behaviors we need (e.g., USP API opts out after receiving a GPC signal vs. USP API already opted out before receiving a GPC signal, etc.), as these are crucial in our crawl list for getting a holistic representation of results. I plan on seeing if there is a list or directory of websites with certain attributes that could simplify the search for these websites, if that is something we choose to do.
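
One low-effort starting point, before looking for an external directory, could be to mine our own past crawl output for sites that already exhibit each behavior. A rough sketch (the column names and behavior tests are placeholders, not our actual schema):

```python
import csv
from collections import defaultdict

def opt_out_flag(usp_string):
    # Third character of the US Privacy string is the opt-out-of-sale flag.
    return usp_string[2] if len(usp_string) == 4 else None

# Behaviors we want represented; keys are labels, values test one crawl row.
BEHAVIORS = {
    "usp_opts_out_after_gpc": lambda r: opt_out_flag(r.get("usp_before", "")) == "N"
                                        and opt_out_flag(r.get("usp_after", "")) == "Y",
    "usp_already_opted_out":  lambda r: opt_out_flag(r.get("usp_before", "")) == "Y",
    "has_gpp_string":         lambda r: bool(r.get("gpp_before", "")),
    "has_OTGPPConsent":       lambda r: bool(r.get("OTGPPConsent", "")),
}

def sites_by_behavior(crawl_csv):
    # Group sites from a past crawl by the behaviors they exhibit.
    found = defaultdict(list)
    with open(crawl_csv, newline="") as f:
        for row in csv.DictReader(f):
            for name, test in BEHAVIORS.items():
                if test(row):
                    found[name].append(row["site"])
    return found
```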

SebastianZimmeck commented 1 month ago

Thanks, @natelevinson10!

SebastianZimmeck commented 3 weeks ago

As discussed today, @franciscawijaya and @natelevinson10 will come up with a protocol for selecting 100 sites for a manual spot check of the first batch, with sufficient coverage (say, at least 5 positive instances, if possible) for each item we test for.
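
One way the coverage requirement could be implemented, as a sketch (the function and its inputs are hypothetical): first pick positive instances for each under-covered item, then fill the remainder of the 100 at random.

```python
import random

def select_spotcheck_sites(all_sites, positives, n=100, min_per_item=5, seed=0):
    """
    all_sites: list of crawled site domains
    positives: dict mapping each tested item to the set of sites positive for it
    Picks positive instances for each item first (up to min_per_item), then
    fills the remainder of the n-site sample at random.
    """
    rng = random.Random(seed)
    selected = set()

    for item, pos in positives.items():
        needed = max(0, min_per_item - len(selected & pos))
        candidates = sorted(pos - selected)
        rng.shuffle(candidates)
        selected.update(candidates[:needed])

    # Fill up to n with random sites; coverage picks are kept even if they
    # alone already exceed n.
    remaining = [s for s in all_sites if s not in selected]
    rng.shuffle(remaining)
    selected.update(remaining[:max(0, n - len(selected))])
    return sorted(selected)
```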

@franciscawijaya and @natelevinson10, you can write the protocol here in the issue for the time being.

natelevinson10 commented 2 weeks ago

To assess the accuracy of the crawl data across the crawl as a whole, our protocol should focus on selecting a representative, stratified sample of 100 sites for manual review. The plan is to run a crawl batch and select 100 sites to review via the constraints below. We will then compare our results from the crawl to our manual review.

After compiling this list of 100 sites, we will manually check them in a similar fashion as we did with the CO crawl here. We will use the same methodology for verifying the manual results here. Here is the initial plan @franciscawijaya and I reviewed; let us know what you think.
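
To make the comparison step concrete, here is a small sketch of how the crawl-vs-manual diff could be turned into a per-site mismatch list (analogous to the cells we marked in red last time); the column names are again placeholders:

```python
import csv

# Placeholder column names for the fields we spot check.
CHECKED = ["usp_before_gpc", "usp_after_gpc", "gpp_before_gpc", "gpp_after_gpc"]

def mismatch_report(crawl_csv, manual_csv, out_csv="mismatches.csv"):
    # Writes one row per (site, field) where the crawl and the manual review disagree.
    def load(path):
        with open(path, newline="") as f:
            return {row["site"]: row for row in csv.DictReader(f)}
    crawl, manual = load(crawl_csv), load(manual_csv)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["site", "field", "crawl_value", "manual_value"])
        for site in sorted(set(crawl) & set(manual)):
            for field in CHECKED:
                if crawl[site].get(field) != manual[site].get(field):
                    writer.writerow([site, field, crawl[site].get(field), manual[site].get(field)])
```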