privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Testing Keywords in PP, in preparation of the crawl #12

Closed by dadak-dom 5 months ago

dadak-dom commented 6 months ago

As discussed in our last meeting, we need to find a way to check that keywords (especially email) can be triggered in PP. First, we'll check whether they work normally in PP before trying to automate them in a crawl.

SebastianZimmeck commented 6 months ago

As background info for email address keywords, here is an NY Times article on the use of email addresses as ad identifiers. Maybe there is a way to trigger the Trade Desk integration on sites (not sure; long shot). Also, see the EFF article "After Cookies, Ad Tech Wants to Use Your Email to Track You Everywhere."

JoeChampeau commented 5 months ago

I've added in functionality for automatically clicking through the Google sign-in popup that'll appear for sites that integrate sign-in via Google. It seems to trigger a first party request with the email contained in it like we expect, but further testing is needed to make sure this is working properly.

Furthermore:

  1. Currently, the login doesn't seem to proceed unless the page is refreshed after the prompt is closed. We should make sure to account for duplicate data if refreshing remains necessary. Also, one of the pages, realtor.com, threw a bot detection error page after a refresh, so that could be an issue.
  2. We should consider how this will affect the crawler's timeout duration. Should we wait for the whole duration initially prescribed before attempting to log in, and then wait the same duration again after login? Or should it be split (i.e., 22 seconds in total, with the crawler checking halfway through whether a Google login notification has loaded, logging in if necessary, and then waiting the other half)?
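
To make the split option in point 2 concrete, here is a minimal sketch of how it might look. It is not the crawler's actual code: `LoginHooks`, `splitWait`, and `visitWithSplitWait` are hypothetical names, and the two hooks stand in for whatever the browser driver would do to detect and click through the Google prompt. The 22-second total is the figure from point 2.

```typescript
// Hypothetical hooks; a real crawler would back these with its browser driver.
interface LoginHooks {
  isGooglePromptVisible: () => Promise<boolean>;
  logInWithGoogle: () => Promise<void>;
}

// Default wait: a plain timer.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Split the total budget into two parts that always sum to totalMs.
function splitWait(totalMs: number): [number, number] {
  const first = Math.floor(totalMs / 2);
  return [first, totalMs - first];
}

// Wait the first half, log in if the prompt is showing, then wait the rest.
// Returns true if a login was attempted.
async function visitWithSplitWait(
  totalMs: number,
  hooks: LoginHooks,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<boolean> {
  const [first, second] = splitWait(totalMs);
  await wait(first);

  let attemptedLogin = false;
  if (await hooks.isGooglePromptVisible()) {
    await hooks.logInWithGoogle();
    attemptedLogin = true;
  }

  await wait(second);
  return attemptedLogin;
}
```

With `totalMs = 22000`, this waits 11 seconds, checks for the prompt once, and waits another 11 seconds, so the time spent on each site's waits stays the same whether or not a login happens. The time taken by the login itself, and any refresh it requires (point 1), would come on top of that.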

SebastianZimmeck commented 5 months ago

Excellent work, @JoeChampeau! Let's discuss the points you raise today.

SebastianZimmeck commented 5 months ago

We decided to only test/crawl for the following functionality:

Thus, at this point it is no longer necessary to explore this issue. We may reopen it if we decide differently.