privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Explore web crawler #3

Closed SebastianZimmeck closed 8 months ago

SebastianZimmeck commented 8 months ago

As discussed yesterday and today, we want to explore to what extent the current selenium-optmeowt-crawler can also be used for Privacy Pioneer. We would certainly need to modify the crawler so that it works for both GPC analysis and Privacy Pioneer. The base assumption is that we can do it. However, if in the course of the work we find that it is not possible, we will implement two separate crawlers. But before going that route, we should make sure that both extensions are sufficiently different. @danielgoldelman, @JoeChampeau, and @jjeancharles will look at this from the Privacy Pioneer end and @katehausladen from the GPC web end (some modifications may also be necessary there to accommodate both, e.g., abstractions for how long to stay on a site before moving to the next, error handling, ...).

SebastianZimmeck commented 8 months ago

As discussed on MS Teams, we will revisit whether it makes sense to separate out the crawler with common functionality into its own repo once we have finalized both the concrete GPC Web and Privacy Pioneer crawler implementations:

[11:47 AM] Kate Hausladen

Daniel and I went through the GPC crawler script to determine which parts would also be used for the Privacy Pioneer crawler. The code we found would be common to any Selenium crawler that visits a set of sites from a CSV file and accomplishes the following tasks: (1) importing libraries and the CSV file, (2) initializing the Firefox browser, (3) visiting a site.

I don’t think it makes sense to merge the crawlers over a few common Selenium function calls. Since the bulk of each script is going to be drastically different, I think having one crawler will overcomplicate both of our projects. I do think that the overlapping code we identified can serve as a good starting point for the Privacy Pioneer crawler, as Daniel can now focus on Privacy Pioneer-specific issues instead of reading Selenium documentation.

[12:02 PM] Sebastian Zimmeck

OK, then go ahead with separate crawlers for the time being. Once we have the crawlers, we can concretely revisit the point and see whether and to what extent it makes sense to create a separate crawler repo with (1) to (3).
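
For reference, the common part (1) to (3) could look roughly like the sketch below. This is only an illustration, not code from either crawler: it assumes Python with Selenium 4, a CSV file with one URL in the first column and no header row, and the function names (load_sites, create_firefox_driver, visit_sites, main) and the dwell_seconds parameter are hypothetical. Extension loading and all GPC- or Privacy Pioneer-specific logic are left out.

```python
import csv
import time
from typing import List, Tuple


def load_sites(csv_file) -> List[str]:
    """(1) Read site URLs from an open CSV file (assumed: first column, no header)."""
    sites = []
    for row in csv.reader(csv_file):
        # Skip blank lines and rows whose first cell is empty.
        if row and row[0].strip():
            sites.append(row[0].strip())
    return sites


def create_firefox_driver(headless: bool = True, page_load_timeout: int = 30):
    """(2) Start Firefox through Selenium.

    Selenium is imported here so that the other helpers can be used
    (and tested) on a machine without Selenium or Firefox installed.
    """
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(page_load_timeout)
    return driver


def visit_sites(driver, sites: List[str], dwell_seconds: float = 0.0) -> List[Tuple[str, str]]:
    """(3) Visit each site, stay for dwell_seconds, and record failures instead of aborting."""
    results = []
    for url in sites:
        try:
            driver.get(url)
            time.sleep(dwell_seconds)  # how long to stay before moving to the next site
            results.append((url, "ok"))
        except Exception as exc:  # e.g. a page-load timeout; keep crawling
            results.append((url, f"error: {type(exc).__name__}"))
    return results


def main(csv_path: str, dwell_seconds: float = 10.0) -> None:
    """Wire the three steps together for one crawl."""
    with open(csv_path, newline="") as f:
        sites = load_sites(f)
    driver = create_firefox_driver()
    try:
        for url, status in visit_sites(driver, sites, dwell_seconds):
            print(url, status)
    finally:
        driver.quit()
```

Calling main("sites.csv") would run one crawl over a hypothetical sites.csv. Keeping the dwell time and the error handling as parameters of visit_sites is one way to get the abstractions mentioned in the first comment; each crawler would then add its own extension-specific steps around the driver.get call.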