Closed JoeChampeau closed 2 months ago
As per our last meeting, I've been doing some preliminary research for the sites we should crawl, and I've found some options that we could consider.
Those are just a couple of the options available. However, a lot of the other ones I've seen require some sort of payment, only let you view the top 50 per category, or don't offer a .csv download.
What about a custom list? The problem with Tranco is that it has .gov, .org, and other sites where probably not much is going on in terms of data collection and sharing.
Possibly, we can start with Tranco and adapt it.
I was able to download the BuiltWith list for free, which has their top million sites. It also says on this page that we can use it however we like (besides selling it), so it seems like we have the green light for that list. I took a quick look at both the BuiltWith and Tranco lists, and both seemed to have ".edu", ".gov", etc., so it seems like we'll probably have to adapt whatever list we choose to use. I'll try running a quick scrape tomorrow with the Tranco list and see what happens.
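The adaptation could be as simple as a TLD filter over the ranked CSV. Here's a minimal sketch, assuming the Tranco CSV's usual `rank,domain` rows with no header; the file names and the exact exclusion set are placeholders we'd tune later:

```python
import csv

# Sketch of TLD-based filtering over a ranked "rank,domain" CSV.
# The exclusion set is adjustable (e.g., add ".org" if we decide to).
EXCLUDED_TLDS = (".gov", ".edu")

def filter_tlds(in_path, out_path, excluded=EXCLUDED_TLDS):
    """Copy the ranked list, dropping domains that end in an excluded TLD."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for rank, domain in csv.reader(src):
            if not domain.lower().endswith(excluded):
                writer.writerow([rank, domain])
```

`str.endswith` accepts a tuple, so the whole exclusion set is checked in one call and the surviving rows keep their original ranks.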
What set of websites do we want to crawl? How large should this set be? A selection from the Tranco list is always a good option.
- Do we want to do region-specific crawls using a VPN?
To add on to this, we need to look into what variables we want to investigate and why we're investigating them. In this issue, I'll start a log of the different angles I've tried (e.g., comparing BuiltWith results vs. Tranco, using a VPN in California, etc.) so that we can refer back to it in the future.
I ran a quick scrape of the top 100 sites on BuiltWith (without a VPN). Two interesting things: there were significantly fewer HumanCheck errors, and significantly more data was being gathered. I'm going to keep looking at the BuiltWith list for now, trying different domains, and later switching my location with the VPN. Here's the data I gathered, in case anyone wants to see: nov13run.txt
Here are some results from running just the .gov sites from BuiltWith (no VPN):
Looking at the differences between US and non-US government sites might be an interesting thing to look into. Here's the data from this scrape: nov14run.csv
I ran a scrape of ~100 of the top sites from BuiltWith that were .edu, both with and without a VPN. The VPN was set to Los Angeles, California.
Note for later: hss.edu would cause the crawler to crash entirely, so for the time being, be sure to remove it from any crawl list.
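One way to guard against that: keep a small deny-list of domains known to crash the crawler and strip them before each run. A sketch (the helper name and structure are mine; hss.edu is the only entry we know of so far):

```python
# Deny-list of domains known to crash the crawler.
# hss.edu is the one we've hit so far; extend as new crashes appear.
CRASHING_DOMAINS = {"hss.edu"}

def clean_crawl_list(domains):
    """Return the crawl list with known-crashing domains removed."""
    return [d for d in domains if d not in CRASHING_DOMAINS]
```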
Here are my suggestions for the crawl lists. First, in terms of locations:
In terms of actual lists, here are my two suggestions (the actual lists will be attached at the bottom):
BuiltWith top 2000: After some modifications, I think this is a strong list because of the following:
Some downsides:
Second option (my preferred): BuiltWith + Majestic Million. I made this list by combining the top 1000 of each list (Majestic Million ranks sites by the number of referring subnets).
I think this list is better because it provides us with the solid foundation of Builtwith, and then on top of that, we get sites that are likely to be used by everyday users, such as shopping sites and social media, among others. This way, we get not only a good spread of TLDs, but also good coverage of the different ways in which people use the internet.
I've attached both lists here. sugg1builtwith.txt sugg2combo.txt
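For reference, the combination step amounts to taking the top 1,000 of each list, concatenating, and dropping duplicates while keeping rank order. A sketch (the function name is mine):

```python
def combine_top_n(list_a, list_b, n=1000):
    """Union of the top n domains from each ranked list,
    deduplicated while preserving order (list_a's ranks come first)."""
    seen = set()
    combined = []
    for domain in list_a[:n] + list_b[:n]:
        if domain not in seen:
            seen.add(domain)
            combined.append(domain)
    return combined
```

Because duplicates are dropped, the combined list can end up slightly shorter than 2n, which is worth remembering when sizing the crawl.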
Thanks, @dadak-dom!
Location | VPN | Privacy Law | Official Language | Connection Strength (per Mullvad) |
---|---|---|---|---|
Middletown, Connecticut, US | No | CTDPA | English | N/A |
Los Angeles, California, US | Yes | CCPA | English | 10 Gbps |
Miami, Florida, US | Yes | None | English | 10 Gbps |
Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |
Can we use the Tranco List (possibly, omitting some URLs based on our own criteria)?
I can start putting together a suggestion of 1,000 sites from this list, yes.
Is there an explanation of the different BuiltWith categories ("TechSpend" or "Traffic")? (I think not; calling @katehausladen.) I am a bit hesitant to make arguments based on methodologies that we do not understand or have no information about how they were applied. At a minimum, we would need to acknowledge this point as a limitation in our paper.
That makes sense. It seems like we feel much more comfortable with the Tranco list, so I'll start working with that (as above).
Instead of 5 locations with 2,000 sites each, should we do 10 locations with 1,000 sites each? That way, we could get a more comprehensive picture of the different privacy laws. I would also think that 2,000 sites would not give twice the insight that 1,000 sites would. If we opt for 10 locations, what would those be?
I'll look into this as well.
@SebastianZimmeck Here's an idea for the locations we could use:
Location | VPN | Privacy Law | Language | Connection Strength |
---|---|---|---|---|
Miami, Florida | Yes | None | English | 10 Gbps |
Los Angeles, California | Yes | CCPA | English | 10 Gbps |
London, UK | Yes | UK-GDPR | English | 10-20 Gbps |
Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |
Kyiv, Ukraine | Yes | On Protection of Personal Data | Ukrainian | 10 Gbps |
Johannesburg, South Africa | Yes | Protection of Personal Information Act | English (among others) | 10 Gbps |
Singapore | Yes | Personal Data Protection Act | English (among others) | 10 Gbps |
Melbourne, Australia | Yes | Privacy Act 1988 | English | 10 Gbps |
Auckland, New Zealand | Yes | The Privacy Act 2020 | English | 10 Gbps |
São Paulo, Brazil | Yes | LGPD | Portuguese | 1 or 10 Gbps, depending on the server |