privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Create Manually Curated List of Sites to Crawl #7

Closed JoeChampeau closed 5 months ago

JoeChampeau commented 8 months ago
  1. What set of websites do we want to crawl? How large should this set be? A selection from the Tranco list is always a good option.
  2. Do we want to do region-specific crawls using a VPN?
dadak-dom commented 7 months ago

As per our last meeting, I've been doing some preliminary research for the sites we should crawl, and I've found some options that we could consider.

Those are just a couple of the options available. However, many of the others I've seen require some sort of payment, only allow you to view the top 50 per category, or don't offer a .csv download.

SebastianZimmeck commented 7 months ago

What about a custom list? The problem with Tranco is that it has .gov, .org, and other sites where probably not much is going on in terms of data collection and sharing.

Possibly, we can start with Tranco and adapt it.

dadak-dom commented 7 months ago

I was able to download the BuiltWith list for free, which has their top million sites. It also says on this page that we can use it however we like (besides selling it), so it seems we have the green light for that list. I took a quick look at both the BuiltWith and Tranco lists, and both seemed to include ".edu", ".gov", etc., so we'll probably have to adapt whatever list we choose. I'll try running a quick scrape tomorrow with the Tranco list and see what happens.
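
Adapting a list mostly means dropping TLDs we don't care about. As a rough illustration (the file names and the exact TLD set here are assumptions, not part of our pipeline), filtering a Tranco-style rank,domain CSV could look like this:

```python
# Hypothetical sketch: drop non-commercial TLDs from a Tranco-style
# CSV with "rank,domain" rows. File names and the TLD set are
# placeholders, not the project's actual configuration.
import csv

EXCLUDED_TLDS = {"gov", "edu", "mil", "org"}

with open("tranco.csv", newline="") as infile, \
        open("tranco_filtered.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for rank, domain in csv.reader(infile):
        tld = domain.rsplit(".", 1)[-1].lower()
        if tld not in EXCLUDED_TLDS:
            writer.writerow([rank, domain])
```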

dadak-dom commented 7 months ago
> 1. What set of websites do we want to crawl? How large should this set be? A selection from the Tranco list is always a good option.
> 2. Do we want to do region-specific crawls using a VPN?

To add on to this, we need to look into what variables we want to investigate and why we're investigating them. In this issue, I'll start a log of the different angles I've tried (e.g., comparing BuiltWith results vs. Tranco, using a VPN in California) so that we can refer back to it in the future.

dadak-dom commented 7 months ago

I ran a quick scrape of the top 100 sites on BuiltWith (without a VPN). Two interesting observations: there were significantly fewer HumanCheck errors, and significantly more data was gathered. I'm going to keep looking at the BuiltWith list for now, trying different domains and later switching my location with the VPN. Here's the data I gathered, in case anyone wants to see it: nov13run.txt

dadak-dom commented 7 months ago

Here are some results from running just the .gov sites from BuiltWith (no VPN):

The differences between US and non-US government sites might be an interesting thing to look into. Here's the data from this scrape: nov14run.csv

dadak-dom commented 7 months ago

I ran a scrape of ~100 of the top sites from BuiltWith that were .edu, both with and without a VPN. The VPN was set to Los Angeles, California.

Note for later: hss.edu would cause the crawler to crash entirely, so for the time being, be sure to remove it from any crawl list.

dadak-dom commented 7 months ago

Here are my suggestions for the crawl lists. In terms of locations, I suggest the following:

In terms of actual lists, here are my two suggestions (the actual lists will be attached at the bottom):

BuiltWith top 2000: After some modifications, I think this is a strong list because of the following:

Some downsides:

Second option (my preferred), BuiltWith + Majestic Million: I made this list by combining the top 1,000 of each list (Majestic Million ranks sites by the number of referring subnets).

I think this list is better because it provides us with the solid foundation of BuiltWith, and on top of that we get sites that are likely to be used by everyday users, such as shopping sites and social media. This way, we get not only a good spread of TLDs but also good coverage of the different ways in which people use the internet.
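
For concreteness, a minimal sketch of how such a combined list can be built, assuming plain one-domain-per-line files (the input file names are placeholders; the attached lists below are the actual artifacts):

```python
# Hypothetical sketch: merge the top 1,000 domains from two source
# lists, deduplicating while preserving rank order. File names and the
# one-domain-per-line format are assumptions.
def top_n(path, n=1000):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()][:n]

builtwith = top_n("builtwith_top_million.txt")
majestic = top_n("majestic_million.txt")

# dict.fromkeys deduplicates while keeping first-seen order
combined = list(dict.fromkeys(builtwith + majestic))

with open("sugg2combo.txt", "w") as f:
    f.write("\n".join(combined))
```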

I've attached both lists here. sugg1builtwith.txt sugg2combo.txt

SebastianZimmeck commented 7 months ago

Thanks, @dadak-dom!

1. Here is how I see it at the moment:

| Location | VPN | Privacy Law | Official Language | Connection Strength (per Mullvad) |
| --- | --- | --- | --- | --- |
| Middletown, Connecticut, US | No | CTDPA | English | N/A |
| Los Angeles, California, US | Yes | CCPA | English | 10 Gbps |
| Miami, Florida, US | Yes | None | English | 10 Gbps |
| Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |

2. Questions

dadak-dom commented 7 months ago

> Can we use the Tranco List (possibly omitting some URLs based on our own criteria)?

I can start making a suggestion of 1000 sites from this list, yes 👍

> Is there an explanation of the different BuiltWith categories ("TechSpend" or "Traffic")? (I think no; calling @katehausladen.) I am a bit hesitant to make arguments based on methodologies that we do not understand or have information about how they were applied. At a minimum, we would need to acknowledge this point as a limitation in our paper.

That makes sense. It seems like we feel much more comfortable with the Tranco list, so I'll start working with that (as above).

> Instead of 5 locations with 2,000 sites each, should we do 10 locations with 1,000 sites each? That way, we could get a more comprehensive picture of the different privacy laws. I would also think that 2,000 sites would not give twice the insight that 1,000 sites would. If we opt for 10 locations, what would those be?

I'll look into this as well 👍

dadak-dom commented 7 months ago

@SebastianZimmeck Here's an idea for the locations we could use:

| Location | VPN | Privacy Law | Language | Connection Strength |
| --- | --- | --- | --- | --- |
| Miami, Florida | Yes | None | English | 10 Gbps |
| Los Angeles, California | Yes | CCPA | English | 10 Gbps |
| London, UK | Yes | UK-GDPR | English | 10-20 Gbps |
| Dublin, Ireland | Yes | GDPR, ePrivacy Directive | English | 10 Gbps |
| Kyiv, Ukraine | Yes | On Protection of Personal Data | Ukrainian | 10 Gbps |
| Johannesburg, South Africa | Yes | Protection of Personal Information Act | English (among others) | 10 Gbps |
| Singapore | Yes | Personal Data Protection Act | English (among others) | 10 Gbps |
| Melbourne, Australia | Yes | Privacy Act 1988 | English | 10 Gbps |
| Auckland, New Zealand | Yes | The Privacy Act 2020 | English | 10 Gbps |
| Sao Paulo, Brazil | Yes | LGPD | Portuguese | 1 or 10 Gbps, depending on server |

With this list, I was trying to get a good spread of privacy laws and locations. You can see I've kept it fairly English-heavy, but we can swap out some locations if we want greater diversity. The Tranco list seems to do a better job of creating a diverse pool of websites; it still has an English skew, but not as much as BuiltWith, I don't think.

I've also attached the Tranco list (with modifications) that you asked for.

sugg3tranco.txt

SebastianZimmeck commented 7 months ago

Nice work, @dadak-dom!

As we discussed today in our meeting, let's go for five locations:

For each location, we crawl a generic top 1,000 list that is the same everywhere. Then, we have a location-specific top 1,000 list (e.g., for Brazil, the top 1,000 .br country domains). This gives us comparability across the set of locations while also capturing some location-specific results.
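
A minimal sketch of how a location-specific list could be pulled from a Tranco-style rank,domain CSV by country-code TLD (the file name and function are illustrative, not our actual tooling):

```python
# Hypothetical sketch: collect the top n domains under a country-code
# TLD (e.g., ".br" for Brazil) from a Tranco-style "rank,domain" CSV.
import csv

def country_top_n(tranco_path, cctld, n=1000):
    picks = []
    with open(tranco_path, newline="") as f:
        for rank, domain in csv.reader(f):
            if domain.lower().endswith("." + cctld):
                picks.append(domain)
                if len(picks) == n:
                    break
    return picks

brazil_sites = country_top_n("tranco.csv", "br")
```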

For non-English-speaking countries, we will first need to spot-check whether the returned Privacy Pioneer results are good, i.e., whether the analysis works even when, say, Portuguese words are intermingled in the HTTP messages of the Brazil crawl.

Since all US states will have the same location list (unless we use state-specific lists; not sure how to create those, maybe via the whois database or BuiltWith?), we will have room for some more locations within the 10,000-site budget. So, we could also think of adding one, two, or three more countries/states to our list. An Asian country, maybe? Texas?

dadak-dom commented 6 months ago

@SebastianZimmeck For the country-specific lists, is there any reason why we can't use .com for the US? Based on what I could find, the US claims control over the domain, so we could argue in favor of that. It would also make more sense than .us, since so few sites use .us compared to .com. What do you think?

SebastianZimmeck commented 6 months ago

@dadak-dom, yes, in general I see no strong reasons why not. One minor reason is that the country-specific list would be close to the generic list. But if the reality is that the US dominates the top websites, then that is what it is. A second point is that we used .us as the country-specific list for the ML training data. But again, in my mind, this is not a reason why we couldn't switch to .com now. So, unless I am missing something, yes, let's switch to .com.

dadak-dom commented 6 months ago

@danielgoldelman Here's what I could gather for the "testing" I was assigned. Brazil sites used (for documentation purposes):

- https://uol.com.br
- https://shopee.com.br
- https://www.amazon.com.br/
- https://www.gov.br/pt-br
- https://olx.com.br
- https://mercadolivre.com.br
- https://terra.com.br
- https://caixa.gov.br
- https://acesso.gov.br
- https://www.magazineluiza.com.br/

Brazil summary: From what I could tell, PP definitely works on certain sites, while on others it finds nothing. Everything it did find seemed to come from servers operating in English, though, so maybe it can't find any requests in Portuguese. This would probably need to be investigated further.

Ukraine sites used:

- https://sinoptik.ua/
- https://www.olx.ua/uk/
- https://www.pravda.com.ua/
- https://prom.ua/
- https://tsn.ua/
- https://24tv.ua/
- https://epicentrk.ua/
- https://alerts.in.ua/
- https://www.unian.ua/
- https://tabletki.ua/

Ukraine summary: Similar to Brazil. Everything PP finds seems normal, so if there is a problem, it is more likely a problem of PP failing to detect requests it should (i.e., false negatives).

I will look into other countries soon. In general, it looks like PP works, but exactly how effectively, I'm not sure.

SebastianZimmeck commented 6 months ago

> Everything it did find seemed to come from servers operating in English, though, so maybe it can't find any requests in Portuguese. This would probably need to be investigated further.

It would be great if you could make a call, @dadak-dom. I'd say, if about 10% of a set of foreign-language sites we run fail to produce analysis results, we should not use that country. So, which countries clear that threshold?
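
To make the cutoff concrete, a toy sketch of the decision rule (the sample outcomes here are invented for illustration):

```python
# Hypothetical sketch of the ~10% cutoff: a country is usable if the
# share of sites producing no Privacy Pioneer analysis results stays
# at or below the threshold. The sample outcomes are invented.
def country_usable(outcomes, threshold=0.10):
    """outcomes: dict mapping site URL -> True if PP produced results."""
    failures = sum(1 for ok in outcomes.values() if not ok)
    return failures / len(outcomes) <= threshold

sample = {
    "https://uol.com.br": True,
    "https://shopee.com.br": False,
    "https://terra.com.br": True,
}
print(country_usable(sample))  # False: 1 of 3 sites failed (~33% > 10%)
```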

dadak-dom commented 6 months ago

Of the countries I have tested (Australia, Ukraine, Brazil, Ireland, and Singapore), I think we could use Ireland and Ukraine, as they had the fewest sites with no results. Based on my results, I do not think we should use Brazil, Singapore, or Australia. None of the countries had a failure rate below 10%, but that could be due to the small sample. If this seems alright, I can make a list of Ukrainian-TLD sites before the crawl.

SebastianZimmeck commented 6 months ago

Thanks, @dadak-dom!

> as they had the fewest sites with no results

A site can have a lot of results or just a few; either is OK. What matters is whether the analysis is correct on the results that are available, if any. So, take a look at the Privacy Pioneer analysis results and then try to manually evaluate whether they are correct, i.e., establish the ground truth. For example, you can use the browser developer tools and check manually (@danielgoldelman can provide more info on how to do a ground truth analysis).

I checked the first three Ukrainian sites (https://sinoptik.ua/, https://www.olx.ua/uk/, https://www.pravda.com.ua/). None of them had locations or personal data.

Can you test for locations (ZIP code, region, latitude, longitude, street address) and personal data (email address, phone number, custom keywords)? Those are much harder tasks than tracking and monetization, which just use deterministic techniques (e.g., rules matching URLs); locations use our ML model.
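
One way to approach such a ground-truth spot check, sketched here under assumptions (planted values and captured HTTP bodies saved as text files; none of this is our actual harness), is to scan the captured traffic for the values we planted and compare against what PP reported:

```python
# Hypothetical sketch: scan captured HTTP message bodies for planted
# location/personal-data values, to compare against Privacy Pioneer's
# findings. The planted values and directory layout are assumptions.
from pathlib import Path

PLANTED = ["90012", "34.0522", "-118.2437", "test@example.com"]

def manual_hits(capture_dir):
    hits = {}
    for path in Path(capture_dir).glob("*.txt"):
        body = path.read_text(errors="replace")
        found = [value for value in PLANTED if value in body]
        if found:
            hits[path.name] = found
    return hits

print(manual_hits("captures/sinoptik.ua"))
```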

JoeChampeau commented 6 months ago

If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords, like a Brazilian city (possibly with non-English diacritics, as in "São Paulo") or ZIP code. That way, we know PP works both:

  1. in non-English contexts (on non-English sites), and
  2. when targeting non-English data (like "São Paulo").

Maybe visiting with a VPN based in the country in question could accomplish this?

SebastianZimmeck commented 6 months ago

> If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords

Absolutely! @dadak-dom and @danielgoldelman, could you take care of that?

SebastianZimmeck commented 6 months ago

The idea is to crawl 525 location-specific sites and 525 general sites per location (i.e., 5,250 location-specific and 5,250 general sites, 10,500 total) for the following countries and US states:

dadak-dom commented 5 months ago

> > If we haven't already, it might also be useful to ensure PP still functions well when dealing with non-English language keywords
>
> Absolutely! @dadak-dom and @danielgoldelman, could you take care of that?

Okay, so here's what it looks like in terms of foreign languages: accents didn't seem to have any impact on the detection of locations. I compared what Privacy Pioneer found against all the requests, and it didn't look like the extension was missing anything, even when the city name had an accent.

Keywords, however, were a different story. I'm not entirely sure why, but PP would replace accents with some other character, so it could not accurately find custom keywords containing accents. So here are my suggestions, depending on how we want to do the crawl.

If we concern ourselves with custom, general keywords, then we might want to replace Spain, Ukraine, and Brazil. By this, I mean crawling with an instance of PP that is on the lookout for a custom keyword containing an accent. If not, I think we can keep Brazil and Spain.

However, I think that Ukraine needs to be replaced. PP doesn't seem to know how to handle the Cyrillic alphabet, so it would flood the extension with false positives for keywords.

For replacing Ukraine, I would suggest the following three countries. @SebastianZimmeck, if you could let me know what you think, that'd be great.

JoeChampeau commented 5 months ago

@dadak-dom Do you happen to have an example of a site and keyword with which the issue can be replicated? Regardless of whether or not we end up implementing general keywords for the crawl, it's probably worth looking into potential fixes for PP.

dadak-dom commented 5 months ago

@JoeChampeau That makes sense. For the accents, an example would be to go to sodexobeneficios.com.br and search for your keyword in the search bar. For example, my keyword was "hollà", and PP would identify it as "holl&"; if I searched something like "hollà com", then PP wouldn't find anything.

For Ukrainian, I would translate something like "hello" and paste it into the search bar of https://sinoptik.ua/. Once I had a keyword in Ukrainian, PP would find a bunch of keywords that didn't actually exist. If I remember correctly, it would claim that I had a keyword "reqU", and it would find the keyword in a bunch of requests.

Hopefully this helps.
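
For what it's worth, one plausible mechanism for the accent problem (unconfirmed against PP's actual code; this sketch only illustrates the class of failure) is that the keyword appears in HTTP messages only in an encoded form, so a search for the raw string finds nothing:

```python
# Hypothetical sketch: an accented keyword that reaches the wire
# percent-encoded (UTF-8) no longer matches a raw substring search.
# Whether this is PP's actual failure mode is not confirmed here.
from urllib.parse import quote

keyword = "hollà"
encoded = quote(keyword)  # 'holl%C3%A0'
request_url = f"https://example.com/search?q={encoded}"

print(keyword in request_url)  # False: the raw keyword never appears
print(encoded in request_url)  # True: only the encoded form does
```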

SebastianZimmeck commented 5 months ago

OK, let's remove Brazil, Spain, and Ukraine. Here is a new list:

@dadak-dom, can you check:

@JoeChampeau, maybe take a shallow look into the character issue @dadak-dom describes. If it is an easy fix or an implementation mistake, we can fix it. But it's probably not worth spending a huge amount of time on.

dadak-dom commented 5 months ago

@SebastianZimmeck Just looked into your questions, and here's what I could gather:

I'll get started on Canada and Germany, and if you could let me know a preference for the third, that would be great. Maybe France? It doesn't look like we have any more Asian countries to choose from.

SebastianZimmeck commented 5 months ago

> From a cursory glance, Hong Kong looks like it has a mix of English and Chinese sites. When I make the list, I could potentially just remove the Chinese sites, but I think it might make more sense to just replace it.

OK, then let's replace it.

> I'll get started on Canada and Germany, and if you could let me know a preference for the third, that would be great. Maybe France?

France would be good, but possibly there are also issues with accents there. If that is the case, let's pick Florida, US, to have one US location without a privacy law.

SebastianZimmeck commented 5 months ago

One more point: Germany has ä etc. Not sure if that makes a difference.

Also, in general, which sites we select and how well they will work depends on what we are going to test, i.e., the testing protocol (#9). Is testing the keywords (#12) even part of the protocol?

SebastianZimmeck commented 5 months ago

We are using the following list:

The reason is that we are not testing for keywords, emails, and phone numbers. Location detection should work even for non-English sites.