privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Update all crawl lists #27

Closed dadak-dom closed 3 months ago

dadak-dom commented 3 months ago

I noticed that our changes to the crawl lists haven't been reflected in the main branch yet. These should be the most up-to-date lists, including the changes due to the ExpressVPN switch and the Google Cloud VM switch.

dadak-dom commented 3 months ago

@JoeChampeau tagged you in this since it's related to #26 (in the sense that we want to be cleaning up our branches)

JoeChampeau commented 3 months ago

Are the lists updated to ensure there are no redirects, error pages, etc. when using Google Cloud?

dadak-dom commented 3 months ago

> Are the lists updated to ensure there are no redirects, error pages, etc. when using Google Cloud?

I haven't done anything like that, no. But since we've already removed a lot of non-VPN-friendly sites when changing to ExpressVPN, I think it might not be necessary. Also, it would take a lot of cloud credits to go through all the lists again on the cloud, so we would essentially be paying for another crawl. Of course, we can't really guarantee that every site will work until we run it with Selenium, and at that point we may as well be running "test crawls", but that wouldn't fit into our 10k site budget. For the time being, I think these lists are a happy medium.

JoeChampeau commented 3 months ago

Seems reasonable enough to me! And, just making sure, do the new lists - e.g. India, Germany, whichever - need to be verified with a VPN or are they good to go?

dadak-dom commented 3 months ago

> Seems reasonable enough to me! And, just making sure, do the new lists - e.g. India, Germany, whichever - need to be verified with a VPN or are they good to go?

Oh shoot, good catch. I didn't make the lists while connected to ExpressVPN, since it had already expired. We can't really verify with Mullvad since they don't have India. I can't think of a good alternative at the moment... maybe let's leave this open for now, so at the very least we know what's going on with the lists / what this branch does? Then we can figure out what to do about that.

SebastianZimmeck commented 3 months ago

> Are the lists updated to ensure there are no redirects, error pages, etc. when using Google Cloud?

> I haven't done something like that, no. But I think that since we've already removed a lot of non-VPN-friendly sites when changing to ExpressVPN, it might not be necessary.

I would expect that we will see some redirects, sites not working, etc. We should make sure that we know where data is coming from. For example, in case of a redirect, we should record the old and new URL as well as which data is related to which URL. (Maybe that is a separate issue unrelated to this PR, but flagging this here.)
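
One way the redirect bookkeeping described here could look is sketched below. This is only an illustration under stated assumptions, not the crawler's actual code: `VisitRecord`, `make_visit_record`, and the shape of the `findings` entries are hypothetical names invented for this example, and the URLs are placeholders. The idea is to keep both the requested URL and the URL the browser ended up on, and to attach every piece of collected data to that pair.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def _normalize(url: str) -> tuple:
    """Reduce a URL to comparable parts so a trailing slash or host casing
    alone is not counted as a redirect."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.hostname or "", parts.path.rstrip("/"), parts.query)


@dataclass
class VisitRecord:
    # Hypothetical record type: one entry per site in a crawl list.
    requested_url: str          # URL taken from the crawl list
    final_url: str              # URL the browser reported after loading
    redirected: bool            # True if the two URLs differ meaningfully
    cross_host: bool            # True if the redirect changed the hostname
    findings: list = field(default_factory=list)  # data tied to this visit


def make_visit_record(requested_url: str, final_url: str) -> VisitRecord:
    """Build a record that keeps both the old and the new URL."""
    redirected = _normalize(requested_url) != _normalize(final_url)
    cross_host = urlsplit(requested_url).hostname != urlsplit(final_url).hostname
    return VisitRecord(requested_url, final_url, redirected, cross_host)


# Example with placeholder URLs: a list entry that lands on a different host.
record = make_visit_record("http://example.in/news", "https://www.example.com/in/news")
# Hypothetical finding shape; the real analysis output may look different.
record.findings.append({"type": "location", "request_url": "https://tracker.example/collect"})
```

If the crawl were driven from Selenium's Python bindings, the final URL could be read from `driver.current_url` after `driver.get(requested_url)`. Whether the same approach fits this project's own crawler setup is an assumption.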

dadak-dom commented 3 months ago

> Are the lists updated to ensure there are no redirects, error pages, etc. when using Google Cloud?

> I haven't done something like that, no. But I think that since we've already removed a lot of non-VPN-friendly sites when changing to ExpressVPN, it might not be necessary.

> I would expect that we see some redirects, sites not working, etc. We should make sure that we know where data is coming from. For example, in case of a redirect, we should record the old and new URL as well as which data is related to which URL. (Maybe that is a separate issue unrelated to this PR, but flagging this here.)

Just to clarify, each URL in these lists has already been vetted (redirects, ad servers, etc. have been removed). The main concern is whether certain sites will behave differently on the cloud versus over a VPN. For example, we had to replace a lot of sites when switching from Mullvad to ExpressVPN because many sites blocked ExpressVPN but not Mullvad (see #16). Unless we get ExpressVPN just for the sake of verifying India, Canada, and Germany, then there's not much that we can do about this point, since re-checking the crawl lists on the Cloud is not feasible.

SebastianZimmeck commented 3 months ago

> then there's not much that we can do about this point, since re-checking the crawl lists on the Cloud is not feasible.

Yes, we can just test on the test set (and if it goes well, do the crawl).