privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Determine research direction #18

Closed SebastianZimmeck closed 3 months ago

SebastianZimmeck commented 4 months ago

It turned out to be difficult to use the VPN for different countries. Here are a few options to try:

katehausladen commented 4 months ago

> Ask @katehausladen if she has some insight

I have only used the Mullvad VPN in Los Angeles for crawling. I haven't ever used it for other countries. My only suggestion would be trying multiple different IP addresses in the same location if the VPN has more than one.

SebastianZimmeck commented 4 months ago

Thanks, @katehausladen!

Indeed, it seems like the GPC crawl is different from the Privacy Pioneer crawl. @danielgoldelman will describe the problem to you in more detail, and maybe you can think of additional ideas. This is just in case.

JoeChampeau commented 4 months ago

As discussed during the meeting, trying out anyIP's static residential proxies didn't bear any fruit. Seems like proxies in general may not be able to offer any advantages over cloud computing or VPNs, although it might still be worth trying out what @natelevinson10 found.

dadak-dom commented 4 months ago

I've been running test crawls on the Google Cloud VM. As @danielgoldelman mentioned, the cloud option provides a significant improvement for location elements, but performs worse for the analytics category. As discussed in the meeting, I'll start looking into what could possibly be going wrong here.

natelevinson10 commented 3 months ago

Similarly to @JoeChampeau, I also tested static ISP proxies from the provider Bright Proxies and found that they did not offer any advantages either. With these findings, we can likely close the door on the possibility of using web scraping proxies to test the crawler.

SebastianZimmeck commented 3 months ago

Sounds good, @natelevinson10!