privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Which computers will we use to crawl? #69

katehausladen closed this issue 11 months ago

katehausladen commented 12 months ago

Back in May in issue #37, we discussed which computers we would use for our crawl but never really came to a concrete conclusion:

We decided in today's call to first test if adding more time would improve the performance of the Mac Minis. If that does not work, we may use everyone's laptops for the crawl, or we may buy newer dedicated crawl Macs.

Since crawling 10,000 sites will take multiple days to complete, I think it would probably be best if we did not use our own computers. I think now would be a good time to decide whether we should get new Mac minis or if the current ones will work.

Some things to consider would be:

  1. Success rate (i.e. How many sites are incorrectly analyzed?)
  2. Crash rate (i.e. How many sites fail to be analyzed due to Selenium/Firefox errors?)
  3. Site loading time (i.e. Do the Mac minis just need extra time to load sites? Or do they just not load resource-intensive sites in Selenium?)

@Jocelyn0830, if you could do a small crawl on one of the Mac minis (assuming Professor Danner is not actively using both of them), that would be great. You can use this validation set (sites + ground truths). You can either run the crawl on your Mac to compare or just compare the Mac mini results to the results I got on my Mac. The run I did used the VPN, took 1601 seconds, and had no errors logged in error-logging.json. This Google Colab will help with the comparison.

If Oliver hasn't merged the issue-60 branch by the time you get to this, run the crawl from the issue-60 branch. The SQL database creation command is in the PR.
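
If you want to sanity-check the results locally before opening the Colab, here is a rough Python sketch of the comparison. It assumes both files are JSON arrays of per-site records sharing a hypothetical "site_id" key; the real key and field names come from our analysis table, so adjust them to match what the Colab uses.

```python
# Minimal sketch: compare a crawl-results JSON to a ground-truth JSON.
# Assumptions (not from the repo): both files are JSON arrays of per-site
# records, and "site_id" is the shared key. Adjust key/field names to the
# actual schema before relying on the output.
import json
import sys

def load_records(path, key="site_id"):
    """Load a JSON array and index the records by the shared key."""
    with open(path) as f:
        return {rec[key]: rec for rec in json.load(f)}

def compare(crawl_path, truth_path):
    crawl = load_records(crawl_path)
    truth = load_records(truth_path)
    mismatches = []
    for site_id, expected in truth.items():
        observed = crawl.get(site_id)
        if observed is None:
            mismatches.append((site_id, "missing from crawl results"))
            continue
        # Check every field that the ground-truth record defines.
        for field, value in expected.items():
            if observed.get(field) != value:
                mismatches.append(
                    (site_id, f"{field}: expected {value!r}, got {observed.get(field)!r}")
                )
    return mismatches

if __name__ == "__main__":
    # Usage: python compare_crawl.py crawl.json ground_truth.json
    for site_id, problem in compare(sys.argv[1], sys.argv[2]):
        print(site_id, problem)
```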

Jocelyn0830 commented 12 months ago

I just tested the crawler on my own Mac (using the newly merged main branch). The instructions are really easy to follow, and the local SQL database is easy to set up. I crawled the validation set Kate mentioned above. The crawler ran very well and didn't crash at all. I played around with it and found that the crawler was able to automatically restart even when I closed the window.

Result: There are 50 sites in the validation set. The success rate is 100%, meaning the crawler got every site. The total time was about 26 minutes.

I will update Mac Mini results shortly.

Jocelyn0830 commented 12 months ago

I finished testing on the Mac Mini as well. Looking at the terminal output, the crawler crashed several times but was robust enough to restart each time.

Result: In the end, the crawler was able to crawl all 50 sites. The success rate is 100%. The total time was about 29 minutes.

katehausladen commented 12 months ago

Could you save the database data from both crawls as a JSON file? One way to do this is to go to http://localhost:8080/analysis, right-click, select Save As, and save it as a JSON file.


Then we can compare the crawl data to the ground truths. You can just put the files here.
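
If the manual Save As step gets tedious across machines, a scripted export works too. Below is a minimal sketch, assuming the crawler's local REST server is running and serves JSON at the http://localhost:8080/analysis URL mentioned above; the output filename is arbitrary.

```python
# Scripted alternative to the browser "Save As" step: fetch the analysis
# endpoint and write the response to a local JSON file.
import json
import requests

def save_analysis(out_path="analysis.json", url="http://localhost:8080/analysis"):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly if the local server isn't up
    with open(out_path, "w") as f:
        json.dump(response.json(), f, indent=2)

if __name__ == "__main__":
    save_analysis()
```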

Jocelyn0830 commented 12 months ago

@katehausladen should be completed now :)

katehausladen commented 11 months ago

We will be using Sebastian's old computer.