Closed katehausladen closed 11 months ago
I just tested the crawler on my own mac (using newly merged main branch). The instructions are really easy to follow and local sql database can be easily set up. I crawled the validation set Kate mentioned above. The crawler ran very well and didn't crash at all. I played around with it and found that the crawler was able to automatically restart even I closed the window.
Result: There are 50 sites in the validation set. The success rate is 100%, meaning that crawler got every site. Total time used is about 26 minutes.
I will update Mac Mini results shortly.
I finished testing on Mac Mini as well. Looking at the terminal output, the crawler crashes several times but was robust enough to restart each time.
Result: In the end, the crawler was able to crawl all 50 sites. The success rate is 100%. Total time used is about 29 minutes.
Could you save the database data from both crawls as a json? One way to do this is just go to http://localhost:8080/analysis, right click, select save as, and save it as a json.
Then we can compare the crawl data to the ground truths. You can just put the files here.
Could you save the database data from both crawls as a json? One way to do this is just go to http://localhost:8080/analysis, right click, select save as, and save it as a json.
Then we can compare the crawl data to the ground truths. You can just put the files here.
@katehausladen should be completed now :)
We will be using Sebastian's old computer.
Back in May in issue #37, we discussed which computers we would use for our crawl but never really came to a concrete conclusion:
Since crawling 10,000 sites will take multiple days to complete, I think it would probably be best if we did not use our own computers. I think now would be a good time to decide whether we should get new Mac minis or if the current ones will work.
Some things to consider would be:
@Jocelyn0830, if you could do a small crawl on one of the Mac minis (assuming Professor Danner is not actively using both of them), that would be great. You can use this validation set (sites + Ground Truths). You can either run the crawl on your mac to compare or just compare the mac mini results to the results I got on my Mac. The run I did used the VPN, took 1601 seconds, and had no errors logged in error-logging.json. This google colab will help with the comparison.
If Oliver hasn't merged the issue-60 branch by the time you get to this, run the crawl from the issue-60 branch. The sql db creation command is in the PR.