google / corpuscrawler

Crawler for linguistic corpora

crawler gets hung after downloading a few hits #44

Closed thebucketmouse closed 5 years ago

thebucketmouse commented 5 years ago

I am trying to use this crawler to build an Urdu corpus. I am running Ubuntu 18.04 inside a VMware virtual machine. The crawler starts and successfully downloads a few links, but eventually hangs permanently. Nothing happens until I press Ctrl-C to exit the script. If I kill the script and start it again, it successfully fetches the link it hung on during the previous run, then crawls a few more before hanging again. The text below is an example of what I get when I kill the script with Ctrl-C:

...(the crawler has successfully downloaded several links so far)...
Downloading: https://www.bbc.com/urdu/entertainment-37527961
Downloading: https://www.bbc.com/urdu/entertainment-37529481
Downloading: https://www.bbc.com/urdu/entertainment-37531642
Downloading: https://www.bbc.com/urdu/entertainment-37532975
^CTraceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/main.py", line 1249, in main
    crawlsargs.language
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/crawl_ur.py", line 22, in crawl
    crawl_bbc_news(crawler, out, urlprefix='/urdu/')
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 475, in crawl_bbc_news
    fetchresult = crawler.fetch(url)
  File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 150, in fetch
    content = response.read()
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 597, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 772, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 659, in read
    v = self._sslobj.read(len)
KeyboardInterrupt
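
For context, the traceback above ends inside a blocking socket read (response.read() down to the SSL recv call), which is the usual picture when a connection stalls and no socket timeout applies. The following is a minimal sketch, not the crawler's actual code, of how a fetch can be bounded with a timeout and a few retries. It is written for Python 3 rather than the Python 2.7 shown in the traceback, and the function name fetch_with_timeout and its parameters (timeout, retries, backoff, opener) are hypothetical and chosen only for illustration.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=30.0, retries=3, backoff=1.0,
                       opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if every attempt fails.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and is
    a parameter only so the function can be exercised without a network.
    """
    for attempt in range(retries):
        try:
            # The timeout applies to each blocking socket operation, so a
            # stalled read raises socket.timeout instead of hanging forever.
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except (socket.timeout, urllib.error.URLError, ConnectionError) as err:
            print('Attempt %d failed for %s: %r' % (attempt + 1, url, err))
            # Linear backoff: no wait after the first failure, then longer.
            time.sleep(backoff * attempt)
    return None
```

With a wrapper like this, a stalled download surfaces as a logged failure after the timeout elapses, and the crawl can retry or move on to the next URL instead of waiting indefinitely.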

sffc commented 5 years ago

Hi, thanks for the report. I just tried running corpuscrawler on Urdu, and it has gotten quite a bit farther than you reported. I hate to say it, but the problem may be with your internet connection, or your ISP may be blocking the requests to bbc.com.

For now I'm going to close the issue. If you can test using different wifi networks or running the tool on a different machine (such as a GCP instance), and if you still can reproduce the issue, please re-open the ticket. Thanks!

thebucketmouse commented 5 years ago

After seeing your reply I dug deeper and found that I was having a massive ping spike on my wireless connection exactly every 10 seconds. I think this has been causing me problems with other software as well. I searched online and found a fix, and now Corpus Crawler works perfectly and is running right now! Sorry I blamed my hardware problem on your program!