Closed thebucketmouse closed 5 years ago
Hi, thanks for the report. I just tried running corpuscrawler on Urdu, and it got quite a bit farther than you reported. I hate to say it, but the problem may be your internet connection, or your ISP blocking requests to bbc.com.
For now I'm going to close the issue. If you can test using different wifi networks or running the tool on a different machine (such as a GCP instance), and if you still can reproduce the issue, please re-open the ticket. Thanks!
After seeing your reply I dug deeper and found that I was having a massive ping spike on my wireless connection exactly every 10 seconds. I think this has been causing me problems with other software as well. I searched online and found a fix, and now Corpus Crawler works perfectly and is running right now! Sorry I blamed my hardware problem on your program!
I am trying to use this crawler to build an Urdu corpus. I am running Ubuntu 18.04 inside a VMware virtual machine. The crawler starts and successfully downloads a few links, but eventually hangs permanently. Nothing happens until I press Ctrl-C to exit the script. If I kill the script and start it again, it successfully fetches the link it hung on during the previous run, then crawls a few more before hanging again. The text below is an example of what I get when I kill the script with Ctrl-C:
...(the crawler has successfully downloaded several links so far)...
Downloading: https://www.bbc.com/urdu/entertainment-37527961
Downloading: https://www.bbc.com/urdu/entertainment-37529481
Downloading: https://www.bbc.com/urdu/entertainment-37531642
Downloading: https://www.bbc.com/urdu/entertainment-37532975
^CTraceback (most recent call last):
File "./corpuscrawler", line 28, in
sys.exit(corpuscrawler.main.main())
File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/main.py", line 1249, in main
crawlsargs.language
File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/crawl_ur.py", line 22, in crawl
crawl_bbc_news(crawler, out, urlprefix='/urdu/')
File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 475, in crawl_bbc_news
fetchresult = crawler.fetch(url)
File "/home/thebucketmouse/Desktop/corpuscrawler-master/Lib/corpuscrawler/util.py", line 150, in fetch
content = response.read()
File "/usr/lib/python2.7/socket.py", line 355, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.7/httplib.py", line 597, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 384, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/ssl.py", line 772, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 659, in read
v = self._sslobj.read(len)
KeyboardInterrupt
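
The traceback shows the script blocked inside `response.read()`, waiting on an SSL socket `recv` that never returned. A read like that can wait indefinitely when no socket timeout is set, which is one way a flaky connection turns into a permanent hang. The following is a minimal sketch of a fetch that gives up on a stalled read and retries, written for Python 3 with the standard library only. It is not corpuscrawler's code: `fetch_with_timeout` and its parameters (`timeout`, `retries`, `backoff`, `opener`, `sleep`) are hypothetical names chosen for illustration.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=30.0, retries=3, backoff=2.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying when the connection stalls or drops.

    Hypothetical helper for illustration; not part of corpuscrawler.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            # The timeout covers the connect and each blocking socket
            # operation, so a read that stalls raises instead of hanging.
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.HTTPError:
            # The server answered (e.g. 404); retrying will not help.
            raise
        except (socket.timeout, urllib.error.URLError, ConnectionError) as err:
            last_error = err
            if attempt < retries - 1:
                # Wait a little longer after each failed attempt.
                sleep(backoff * (attempt + 1))
    raise last_error
```

Calling `fetch_with_timeout("https://www.bbc.com/urdu/entertainment-37532975")` would return the page bytes or raise after three failed attempts. Note that the timeout applies to each individual socket operation rather than to the whole download, so a connection that trickles data slowly can still take longer than `timeout` seconds in total.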