qiyeboy / spider_smooc

Crawl videos from 慕课网 (imooc)
362 stars 180 forks

Video downloads retrieve only a small number of bytes #4

Open yangqihua opened 7 years ago

yangqihua commented 7 years ago

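Hello. Although your project downloads each video in its own thread, every video raises an exception after only part of its binary content has been fetched. Is there a good way to download each video completely? Part of the exception output follows: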

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 520384 out of 47830612 bytes

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532059 out of 13004076 bytes

Exception in thread Thread-18:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 585460 out of 6128527 bytes

当前下载进度:---------------->>>>>>>> 6.47%Exception in thread Thread-48:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 582540 out of 24403607 bytes

Exception in thread Thread-36:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 10005207 bytes

Exception in thread Thread-35:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532058 out of 49727052 bytes

Exception in thread Thread-40:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 62159002 bytes

当前下载进度:---------------->>>>>>>> 6.50%Exception in thread Thread-7:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532063 out of 20505701 bytes

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 532065 out of 61492854 bytes

Exception in thread Thread-46:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 527684 out of 14292045 bytes

当前下载进度:---------------->>>>>>>> 6.53%Exception in thread Thread-2:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586084 out of 10502982 bytes

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Users/yangqihua/Documents/project/python_project/spider_smooc/filedeal/file_downloader.py", line 24, in run
    urllib.urlretrieve(fileurl,filepath, self.Schedule)#下载文件
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 289, in retrieve
    "of %i bytes" % (read, size), result)
ContentTooShortError: retrieval incomplete: got only 586087 out of 9053251 bytes
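
One possible way to handle the `ContentTooShortError` shown in the tracebacks above is to retry the download when it ends short. The following is a minimal sketch, not the project's code. The tracebacks come from Python 2.7, where the calls are `urllib.urlretrieve` and `urllib.ContentTooShortError`; the sketch uses the Python 3 equivalents. The function name `download_with_retry` and the `retries`, `delay`, and `fetch` parameters are assumptions made for illustration.

```python
import time
import urllib.request
from urllib.error import ContentTooShortError


def download_with_retry(url, path, retries=3, delay=1.0,
                        fetch=urllib.request.urlretrieve):
    """Call fetch(url, path), retrying when the transfer ends short.

    `fetch` is a hypothetical hook so the retry logic can be exercised
    without network access; by default it is urllib.request.urlretrieve.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url, path)
        except ContentTooShortError:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # simple linear back-off
```

Each retry starts the file again from the beginning, so this only helps when the connection is dropped intermittently. It does not resume a partial file and does not address a link that is saturated by too many parallel transfers.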

yangqihua commented 7 years ago

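The cause is probably local network speed. Could the program be changed to crawl with a single thread, or to limit the number of threads? Suppose a course on 慕课网 has 100 lessons and your local bandwidth is only 200 kb/s: each video then gets only about 2 kb/s, which will inevitably lead to the errors above. So would it be possible to cap the maximum number of download threads? Crawling videos is unlike crawling text or images: the bottleneck is not CPU utilization but network bandwidth.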
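
The thread cap suggested above could look roughly like this. It is a minimal sketch in Python 3 rather than the project's actual downloader: `download_all`, `max_workers`, and the `fetch` hook are hypothetical names, and the default of 4 workers is an arbitrary assumption, not a tuned value.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download_all(tasks, max_workers=4, fetch=urllib.request.urlretrieve):
    """Download (url, path) pairs with at most max_workers running at once.

    `fetch` is a hypothetical hook (default: urllib.request.urlretrieve)
    so the pool logic can be tried without network access.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remaining tasks wait in the pool's queue until a worker is free.
        futures = [pool.submit(fetch, url, path) for url, path in tasks]
        # result() re-raises any exception raised in the worker thread.
        return [f.result() for f in futures]
```

With the figures from the comment, and assuming the bandwidth is shared evenly, 4 concurrent downloads on a 200 kb/s link would each get about 50 kb/s instead of 2 kb/s. Setting `max_workers=1` gives the single-threaded behaviour also mentioned above.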

qiyeboy commented 7 years ago

I'll take a look at the program tomorrow and make some adjustments.