Closed ctaoist closed 5 years ago
It is supported — for example:

```python
await spider01.start()
await spider02.start()
```
The start() method doesn't take await, does it? I tried:

```python
await spider01.async_start()
await spider02.async_start()
```

But judging from the output, the second spider only starts after the first one finishes. The output looks like this:
```
[2019:06:15 12:32:42] INFO Spider Spider started!
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566589992
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590152
.....
Spider finished!
[2019:06:15 12:32:42] INFO Spider Spider started!
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590312
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590472
...
Spider finished!
```
A truly asynchronous run should produce output more like this:
```
[2019:06:15 12:32:42] INFO Spider Spider started!
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566589992
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590152
[2019:06:15 12:32:42] INFO Spider Spider started!
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590312
[2019:06:15 12:32:42] INFO Spider Worker started: 140268566590472
...
..
Spider finished!
...
Spider finished!
```
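The interleaving described above is easy to reproduce with plain asyncio, independent of Ruia. This minimal sketch (the `spider` coroutine and `events` list are illustrative stand-ins, not Ruia API) shows that `asyncio.gather` lets both "spiders" start before either finishes:

```python
import asyncio

events = []  # collects log lines so the ordering is easy to inspect

async def spider(name):
    # Each "spider" logs its start, yields control at the await,
    # then logs its finish once the simulated I/O completes.
    events.append(f"{name} started")
    await asyncio.sleep(0.1)  # stands in for network I/O
    events.append(f"{name} finished")

async def main():
    # gather() schedules both coroutines on the same event loop,
    # so both "started" lines appear before either "finished" line.
    await asyncio.gather(spider("spider01"), spider("spider02"))

asyncio.run(main())
print(events)
```

If the two spiders were awaited sequentially instead, "spider01 finished" would appear before "spider02 started".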
Right, I got the method name wrong — it can be done, you're just calling it incorrectly. Like this:

```python
import asyncio

loop = asyncio.get_event_loop()
coros = [MiddlewareSpiderDemo.async_start(middleware=middleware),
         MiddlewareSpiderDemo.async_start(middleware=middleware)]
loop.run_until_complete(asyncio.gather(*coros))
```
I tried that too, but it raises this error:

```
File "uvloop/loop.pyx", line 1451, in uvloop.loop.Loop.run_until_complete
concurrent.futures._base.CancelledError
```
To start multiple spiders concurrently, the caller has to cancel some leftover tasks themselves. Since you need this, I've added support for both forms. The code is updated to 0.6.0:
```
pip install git+https://github.com/howie6879/ruia
```
```python
import asyncio


async def main():
    tasks = [
        MiddlewareSpiderDemo.async_start(middleware=middleware, cancel_tasks=False),
        MiddlewareSpiderDemo.async_start(middleware=middleware, cancel_tasks=False),
    ]
    results = await asyncio.gather(*tasks)
    # Add this part: cancel every task that is still pending,
    # then wait for the cancellations to settle.
    tasks = []
    for task in asyncio.Task.all_tasks():
        if task is not asyncio.tasks.Task.current_task():
            tasks.append(task)
            task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
```
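One caveat for readers on newer Python versions: `asyncio.Task.all_tasks()` and `Task.current_task()` were deprecated in 3.7 and removed in 3.9. A sketch of the same cleanup pattern using the module-level functions (the helper name `cancel_pending_tasks` is my own, not part of Ruia):

```python
import asyncio

async def cancel_pending_tasks():
    # Collect every task except the one running this coroutine,
    # cancel them all, and wait for the cancellations to settle.
    pending = [t for t in asyncio.all_tasks()
               if t is not asyncio.current_task()]
    for task in pending:
        task.cancel()
    # return_exceptions=True swallows the CancelledErrors so they
    # don't propagate out of run_until_complete / asyncio.run.
    await asyncio.gather(*pending, return_exceptions=True)
```

Calling this at the end of `main()`, after the spiders' `gather` returns, plays the same role as the loop over `asyncio.Task.all_tasks()` above.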
I'll go try it — that was fast, thanks!
With the current design it's one spider per site, but I want to crawl multiple sites at the same time — is there a way to launch several spiders and crawl concurrently?
Also, building it around class methods and class variables doesn't feel very flexible for this kind of case.