howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0
1.75k stars, 181 forks

Starting multiple spiders at the same time for truly asynchronous crawling #69

Closed ctaoist closed 5 years ago

ctaoist commented 5 years ago

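With the current design, the model seems to be one spider per website. But I want to scrape several websites at once. Is there a way to start multiple spiders and have them crawl concurrently?

Also, building this around class methods and class variables doesn't feel very flexible for this case.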

howie6879 commented 5 years ago

This is supported; see the following example:

await spider01.start()
await spider02.start()
ctaoist commented 5 years ago

The start() method doesn't need await, does it? I tried:

await spider01.async_start()
await spider01.async_start()

But judging from the output, the second spider still only starts after the first one has finished. The output looks like this:

[2019:06:15 12:32:42] INFO  Spider  Spider started!
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566589992
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590152
.....
Spider finished!
[2019:06:15 12:32:42] INFO  Spider  Spider started!
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590312
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590472
...
Spider finished!

Shouldn't truly asynchronous output look more like this:

[2019:06:15 12:32:42] INFO  Spider  Spider started!
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566589992
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590152
[2019:06:15 12:32:42] INFO  Spider  Spider started!
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590312
[2019:06:15 12:32:42] INFO  Spider  Worker started: 140268566590472
...
..
Spider finished!
...
Spider finished!
howie6879 commented 5 years ago

Right, I gave the wrong method name. It can be done; you were just using it incorrectly. Do it like this:

import asyncio

loop = asyncio.get_event_loop()

coros = [MiddlewareSpiderDemo.async_start(middleware=middleware),
             MiddlewareSpiderDemo.async_start(middleware=middleware)]
loop.run_until_complete(asyncio.gather(*coros))
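For readers comparing the two approaches in this thread, here is a minimal sketch of why awaiting coroutines one after another serializes them while `asyncio.gather` interleaves them. It does not use ruia at all: `fake_spider` is a hypothetical stand-in for a spider's `async_start()`, and `asyncio.run` assumes Python 3.7+.

```python
import asyncio


async def fake_spider(name, log):
    # Hypothetical stand-in for a spider run: record start, yield to the
    # event loop while "crawling", then record finish.
    log.append(f"{name} started")
    await asyncio.sleep(0.01)
    log.append(f"{name} finished")


async def run_sequentially():
    # Each await blocks this coroutine until that spider is completely done.
    log = []
    await fake_spider("spider01", log)
    await fake_spider("spider02", log)
    return log


async def run_concurrently():
    # gather() wraps both coroutines in tasks, so both start before
    # either one finishes.
    log = []
    await asyncio.gather(
        fake_spider("spider01", log),
        fake_spider("spider02", log),
    )
    return log


sequential_log = asyncio.run(run_sequentially())
concurrent_log = asyncio.run(run_concurrently())
print(sequential_log)
print(concurrent_log)
```

In the sequential log, `spider02 started` appears only after `spider01 finished`; in the concurrent log, both "started" entries come first, which matches the interleaved output asked for above.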
ctaoist commented 5 years ago

I tried that as well, but it fails with this error:

File "uvloop/loop.pyx", line 1451, in uvloop.loop.Loop.run_until_complete
concurrent.futures._base.CancelledError
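One possible explanation for this traceback, offered as an assumption rather than a description of ruia's internals: if a coroutine cancels every other pending task on the loop when it finishes, it also cancels the task that is awaiting `asyncio.gather`, and the cancellation surfaces from `run_until_complete`. The sketch below reproduces that pattern with plain asyncio; `greedy_spider` and `slow_spider` are hypothetical names, and Python 3.7+ is assumed.

```python
import asyncio


async def greedy_spider():
    # Hypothetical spider that "cleans up" by cancelling every task
    # except itself once it is done.
    await asyncio.sleep(0.01)
    current = asyncio.current_task()
    for task in asyncio.all_tasks():
        if task is not current:
            task.cancel()


async def slow_spider():
    # Hypothetical second spider that is still running when the first
    # one finishes.
    await asyncio.sleep(1)


async def main():
    # The task running main() is one of the tasks greedy_spider cancels.
    await asyncio.gather(greedy_spider(), slow_spider())


def run_demo():
    try:
        asyncio.run(main())
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"


outcome = run_demo()
print(outcome)
```

Under these assumptions the outer task is cancelled along with the slow spider, so the caller sees `CancelledError` instead of a normal return.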
howie6879 commented 5 years ago

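To start multiple spiders concurrently, the user has to cancel some of the tasks themselves. Since you need this, I'll add both forms.

Update the code to 0.6.0: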

pip install git+https://github.com/howie6879/ruia
if __name__ == '__main__':
    async def main():
        tasks = [MiddlewareSpiderDemo.async_start(middleware=middleware, cancel_tasks=False),
                 MiddlewareSpiderDemo.async_start(middleware=middleware, cancel_tasks=False)]
        results = await asyncio.gather(*tasks)

        # Add this block: cancel every task that is still pending, then wait for them to finish
        tasks = []
        for task in asyncio.Task.all_tasks():
            if task is not asyncio.tasks.Task.current_task():
                tasks.append(task)
                task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

    import asyncio

    loop = asyncio.get_event_loop()

    loop.run_until_complete(main())
    loop.close()
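A note for readers on newer interpreters: `asyncio.Task.all_tasks()` and `asyncio.Task.current_task()` were deprecated in Python 3.7 and removed in Python 3.9. The sketch below shows the same cleanup step with the module-level `asyncio.all_tasks()` and `asyncio.current_task()`. It is independent of ruia: `cancel_remaining_tasks` and `background_worker` are hypothetical names, and the workers only stand in for tasks left pending by spiders started with `cancel_tasks=False`.

```python
import asyncio


async def cancel_remaining_tasks():
    # Cancel every task on the running loop except the caller, then wait
    # until all of them have actually finished being cancelled.
    current = asyncio.current_task()
    pending = [task for task in asyncio.all_tasks() if task is not current]
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def background_worker():
    # Hypothetical long-running worker that would otherwise keep the
    # loop busy.
    await asyncio.sleep(3600)


async def main():
    workers = [asyncio.create_task(background_worker()) for _ in range(3)]
    await asyncio.sleep(0)  # give the workers a chance to start
    cancelled_count = await cancel_remaining_tasks()
    return cancelled_count, all(worker.cancelled() for worker in workers)


result = asyncio.run(main())
print(result)
```

Passing `return_exceptions=True` keeps the `CancelledError` of each cancelled task from propagating out of the final `gather`, which is the same reason it appears in the snippet above.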
ctaoist commented 5 years ago

I'll give it a try. That was fast, thanks!