Closed: LiarOnce closed this issue 3 months ago
The logs look normal. As for why no content was captured, the log itself mentions the reason: page url set = 0. As for why the page urls count is 0, I'm not at my computer right now and can't investigate.
@LiarOnce Update to the latest source code to verify whether this is fixed.
It can fetch the pages normally now, but an error occurs during the final generation step:
Traceback (most recent call last):
File "c:\Users\LiarOnce\Documents\lightnovel\masiro.py", line 7, in <module>
linovelib_epub.run()
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\linovel.py", line 414, in run
novel = self._spider.fetch()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 55, in fetch
novel = asyncio.run(self._fetch())
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 68, in _fetch
novel = await self._crawl_book_by_browser(book_url, page, login_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 125, in _crawl_book_by_browser
await self.fetch_chapters(session, final_catalog_list, new_novel)
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\base_spider.py", line 325, in fetch_chapters
url_to_page[url] = self.extract_body_content(page)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 778, in extract_body_content
body_content = html_content.find('div', {'class': 'nvl-content'}).prettify()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'prettify'
from linovelib2epub import Linovelib2Epub, TargetSite

if __name__ == '__main__':
    bookId = 251
    # Raw string so backslashes in the Windows path are not treated as escapes
    browserPath = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
    linovelib_epub = Linovelib2Epub(book_id=bookId, target_site=TargetSite.MASIRO,
                                    browser_path=browserPath, log_level='DEBUG')
    linovelib_epub.run()
Update the code and retry.
Sorry for the late reply. The current problem is that Masiro still demands Cloudflare verification after crawling for a while, and the spider log does not reflect this, so everything fetched after the site starts requiring verification comes back empty.
I suggest supporting persistence of intermediate crawl results.
"Masiro still demands Cloudflare verification after crawling for a while"
Yes, the previous Masiro crawling logic did not handle this case, which made the crawl fragile. There are a couple of possible solutions:
Option 1: persist intermediate results, saving each page locally as soon as it is fetched, and keep retrying until the content of all pages has been obtained.
Option 2: add a Cloudflare-challenge check while fetching pages; if a challenge is detected, solve it and then keep retrying that page.
Option 2 has now been implemented.
Code location: linovelib2epub.spider.masiro_spider.MasiroSpider._download_page
If you have any other questions, feel free to ask.
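The detect-and-retry approach described above could look roughly like the following sketch. The marker strings, helper names, and retry policy are all assumptions for illustration; the real `_download_page` depends on the browser automation the project actually uses:

```python
import asyncio

# Heuristic markers commonly seen on Cloudflare interstitial pages
CHALLENGE_MARKERS = ("Just a moment...", "cf-challenge", "Checking your browser")


def looks_like_challenge(html: str) -> bool:
    """True when the response looks like a Cloudflare interstitial."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


async def download_page(fetch, url: str, max_retries: int = 5) -> str:
    """Fetch a page, retrying whenever the response is a Cloudflare
    interstitial instead of real chapter content."""
    for attempt in range(max_retries):
        html = await fetch(url)
        if not looks_like_challenge(html):
            return html
        # Back off before retrying; the browser session is expected to
        # pass the challenge in the meantime.
        await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"still blocked by Cloudflare after {max_retries} tries: {url}")
```

The key point is that a challenge page must be treated as a transient failure for that one page, not silently stored as empty content.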
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 0 days.
This issue was closed because it has been stalled for 0 days with no activity.
As the title says, the generated content is clearly wrong, yet the log shows no errors.
The code:
The log: