lightnovel-center / linovelib2epub

Crawl light novel from some websites and convert it to epub.
https://pypi.org/project/linovelib2epub/
GNU Affero General Public License v3.0

Masiro (真白萌): only the top-level headings are fetched, with no content #39

Closed LiarOnce closed 3 months ago

LiarOnce commented 5 months ago

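As the title says, the generated output clearly does not match the book's content, yet the log shows nothing abnormal. (screenshot)

The code is as follows: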

from linovelib2epub import Linovelib2Epub, TargetSite
bookId = 251
# Raw string so the backslashes in the Windows path are not read as escape sequences.
browserPath = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"

if __name__ == '__main__':
    linovelib_epub = Linovelib2Epub(book_id=bookId, target_site=TargetSite.MASIRO, browser_path=browserPath)
    linovelib_epub.run()

Log:

2024-03-24,23:16:09 INFO     Linovelib2Epub         linovel.py                :428   ================================================================================
2024-03-24,23:25:41 INFO     MasiroSpider           masiro_spider.py          :250   -> 已登录
2024-03-24,23:25:45 INFO     MasiroSpider           masiro_spider.py          :454   User points balance is 646.
2024-03-24,23:25:45 INFO     MasiroSpider           masiro_spider.py          :98    当前所有卷都是免费积分或你已经购买,直接执行下载。
2024-03-24,23:25:45 INFO     MasiroSpider           masiro_spider.py          :131   page url set = 0
2024-03-24,23:25:45 INFO     MasiroSpider           masiro_spider.py          :139   DOWNLOAD_PAGES concurrency level: 1.
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :330   volume: 公告 
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :330   volume: 译名整合 
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :330   volume: 文库插图 
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :330   volume: web版 
2024-03-24,23:25:45 INFO     Linovelib2Epub         linovel.py                :418   The data of book(id=251) except image files is ready.
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :207   Image download strategy: ASYNCIO
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :215   len of image list: 1
2024-03-24,23:25:45 INFO     MasiroSpider           base_spider.py            :112   len of light_novel_images= 1
2024-03-24,23:25:48 INFO     MasiroSpider           base_spider.py            :183   image url https://masiro.me/images/encode/cover-210615175153-CpMM.png => local relative path novel_images/masiro.me/251/cover-210615175153-CpMM.png ok.
2024-03-24,23:25:48 INFO     MasiroSpider           base_spider.py            :154   SUCCEED_COUNT: 1
2024-03-24,23:25:48 INFO     MasiroSpider           base_spider.py            :155   [NEXT TURN]Pending task count: 0
2024-03-24,23:25:48 INFO     MasiroSpider           base_spider.py            :193   (Perf metrics) Download Images took: 3.654403300024569 seconds
2024-03-24,23:25:48 INFO     EpubWriter             linovel.py                :42    [Config]: has_illustration: True; divide_volume: False
2024-03-24,23:25:48 INFO     EpubWriter             linovel.py                :60    (Perf metrics) Write epub took: 0.0743418000638485 seconds
2024-03-24,23:25:48 INFO     Linovelib2Epub         linovel.py                :425   Write epub finished. Now delete all the artifacts if set.
2024-03-24,23:25:48 INFO     Linovelib2Epub         linovel.py                :428   ================================================================================
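
wdpm commented 5 months ago

The log looks normal. The reason no content was fetched is also stated in the log: page url set = 0. As for why the set of page URLs is empty, I am away from my computer right now and cannot investigate.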

wdpm commented 5 months ago

@LiarOnce Please update to the latest source code and check whether the issue is fixed.

LiarOnce commented 5 months ago

The content can now be captured, but an error is raised at the end, during generation:

Traceback (most recent call last):
  File "c:\Users\LiarOnce\Documents\lightnovel\masiro.py", line 7, in <module>
    linovelib_epub.run()
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\linovel.py", line 414, in run
    novel = self._spider.fetch()
            ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 55, in fetch
    novel = asyncio.run(self._fetch())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\miniconda3\envs\lightnovel\Lib\asyncio\base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 68, in _fetch
    novel = await self._crawl_book_by_browser(book_url, page, login_info)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 125, in _crawl_book_by_browser
    await self.fetch_chapters(session, final_catalog_list, new_novel)
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\base_spider.py", line 325, in fetch_chapters
    url_to_page[url] = self.extract_body_content(page)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\LiarOnce\Documents\lightnovel\src\linovelib2epub\spider\masiro_spider.py", line 778, in extract_body_content 
    body_content = html_content.find('div', {'class': 'nvl-content'}).prettify()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'prettify'
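
The AttributeError above means that `find` returned `None`, i.e. the fetched page contained no `div` with class `nvl-content`. A minimal sketch of a guard is shown below. It assumes a BeautifulSoup-style object whose `find` returns `None` when nothing matches; `PageContentMissing` and this standalone function are hypothetical names for illustration, not part of linovelib2epub.

```python
class PageContentMissing(Exception):
    """Raised when a fetched page lacks the expected content container."""


def extract_body_content(html_content, css_class="nvl-content"):
    # html_content is assumed to be a BeautifulSoup-style object whose
    # find() returns None when no matching element exists.
    node = html_content.find("div", {"class": css_class})
    if node is None:
        # Fail with a descriptive error instead of AttributeError on None.
        raise PageContentMissing(f"no <div class={css_class!r}> found in page")
    return node.prettify()
```

A caller could catch `PageContentMissing` to log the affected URL and retry, rather than aborting the whole crawl.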

wdpm commented 5 months ago

from linovelib2epub import Linovelib2Epub, TargetSite

if __name__ == '__main__':
    bookId = 251
    # Raw string so the backslashes in the Windows path are not read as escape sequences.
    browserPath = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
    linovelib_epub = Linovelib2Epub(book_id=bookId, target_site=TargetSite.MASIRO, browser_path=browserPath,
                                    log_level='DEBUG')
    linovelib_epub.run()

Update the code and try again.

LiarOnce commented 5 months ago

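Sorry for the late reply. The current problem is that, after crawling for a while, Masiro still asks for Cloudflare verification. The crawler log gives no indication of this, so every page fetched after the site starts requiring verification comes back blank.

I suggest supporting temporary storage of crawl results.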

wdpm commented 5 months ago

Masiro still asks for Cloudflare verification after crawling for a while

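Yes. The Masiro crawler did not previously handle this case, which made crawling fragile. There are a few possible solutions:
Option 1: Store crawl results as they arrive. Save each page locally as soon as it is fetched, and keep retrying until the content of every page has been obtained.
Option 2: Add a check for Cloudflare verification while fetching pages. If verification is requested, complete the challenge and then retry that page.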
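
The first option could look roughly like the sketch below. It is only an illustration under stated assumptions: `crawl_with_checkpoint`, `fetch_page` (a callable that returns the page content, or `None` on failure) and the JSON cache file are hypothetical, and the real crawler is asynchronous whereas this sketch is synchronous for brevity.

```python
import json
from pathlib import Path


def crawl_with_checkpoint(urls, fetch_page, cache_file, max_rounds=3):
    """Fetch every URL, writing progress to cache_file after each page."""
    cache_path = Path(cache_file)
    # Resume from an earlier run if a checkpoint already exists.
    if cache_path.exists():
        pages = json.loads(cache_path.read_text(encoding="utf-8"))
    else:
        pages = {}
    for _ in range(max_rounds):
        pending = [url for url in urls if url not in pages]
        if not pending:
            break
        for url in pending:
            content = fetch_page(url)  # hypothetical: returns None on failure
            if content is not None:
                pages[url] = content
                # Persist immediately so an interruption loses at most one page.
                cache_path.write_text(
                    json.dumps(pages, ensure_ascii=False), encoding="utf-8"
                )
    return pages
```

Running the function again with the same `cache_file` skips pages that were already saved, so a run interrupted by verification can continue where it stopped.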

The second option has now been implemented.

Code location: linovelib2epub.spider.masiro_spider.MasiroSpider._download_page
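
The general shape of the second option can be sketched as follows. This is not the actual `_download_page` implementation: `fetch` and `solve_challenge` are hypothetical callables, the sketch is synchronous, and treating a page without the `nvl-content` container (the class name seen in the traceback earlier in this thread) as a verification page is a heuristic assumption.

```python
def looks_like_challenge(html, content_marker="nvl-content"):
    # Heuristic (an assumption): a page without the content container is
    # treated as a verification page rather than a chapter.
    return content_marker not in html


def download_page(url, fetch, solve_challenge, max_retries=3):
    """Return the page HTML, solving a challenge and retrying when one appears."""
    for _ in range(max_retries):
        html = fetch(url)
        if not looks_like_challenge(html):
            return html
        # The site asked for verification: handle it, then retry the same URL.
        solve_challenge(url)
    raise RuntimeError(f"still blocked after {max_retries} attempts: {url}")
```

Raising after `max_retries` keeps a persistent block from silently producing blank chapters, which is the symptom reported in this issue.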


If you have any other questions, feel free to ask.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 0 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been stalled for 0 days with no activity.