Unable to correctly retrieve all domains

Yunxi-awa commented 5 months ago

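Only the 18comic-cn.vip site was retrieved, and that one is blocked in my region. In fact there are also 18comic-c.art and 18comic-c.xyz.

This is the official source repository for the JM (禁漫天堂) publish page: https://github.com/jmcmomic/jmcmomic.github.io/blob/main/go/304.html The HTML stored there holds the latest domains.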
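
To illustrate the idea of reading domains out of that HTML, here is a minimal sketch. It is not how jmcomic parses the page: the regular expression, the `keyword` filter, and the function name `extract_domains` are all assumptions made for this example, and the real page may lay out its domains differently.

```python
import re

# Hypothetical pattern for a bare hostname such as "18comic-c.art".
# The real publish page may format its domains differently.
DOMAIN_RE = re.compile(r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b', re.IGNORECASE)


def extract_domains(html, keyword='18comic'):
    """Return the set of hostnames found in `html` that contain `keyword`."""
    found = set()
    for match in DOMAIN_RE.finditer(html):
        host = match.group(0).lower()
        # Keep only hostnames that look like JM mirrors (assumed naming).
        if keyword in host:
            found.add(host)
    return found
```

The function only works on text you already have, so it can be fed the contents of the 304.html file fetched by any HTTP client.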

hect0x7 commented 5 months ago

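Which method are you using? get_html_domain_all fetches all domains by visiting the JM publish page. I know about the official repository; its result should be the same as get_html_domain_all.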

Yunxi-awa commented 5 months ago

I ran jmclt.get_html_domain_all(). The publish page is probably blocked here:

Traceback (most recent call last):
  File "D:\Pycharm\plugins\python-ce\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "E:\Python\3.11.6\Lib\site-packages\jmcomic\jm_client_interface.py", line 476, in get_html_domain_all
    return JmModuleConfig.get_html_domain_all(postman or self.get_root_postman())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Python\3.11.6\Lib\site-packages\common\util\decorator_util.py", line 63, in func_exec
    attr = func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "E:\Python\3.11.6\Lib\site-packages\jmcomic\jm_config.py", line 251, in get_html_domain_all
    resp = postman.get(cls.JM_PUB_URL)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Python\3.11.6\Lib\site-packages\common\postman\postman_api.py", line 125, in get
    return self.__get__()(url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Python\3.11.6\Lib\site-packages\curl_cffi\requests\__init__.py", line 92, in request
    return s.request(
           ^^^^^^^^^^
  File "E:\Python\3.11.6\Lib\site-packages\curl_cffi\requests\session.py", line 699, in request
    raise RequestsError(str(e), e.code, rsp) from e
curl_cffi.requests.errors.RequestsError: Failed to perform, ErrCode: 35, Reason: 'BoringSSL SSL_connect: Connection was reset in connection to jmcomic.ltd:443 '. This may be a libcurl error, See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

hect0x7 commented 5 months ago

Yes, it does look like it is blocked.

Yunxi-awa commented 5 months ago

Scraping the source on GitHub should avoid the problem, since GitHub is usually reachable.
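
One way to combine the two sources is a simple fallback: try the publish page first, and if the request fails (as in the traceback above), try GitHub. The sketch below assumes nothing about jmcomic itself. `first_successful` is a hypothetical helper, and the callables passed to it are placeholders for whatever fetchers you have.

```python
def first_successful(sources):
    """Call each zero-argument callable in order and return the first
    non-empty result. Raise RuntimeError if every source fails."""
    errors = []
    for source in sources:
        try:
            result = source()
        except Exception as exc:  # e.g. a connection reset on a blocked host
            errors.append(exc)
            continue
        if result:
            return result
    raise RuntimeError(f'all {len(sources)} domain sources failed: {errors}')
```

For example, `first_successful([lambda: jmclt.get_html_domain_all(), fetch_from_github])`, where `fetch_from_github` stands for a GitHub-based fetcher that you supply.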

hect0x7 commented 5 months ago

The next version will add the ability to fetch domains via GitHub. The code is as follows:

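class JmModuleConfig:
    @classmethod
    def get_html_domain_all_via_github(cls,
                                       postman=None,
                                       template='https://jmcmomic.github.io/go/{}.html',
                                       index_range=(300, 309)
                                       ):
        """Collect JM domains from the publish pages hosted on GitHub."""
        if postman is None:
            # postman.get() is called below, so a client must be supplied.
            raise ValueError('postman is required')

        # Imported lazily, as in the original snippet.
        from .jm_toolkit import JmcomicText
        from common import multi_thread_launcher

        domain_set = set()

        def fetch_domain(url):
            # Do not follow redirects; parse the response body as returned.
            resp = postman.get(url, allow_redirects=False)
            for domain in JmcomicText.analyse_jm_pub_html(resp.text):
                # Skip jm365.work entries.
                if domain.startswith('jm365.work'):
                    continue
                domain_set.add(domain)

        # Fetch every page; range(*index_range) covers 300 through 308.
        multi_thread_launcher(
            iter_objs=[template.format(i) for i in range(*index_range)],
            apply_each_obj_func=fetch_domain,
        )

        return domain_set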
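
For readers unfamiliar with the `template` and `index_range` parameters, the standalone sketch below shows which URLs they expand to and how the `jm365.work` filter behaves. The helper names `build_page_urls` and `drop_excluded` are made up for this illustration and are not part of jmcomic.

```python
def build_page_urls(template='https://jmcmomic.github.io/go/{}.html',
                    index_range=(300, 309)):
    """Expand the template once per index; the end of the range is exclusive."""
    return [template.format(i) for i in range(*index_range)]


def drop_excluded(domains, excluded_prefix='jm365.work'):
    """Mirror the filter in the snippet: skip domains with the given prefix."""
    return {d for d in domains if not d.startswith(excluded_prefix)}
```

With the default arguments this yields nine URLs, from `.../go/300.html` through `.../go/308.html`, so the 304.html page mentioned earlier in the thread is included.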

Yunxi-awa commented 5 months ago

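Thanks to the author, I can keep working on my database plugin (although it has already evolved into a standalone project built on the jmcomic library).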

hect0x7 / JMComic-Crawler-Python

Unable to correctly retrieve all domains #211