kitUIN / PicImageSearch

Aggregator for Reverse Image Search API
https://pic-image-search.kituin.fun/
MIT License

Can a base_url be added for ASCII2D? #115

Closed · Container-Zero closed this 2 months ago

Container-Zero commented 3 months ago

Something like:

google = GoogleSync(proxies=proxies, base_url=base_url)
resp = google.search(url=url)

GoogleSync already lets you pick a mirror through a custom base_url, and I'd like ascii2d to support the same. The goal is to run a self-hosted ascii2d reverse proxy in a controlled environment and permanently sidestep Cloudflare's crawler detection, once and for all.
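For illustration, the requested call would mirror the Google one above (Ascii2DSync is named by analogy with GoogleSync; the base_url parameter is the addition this issue asks for, and the mirror URL is a placeholder):

# Hypothetical: base_url is the requested parameter, not the current API.
ascii2d = Ascii2DSync(proxies=proxies, base_url="https://my-ascii2d-mirror.example.com")
resp = ascii2d.search(url=url)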

Container-Zero commented 3 months ago

A follow-up request: I'd like all the other modules to get this parameter as well. It can work around network problems or enable load balancing. SauceNAO, for example, enforces per-IP daily quotas and rate limits (even with a token); deploying several mirror sites lets you spread load across them and offer something like an open public query API.
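The rotation itself could stay trivial; a sketch assuming base_url exists (the mirror URLs and the SauceNAOSync name are hypothetical, by analogy with GoogleSync):

from itertools import cycle

# Hypothetical mirror pool; base_url support is what this issue requests.
mirrors = cycle([
    "https://saucenao-mirror-a.example.com",
    "https://saucenao-mirror-b.example.com",
])

def next_client(api_key: str):
    # Each call binds the next mirror, spreading requests across proxies.
    return SauceNAOSync(api_key=api_key, base_url=next(mirrors))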

Container-Zero commented 3 months ago

One more thing I should mention: my motivation for opening this issue is that I believe ASCII2D requests are currently being blocked by Cloudflare, though on reflection I can't fully confirm that. Here is the exact error I'm getting from ASCII2D:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/main.py", line 83, in ascii2d
    resp = await ascii2d.search(url=url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/ascii2d.py", line 65, in search
    return Ascii2DResponse(resp.text, resp.url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/model/ascii2d.py", line 144, in __init__
    data = PyQuery(fromstring(resp_text, parser=utf8_parser))
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 873, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 761, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

The ParserError only says the response body was empty, so I can't yet confirm this is caused by a Cloudflare 403, but I tried the following code:

url="https://ascii2d.net/search/url/http://5b0988e595225.cdn.sohucs.com/images/20200109/74e33947a41248839725d6c8d54540e4.jpeg"
headers= {'User-Agent': 'PostmanRuntime/7.29.0'}
payload = {}
scraper = cloudscraper.create_scraper()
response1 = scraper.get(url, headers=headers, data = payload)
response2 = requests.request("GET", url, headers=headers, data = payload)

Right now both response1 and response2 return 403.
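One way to gather more evidence is to inspect the 403 responses themselves: Cloudflare-fronted responses normally carry a cf-ray header and Server: cloudflare. A small check along those lines:

# Heuristic: Cloudflare-served responses usually expose these headers.
for name, resp in (("cloudscraper", response1), ("requests", response2)):
    print(name, resp.status_code,
          resp.headers.get("server"),   # typically "cloudflare"
          resp.headers.get("cf-ray"))   # set on Cloudflare-served responses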

kitUIN commented 3 months ago

I can confirm the 403 is happening at the moment.

wlt233 commented 3 months ago

So the entire ascii2d interface is unusable because of Cloudflare's WAF? Have you considered pulling in selenium as a workaround?

kitUIN commented 3 months ago

Still thinking about a solution 🤔

wlt233 commented 3 months ago

Did some testing: Cloudflare seems to check the TLS fingerprint. You could use curl_cffi to impersonate a browser request:

from curl_cffi import requests
r = requests.get("https://ascii2d.net/search/url/" + url, impersonate="chrome101")

ref: How to issue a web request to simulate browser (Namely the TLS handshake / client hello?)

NekoAria commented 2 months ago

> Did some testing: Cloudflare seems to check the TLS fingerprint. You could use curl_cffi to impersonate a browser request:
>
> from curl_cffi import requests
> r = requests.get("https://ascii2d.net/search/url/" + url, impersonate="chrome101")
>
> ref: How to issue a web request to simulate browser (Namely the TLS handshake / client hello?)

That approach would pull in an extra dependency and require a matching refactor, so I don't plan to adopt it. selenium is far too heavy, so that's even less of an option.

That said, whether this gets triggered depends on the network environment; I haven't run into it for a long time.

Adding base_url to every module is a plan I can accept.

Container-Zero commented 2 months ago


> That approach would pull in an extra dependency and require a matching refactor, so I don't plan to adopt it. selenium is far too heavy, so that's even less of an option.
>
> That said, whether this gets triggered depends on the network environment; I haven't run into it for a long time.
>
> Adding base_url to every module is a plan I can accept.

One small convention request: if base_url is adopted, I hope the final implementation normalizes it to a bare origin with no route, e.g. https://www.baidu.com rather than https://www.baidu.com/route. Google currently needs a /search route attached; that's harmless in itself, but it would feel odd if each module ended up expecting a differently shaped base_url.
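In other words, each module would own its route internally; a sketch of the idea (class and attribute names are illustrative, not the library's actual code):

class Google:
    def __init__(self, base_url: str = "https://www.google.com", **kwargs):
        # Accept a bare origin and let the module append its own route,
        # so every module's base_url has the same shape.
        self.search_url = f"{base_url.rstrip('/')}/search"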

NekoAria commented 2 months ago

> One small convention request: if base_url is adopted, I hope the final implementation normalizes it to a bare origin with no route, e.g. https://www.baidu.com rather than https://www.baidu.com/route. Google currently needs a /search route attached; that's harmless in itself, but it would feel odd if each module ended up expecting a differently shaped base_url.

That's fine.

Container-Zero commented 2 months ago

> Did some testing: Cloudflare seems to check the TLS fingerprint. You could use curl_cffi to impersonate a browser request:
>
> from curl_cffi import requests
> r = requests.get("https://ascii2d.net/search/url/" + url, impersonate="chrome101")
>
> ref: How to issue a web request to simulate browser (Namely the TLS handshake / client hello?)

I gave it a try: integrating curl_cffi is quite easy. It's largely drop-in compatible with the usual request patterns, and changing four lines in network.py is enough. I won't open a PR, so here's the code:

from collections import namedtuple
from types import TracebackType
from typing import Any, Dict, Optional, Type, Union

# from httpx import AsyncClient, QueryParams
from httpx import QueryParams
from curl_cffi.requests import AsyncSession as AsyncClient  # alias curl_cffi's AsyncSession so existing code keeps working

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/99.0.4844.82 Safari/537.36"
    )
}
RESP = namedtuple("RESP", ["text", "url", "status_code"])

class Network:
    """Manages HTTP client for network operations.

    Attributes:
        internal: Indicates if the object manages its own client lifecycle.
        cookies: Dictionary of parsed cookies, provided in string format upon initialization.
        client: Instance of an HTTP client.
    """

    def __init__(
        self,
        internal: bool = False,
        proxies: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        cookies: Optional[str] = None,
        timeout: float = 30,
        verify_ssl: bool = True,
    ):
        """Initializes Network with configuration for HTTP requests.

        Args:
            internal: If True, Network manages its own HTTP client lifecycle.
            proxies: Proxy settings for the HTTP client.
            headers: Custom headers for the HTTP client.
            cookies: Cookies in string format for the HTTP client.
            timeout: Timeout duration for the HTTP client.
            verify_ssl: If True, verifies SSL certificates.
        """
        self.internal: bool = internal
        headers = {**DEFAULT_HEADERS, **headers} if headers else DEFAULT_HEADERS
        self.cookies: Dict[str, str] = {}
        if cookies:
            for line in cookies.split(";"):
                key, value = line.strip().split("=", 1)
                self.cookies[key] = value

        self.client: AsyncClient = AsyncClient(
            headers=headers,
            cookies=self.cookies,
            verify=verify_ssl,
            proxies=proxies,
            timeout=timeout,
            # follow_redirects=True,
            allow_redirects=True,  # curl_cffi uses the requests-style name
            impersonate="chrome120"  # mimic Chrome's TLS/JA3 fingerprint
        )

    def start(self) -> AsyncClient:
        """Initializes and returns the HTTP client.

        Returns:
            AsyncClient: Initialized HTTP client for network operations.
        """
        return self.client

    async def close(self) -> None:
        """Closes the HTTP client session if managed internally."""
        # await self.client.aclose()
        await self.client.close()  # curl_cffi's AsyncSession uses close(), not aclose()

    async def __aenter__(self) -> AsyncClient:
        """Async context manager entry for initializing or returning the HTTP client.

        Returns:
            AsyncClient: The HTTP client instance.
        """
        return self.client

    async def __aexit__(
        self,
        exc_type: Optional[Type[BaseException]] = None,
        exc_val: Optional[BaseException] = None,
        exc_tb: Optional[TracebackType] = None,
    ) -> None:
        """Async context manager exit for closing the HTTP client if managed internally."""
        # await self.client.aclose() 
        await self.client.close()  # curl_cffi's AsyncSession uses close(), not aclose()

# the rest of the file stays unchanged

But much as I expected, changing the JA3 fingerprint still doesn't stop the 403s on my side. I've hit JA3-based blocking before, but ascii2d's protection seems trickier than expected (I also tried supplying cf_clearance, with no luck; for a stable fix you might need a headless browser to get around it... bypassing a black box is a real pain). The only thing I've confirmed so far is that pointing base_url at a reverse-proxy site solves the 403 problem with absolute certainty.
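For reference, the reverse-proxy side can be a tiny pass-through; a minimal sketch, assuming the proxy host itself is not challenged by Cloudflare (the code and names are illustrative, not part of this repo):

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
UPSTREAM = "https://ascii2d.net"

@app.api_route("/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request) -> Response:
    # Forward the request to ascii2d and relay the body and status back.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        upstream = await client.request(
            request.method,
            f"{UPSTREAM}/{path}",
            params=dict(request.query_params),
            content=await request.body(),
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )

Point base_url at wherever this proxy is hosted and the client never talks to Cloudflare directly.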