Closed: Container-Zero closed this issue 2 months ago
An additional request: I hope the other interfaces can accept this parameter as well. It would help work around network problems and enable load balancing. For example, saucenao itself limits the number and rate of requests per IP per day (even with a token); by setting up multiple mirror sites you can spread the load across them and build something like an open public search API.
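A sketch of the load-balancing idea, assuming each module accepts a base_url; the mirror hostnames here are made up, and round-robin rotation is just one simple policy among many:

```python
from itertools import cycle

# Hypothetical mirror pool for a rate-limited upstream such as saucenao.
MIRRORS = cycle([
    "https://mirror-a.example.com",
    "https://mirror-b.example.com",
    "https://mirror-c.example.com",
])

def next_base_url() -> str:
    """Round-robin over the mirror pool so per-IP daily quotas and
    rate limits are spread across several proxy sites."""
    return next(MIRRORS)

print(next_base_url())  # https://mirror-a.example.com
print(next_base_url())  # https://mirror-b.example.com
```

Each search call would then pass `next_base_url()` as its base_url instead of the hard-coded upstream host.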
By the way, let me explain my motivation for opening this issue: I suspect ASCII2D is currently being blocked by Cloudflare, but on reflection I can't fully confirm that. Here is the exact ASCII2D error I'm seeing:
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/main.py", line 83, in ascii2d
    resp = await ascii2d.search(url=url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/ascii2d.py", line 65, in search
    return Ascii2DResponse(resp.text, resp.url)
  File "/usr/local/lib/python3.9/dist-packages/PicImageSearch/model/ascii2d.py", line 144, in __init__
    data = PyQuery(fromstring(resp_text, parser=utf8_parser))
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 873, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.9/dist-packages/lxml/html/__init__.py", line 761, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty
The symptoms aren't distinctive enough for me to confirm that a Cloudflare 403 is the cause, but I tried the following code:
import cloudscraper
import requests

url = "https://ascii2d.net/search/url/http://5b0988e595225.cdn.sohucs.com/images/20200109/74e33947a41248839725d6c8d54540e4.jpeg"
headers = {"User-Agent": "PostmanRuntime/7.29.0"}
payload = {}

scraper = cloudscraper.create_scraper()
response1 = scraper.get(url, headers=headers, data=payload)
response2 = requests.request("GET", url, headers=headers, data=payload)
At the moment both response1 and response2 return 403, so the 403 state definitely exists.
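One heuristic for telling a Cloudflare block apart from an origin 403 is to inspect the response headers; a minimal sketch, assuming the headers Cloudflare commonly sets (`cf-ray`, `Server: cloudflare`) are present on WAF responses, with made-up header values for illustration:

```python
def looks_like_cloudflare_block(status_code: int, headers: dict) -> bool:
    """Heuristic: a 403 whose response carries Cloudflare's own headers
    (cf-ray, Server: cloudflare) most likely came from the WAF,
    not from the origin application."""
    h = {k.lower(): v for k, v in headers.items()}
    return status_code == 403 and (
        "cf-ray" in h or h.get("server", "").lower() == "cloudflare"
    )

# Hypothetical captured header dicts:
print(looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "7d1-NRT"}))  # True
print(looks_like_cloudflare_block(403, {"Server": "nginx"}))                            # False
```

Running this against `response1.status_code` / `response1.headers` would give a stronger hint than the status code alone.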
So because of the CF WAF, the whole ascii2d interface is unusable now? Have you considered bringing in selenium as a workaround?
Still thinking about a solution 🤔
I tested a bit; CF seems to check the TLS fingerprint. You could consider using curl_cffi to impersonate a browser request:

from curl_cffi import requests
r = requests.get("https://ascii2d.net/search/url/" + url, impersonate="chrome101")

ref: How to issue a web request to simulate browser (Namely the TLS handshake / client hello?)
That approach would require an extra dependency plus a matching refactor, so I don't plan to adopt it. selenium is even heavier, so that's out of the question.
That said, whether this gets triggered depends on the network environment; I haven't run into it for a long time.
Adding base_url to all modules is an acceptable approach.
One small consistency wish: if base_url is going to be used, I hope the final implementation standardizes it as a bare domain with no route attached, e.g. https://www.baidu.com rather than https://www.baidu.com/route. Right now Google needs a /search route appended; that's harmless, of course, but it would look odd if each module ended up taking base_url input in inconsistent shapes.
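The convention above could be enforced with a small helper; a sketch (the function name is mine, not from the library):

```python
from urllib.parse import urlparse, urlunparse

def normalize_base_url(base_url: str) -> str:
    """Reduce a base_url to scheme + host only, dropping any route,
    query, or fragment, so every module receives the same shape."""
    parts = urlparse(base_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {base_url!r}")
    return urlunparse((parts.scheme, parts.netloc, "", "", "", ""))

print(normalize_base_url("https://www.baidu.com/route"))  # https://www.baidu.com
```

Any fixed route a module needs (like Google's /search) would then be appended internally rather than supplied by the caller.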
No problem with that.
I gave it a try; integrating curl_cffi is actually quite simple. It's basically backward-compatible with the usual request styles, and changing 4 lines of network.py is enough. I won't open a PR, so here's the code:
from collections import namedtuple
from types import TracebackType
from typing import Any, Dict, Optional, Type, Union

# from httpx import AsyncClient, QueryParams
from httpx import QueryParams
from curl_cffi.requests import AsyncSession as AsyncClient  # import curl_cffi's AsyncSession, aliased for compatibility with the existing code

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/99.0.4844.82 Safari/537.36"
    )
}
RESP = namedtuple("RESP", ["text", "url", "status_code"])


class Network:
    """Manages HTTP client for network operations.

    Attributes:
        internal: Indicates if the object manages its own client lifecycle.
        cookies: Dictionary of parsed cookies, provided in string format upon initialization.
        client: Instance of an HTTP client.
    """

    def __init__(
        self,
        internal: bool = False,
        proxies: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        cookies: Optional[str] = None,
        timeout: float = 30,
        verify_ssl: bool = True,
    ):
        """Initializes Network with configuration for HTTP requests.

        Args:
            internal: If True, Network manages its own HTTP client lifecycle.
            proxies: Proxy settings for the HTTP client.
            headers: Custom headers for the HTTP client.
            cookies: Cookies in string format for the HTTP client.
            timeout: Timeout duration for the HTTP client.
            verify_ssl: If True, verifies SSL certificates.
        """
        self.internal: bool = internal
        headers = {**DEFAULT_HEADERS, **headers} if headers else DEFAULT_HEADERS
        self.cookies: Dict[str, str] = {}
        if cookies:
            for line in cookies.split(";"):
                key, value = line.strip().split("=", 1)
                self.cookies[key] = value

        self.client: AsyncClient = AsyncClient(
            headers=headers,
            cookies=self.cookies,
            verify=verify_ssl,
            proxies=proxies,
            timeout=timeout,
            # follow_redirects=True,
            allow_redirects=True,  # changed to the requests-style parameter name
            impersonate="chrome120",  # impersonate Chrome's TLS fingerprint
        )

    def start(self) -> AsyncClient:
        """Initializes and returns the HTTP client.

        Returns:
            AsyncClient: Initialized HTTP client for network operations.
        """
        return self.client

    async def close(self) -> None:
        """Closes the HTTP client session if managed internally."""
        # await self.client.aclose()
        await self.client.close()  # changed to the requests-style method name

    async def __aenter__(self) -> AsyncClient:
        """Async context manager entry for initializing or returning the HTTP client.

        Returns:
            AsyncClient: The HTTP client instance.
        """
        return self.client

    async def __aexit__(
        self,
        exc_type: Optional[Type[BaseException]] = None,
        exc_val: Optional[BaseException] = None,
        exc_tb: Optional[TracebackType] = None,
    ) -> None:
        """Async context manager exit for closing the HTTP client if managed internally."""
        # await self.client.aclose()
        await self.client.close()  # changed to the requests-style method name

# the rest of the file stays unchanged
But much as I expected, changing the ja3 fingerprint still can't stop the 403 from occurring on my end. I've run into ja3-based blocking before, but ascii2d's protection seems trickier than expected (I tried passing cf_clearance too, and that didn't work either; maybe a stable bypass would require a headless browser...? Working around a black box is really painful). The only thing confirmed so far is that pointing base_url at a reverse-proxy site absolutely, 100% fixes the 403 problem.
Similar to how google can select a mirror source via a custom base_url, I hope ascii2d can too. The goal is to reverse-proxy ascii2d through a self-hosted site in a safe network environment, permanently avoiding CF's bot detection once and for all.
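The request above comes down to making the endpoint host swappable; a sketch where the helper function and the mirror hostname are both hypothetical, purely to illustrate how a configurable base_url would route traffic through a self-hosted reverse proxy instead of ascii2d.net:

```python
from urllib.parse import quote

# Hypothetical self-hosted reverse proxy sitting in front of ascii2d.net.
ASCII2D_BASE_URL = "https://ascii2d.example-mirror.com"

def build_url_search(base_url: str, image_url: str) -> str:
    """Build ascii2d's /search/url/<image_url> endpoint against an
    arbitrary base_url, so swapping mirrors needs no other code change."""
    return f"{base_url.rstrip('/')}/search/url/{quote(image_url, safe='')}"

print(build_url_search(ASCII2D_BASE_URL, "http://example.com/a.jpeg"))
# https://ascii2d.example-mirror.com/search/url/http%3A%2F%2Fexample.com%2Fa.jpeg
```

With this shape, the CF-protected hostname never appears in client code, and the proxy site can live wherever the WAF doesn't interfere.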