Closed: stanislav-milchev closed this issue 8 months ago
What specific error have you encountered? The URL is also needed to reproduce the issue. If you cannot share it in public, please send me an email or DM me on Telegram.
It's no URL in particular. It happens during web crawling of https://www.thebay.com/ on random requests. The error encountered is curl error 18 (transfer closed with outstanding read data remaining).
Does this thread help you?
I've tested the HTTP/1.1 solution, which doesn't work, but then again the implementation I have with curl-impersonate works with the same settings as the curl_cffi one (unless there's something else set in the background that I'm missing).
Did you try removing the Content-Length header? Please elaborate on both of your settings, otherwise we are in a guessing game.
Yes, I remove the Content-Length, since otherwise the content and the Content-Length header would mismatch and cause a Starlette/Uvicorn error. The "settings" are up there in my thread; nothing out of the ordinary with the curl options. I have set a timeout, allow_redirects, and a proxy (occasionally; that gets controlled from Scrapy).
Also, is there a way to get the content of the requests without it being decompressed? This is what I have to do with my responses every time:
import gzip
import zlib
import brotli

# re-compress the (already decompressed) body so it matches the
# Content-Encoding header copied from the upstream response
if content_encoding := response.headers.get('content-encoding', None):
    if content_encoding.lower() == 'gzip':
        body = gzip.compress(body)
    elif content_encoding.lower() == 'deflate':
        body = zlib.compress(body)
    elif content_encoding.lower() == 'br':
        body = brotli.compress(body)
    # fix the content-length to match the (re-compressed) body
    response.headers['content-length'] = str(len(body))
By settings, I mean headers you set in curl_cffi and curl-impersonate.
Are you getting errors in your Starlette app or in the client connecting to your Starlette app?
I'm not sure why you would recompress the content; this is almost always the reverse proxy's (e.g. nginx's) job. Do not do that in an application server.
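As an illustration of that advice (this is not code from the thread): assuming the handler ends up with a curl_cffi response in a variable named upstream, one way to avoid recompressing is to drop the upstream Content-Encoding and Content-Length headers, return the already decompressed body, and let Starlette compute the length while a reverse proxy handles compression. The helper name and header filtering below are illustrative only.

from starlette.responses import Response

# illustrative sketch only: "upstream" stands for the curl_cffi response object
def forward_response(upstream) -> Response:
    # curl_cffi has already decompressed upstream.content, so the origin's
    # encoding/length headers no longer describe the body being returned
    headers = {
        k: v for k, v in upstream.headers.items()
        if k.lower() not in ("content-encoding", "content-length", "transfer-encoding")
    }
    # Starlette sets Content-Length from the body it is given
    return Response(content=upstream.content, status_code=upstream.status_code, headers=headers)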
Do I need to not send a Cookie header, and instead transform it and plug it in as a cookies param in the AsyncSession request, given that it is currently sent in the headers?
No need to do that.
I'm still not clear about how you passed your request information from the Starlette handler to curl_cffi, which I think is the key to your problem.
from starlette.requests import Request
from starlette.exceptions import HTTPException

# DEFAULT_BROWSER_HEADERS and DEFAULT_BROWSER are constants defined elsewhere in the module

async def curlify(request: Request):
    # remove default browser-simulated headers from the incoming request headers
    request_headers = {k: v for k, v in request.headers.items() if k not in DEFAULT_BROWSER_HEADERS}
    # get the target url from the headers
    if not (req_url := request_headers.pop("tls-url", None)):
        raise HTTPException(detail='"tls-url" header is missing!', status_code=400)
    # get the browser version from the headers, defaulting to latest Chrome
    browser = request_headers.pop("tls-browser", DEFAULT_BROWSER)
    # get timeout and redirect behaviour
    timeout = float(request_headers.pop('tls-timeout', '60'))
    allow_redirects = request_headers.pop('tls-allowredirect', 'false') == 'true'
    proxy = request_headers.pop('tls-proxy', None)
    # get the request body
    if request.method == "POST":
        request_body = await request.json()
    else:
        request_body = None
    # forwarding this header causes 400 Bad Request responses, so drop it
    request_headers.pop('content-length', None)
This is the code that's missing before the AsyncSession call (up here in the original post). I have been testing all morning and I think the issue is gone, although nothing seems to have changed. Maybe it was a weird site issue happening for a few days.
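For completeness, the AsyncSession call that consumes these values is only in the original post and not quoted in this thread; the following is a rough sketch of how the extracted values would typically be passed to curl_cffi, using the variable names from the handler above. The actual call in the original post may differ.

from curl_cffi.requests import AsyncSession

# hypothetical continuation of curlify(); the original post's call may differ
async with AsyncSession() as session:
    upstream = await session.request(
        request.method,
        req_url,
        headers=request_headers,
        json=request_body,
        timeout=timeout,
        allow_redirects=allow_redirects,
        proxies={"http": proxy, "https": proxy} if proxy else None,
        impersonate=browser,  # e.g. "chrome120"
    )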
Describe the bug
Hello,
I have set up a Starlette webserver that accepts a request and takes information from the headers to make a curl request. I started with the original curl-impersonate, but then you added more versions and I switched back to your fork of curl-impersonate (and another version to compare with curl_cffi as well). When trying to scrape a certain website (one of the few I tested both implementations with), I get a curl 18 error with curl_cffi and no issues with the other implementation. I've checked other issues, one of which suggested that it was fixed when the OP switched to HTTP/1.1; however, I played around with all the HTTP version options for curl_cffi and nothing fixed the issue. Shouldn't both work the same? I've compared fingerprints and they are pretty similar for the chrome120 version.
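For context on the HTTP version experiments mentioned above, this is a hedged sketch of one way to pin libcurl to HTTP/1.1 through curl_cffi's low-level API; the URL and impersonation target are taken from this thread, and the exact option and enum names should be checked against your curl_cffi version.

from io import BytesIO
from curl_cffi import Curl
from curl_cffi.const import CurlOpt, CurlHttpVersion

buffer = BytesIO()
c = Curl()
c.setopt(CurlOpt.URL, b"https://www.thebay.com/")
c.setopt(CurlOpt.WRITEDATA, buffer)
# force HTTP/1.1 instead of the default negotiation, to check whether curl error 18 persists
c.setopt(CurlOpt.HTTP_VERSION, CurlHttpVersion.V1_1)
c.impersonate("chrome120")
c.perform()
c.close()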
To Reproduce
For curl-impersonate:
This is the function that gets executed by the code above:
For the curl-cffi implementation:
Versions
pip freeze dump