lexiforest / curl_cffi

Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
https://curl-cffi.readthedocs.io/
MIT License

curl-impersonate vs curl-cffi [BUG] #275

Closed: stanislav-milchev closed this issue 8 months ago

stanislav-milchev commented 8 months ago

Describe the bug

Hello,

I have set up a Starlette web server that accepts a request and takes information from the headers to make a curl request. I started with the original curl-impersonate, but then you added more versions and I switched to your fork of curl-impersonate (and built another version on curl-cffi to compare with it as well). When trying to scrape a certain website (one of the few I tested both implementations with) I get a curl 18 error with curl-cffi and no issues with the other implementation. I've checked other issues, one of which suggested the problem was fixed when the OP switched to HTTP/1.1, but I played around with all the HTTP version options in curl-cffi and nothing fixed the issue.

Shouldn't both work the same? I've compared fingerprints and they are pretty similar for the chrome120 version.

To Reproduce

For curl-impersonate:

# above - code that takes info from the request headers and appends 
# the curl commands to a list like ['curl_chrome120', '--compressed', '-i', '-L', website_url, etc...]

output = await asyncio.get_event_loop().run_in_executor(
    None,
    run_subprocess,
    curl_command
)

# below - code that splits the response into headers and body, compresses the body, and returns it as a Starlette response

This is the function that gets executed by the code above:

async def run_subprocess(command: list):
    '''
    Function that runs the curl command through asyncio
    '''
    process = await asyncio.create_subprocess_exec(
        *command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    stdout, stderr = await process.communicate()
    return process.returncode, stdout, stderr
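As posted, the snippet above hands an async def coroutine function to run_in_executor, which yields an un-awaited coroutine object rather than actually running the subprocess; presumably the real code awaits it directly. A minimal self-contained sketch of that, with a hypothetical command list standing in for the one built from the request headers:

import asyncio

async def run_subprocess(command: list):
    '''
    Run the curl-impersonate command and capture its output.
    '''
    process = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    return process.returncode, stdout, stderr

async def main():
    # hypothetical command list, mirroring the one described above
    curl_command = ["curl_chrome120", "--compressed", "-i", "-L", "https://example.com"]
    returncode, stdout, stderr = await run_subprocess(curl_command)
    print(returncode, len(stdout), stderr.decode(errors="replace")[:200])

asyncio.run(main())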

For the curl-cffi implementation:

# pick up params from the request headers from above
async with AsyncSession(max_clients=MAX_SESSION_CONNECTIONS) as s:
    try:
        response = await s.request(
            method=request.method,
            url=req_url,
            impersonate=BrowserType(browser).value,
            proxy=proxy,
            headers=request_headers,
            json=request_body,
            timeout=timeout,
            allow_redirects=allow_redirects,
        )

        # raise if status code not in [200, 400)
        if not response.ok:
            raise ResponseError()
    # format response, handle errors, and raise or return below

Versions

perklet commented 8 months ago

What specific error have you encountered? The URL is also needed to reproduce the issue. If you cannot share it in public, please send me an email or DM me on Telegram.

stanislav-milchev commented 8 months ago

What specific error have you encountered? The URL is also needed to reproduce the issue. If you cannot share it in public, please send me an email or DM me on Telegram.

It's no URL in particular. It happens during web crawling of https://www.thebay.com/ on random requests. The error encountered is the curl (18) error (transfer closed with outstanding read data remaining).

perklet commented 8 months ago

Does this thread help you?

stanislav-milchev commented 8 months ago

I've tested the HTTP1 solution, which doesn't work, but then again the implementation I have with curl-impersonate works with the same settings as the cffi one (unless there's something else set in the background that I'm missing).
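The "HTTP1 solution" referenced here presumably means pinning the request to HTTP/1.1. A minimal sketch of what that might look like in curl_cffi, assuming the installed version exposes the http_version parameter and the CurlHttpVersion enum (the URL is just the site mentioned above):

import asyncio
from curl_cffi import CurlHttpVersion
from curl_cffi.requests import AsyncSession

async def fetch_http11(url: str):
    async with AsyncSession() as s:
        # force HTTP/1.1 instead of negotiating HTTP/2
        return await s.get(
            url,
            impersonate="chrome120",
            http_version=CurlHttpVersion.V1_1,
        )

response = asyncio.run(fetch_http11("https://www.thebay.com/"))
print(response.status_code)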

perklet commented 8 months ago

Did you try removing the Content-Length header? Please elaborate on both of your settings, otherwise we are in a guessing game.

stanislav-milchev commented 8 months ago

Yes, I remove the Content-Length, since otherwise it would return content and a Content-Length header that mismatch and cause a starlette/uvicorn error. The "settings" are up there in my thread. Nothing out of the ordinary with setting curl options. I have set a timeout, allow_redirects, and a proxy (occasionally; that gets controlled from scrapy).

Also, is there a way to get the content of the requests without decompressing it? This is what I have to do with my responses every time:

if content_encoding := response.headers.get('content-encoding', None):
    if content_encoding.lower() == 'gzip':
        body = gzip.compress(body)
        content_length = str(len(body))
    elif content_encoding.lower() == 'deflate':
        body = zlib.compress(body)
        content_length = str(len(body))
    elif content_encoding.lower() == 'br':
        body = brotli.compress(body)
        content_length = str(len(body))
else:
    content_length = str(len(body))

# fix the content-length
response.headers['content-length'] = content_length

perklet commented 8 months ago

By settings, I mean headers you set in curl_cffi and curl-impersonate.

Are you getting errors in your starlette app or in the client connecting to your starlette app?

I'm not sure why you would recompress the content; this is almost always the reverse proxy's (e.g. nginx's) job. Do not do that in an application server.
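A minimal sketch of the alternative perklet suggests: hand back the already-decompressed body and drop the now-stale encoding headers, leaving compression to the reverse proxy (or Starlette's GZipMiddleware). The helper name and wiring below are illustrative, not from the thread:

from starlette.responses import Response

def to_starlette_response(upstream):
    """Convert a curl_cffi response to a Starlette response without recompressing.

    upstream.content is already decompressed by curl, so the origin's
    Content-Encoding and Content-Length headers no longer match the body
    and should not be forwarded.
    """
    headers = {
        k: v
        for k, v in upstream.headers.items()
        if k.lower() not in ("content-encoding", "content-length", "transfer-encoding")
    }
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        headers=headers,
    )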

stanislav-milchev commented 8 months ago

Should I avoid sending a Cookie header and instead transform it and plug it in as the cookies param of the AsyncSession request? It is currently sent in the headers.

perklet commented 8 months ago

No need to do that.

I'm still not clear on how you pass your request information from the starlette handler to curl_cffi, which I think is the key to your problem.

stanislav-milchev commented 8 months ago

No need to do that.

I'm still not clear on how you pass your request information from the starlette handler to curl_cffi, which I think is the key to your problem.

async def curlify(request: Request):
    # remove default browser simulated headers from the incoming request headers
    request_headers = {k: v for k, v in request.headers.items() if k not in DEFAULT_BROWSER_HEADERS}

    # get request url from headers
    if not (req_url := request_headers.pop("tls-url", None)):
        raise HTTPException(detail='"tls-url" header is missing!', status_code=400)

    # get version from headers, defaulting to Chrome latest
    browser = request_headers.pop("tls-browser", DEFAULT_BROWSER)

    # get timeout and allow_redirects
    timeout = float(request_headers.pop('tls-timeout', '60'))
    allow_redirects = request_headers.pop('tls-allowredirect', 'false') == 'true'

    proxy = request_headers.pop('tls-proxy', None)

    # get request body
    if request.method == "POST":
        request_body = await request.json()
    else:
        request_body = None

    # remove content-length, since forwarding it causes 400 Bad Request errors
    request_headers.pop('content-length', None)

This is the code that was missing before the AsyncSession part (up in the original post). I have been testing all morning and I think the issue is gone, though nothing seems to have changed. Maybe it was a weird site issue happening for a few days.
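Since perklet points at header forwarding as the likely culprit, one thing worth checking in a proxy like this is whether hop-by-hop headers from the incoming request get replayed against the target site. A rough sketch of an extra filter; the set below is illustrative, not from the thread:

# headers that describe the hop between the client and this proxy and should
# generally not be forwarded to the target site as-is
HOP_BY_HOP_HEADERS = {
    "host",
    "connection",
    "content-length",
    "accept-encoding",   # let curl_cffi set its own, matching the impersonated browser
    "keep-alive",
    "proxy-connection",
    "transfer-encoding",
    "upgrade",
}

def forwardable_headers(incoming: dict) -> dict:
    """Keep only headers that are safe to replay against the target site."""
    return {k: v for k, v in incoming.items() if k.lower() not in HOP_BY_HOP_HEADERS}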