libcpr / cpr

C++ Requests: Curl for People, a spiritual port of Python Requests.

SetUrl() does not update new URLs properly within same Session() object #1018

Open kareemrt opened 5 months ago

kareemrt commented 5 months ago


I am writing a web-scraper library with libcpr that cycles random proxies and headers on each GET request. The intended behavior is for other programs to call force_connect() to perform different GET requests while some state persists between function calls (e.g., proxy and browser header variables).

A single GET request to a URL (e.g., URL 1) works correctly, and so do multiple GET requests to the same URL (URL 1). However, if I perform a GET request to URL 1 and then another GET request to a new URL (e.g., URL 2), the second Session.Get() call returns a response from URL 1 instead of URL 2.

This behavior can be verified with the last line of code (cout << url << " " << r.url << endl;). This prints both the url passed to the function, and the url used in the GET request.

This behavior remains whether I re-use a session object, create a new session object (i.e., remove 'static'), or omit the session and call cpr::Get() directly to obtain a Response (though under the hood these seem similar, as cpr::Get() appears to go through a Session).

My program uses many static variables because I want to maintain allocated memory between force_connect() calls; even if I remove static calls and re-declare variables, I encounter the same issue.

Several lines in the code below are commented out; these are potential solutions I tried without success.

I am unsure why I am encountering this behavior; when I print 'session.GetFullRequestUrl()', it prints the PROPER url (URL 2) which is even stranger (it means part of the session object is updating and part of it is not).

Example/How to Reproduce

string force_connect(string url, int tries){

    // ... (IP, header vars, objects defined, other extraneous code here)
    static thread_local string proxy = "socks5h://" + info.creds + '@' + IP + ":1080";

    // Initialize Session object, set URL
    static thread_local cpr::Session session;
    // session.SetUrl(url);      // Commented out: using string url instead of cpr::Url in SetUrl()
    // session.SetOption(url);
    static thread_local cpr::Url CURL;
    CURL = cpr::Url{url};

    // Assign proxy and headers to Session object
    static thread_local cpr::Header header;
    header = cpr::Header{{"user-agent", hdr}};
    session.SetProxies({{"http", proxy}, {"https", proxy}});

    // Perform GET request
    // cout << session.GetFullRequestUrl() << endl;   // This updates properly and prints the PROPER url (i.e., URL 2)
    // static thread_local cpr::Response r = cpr::Get(cpr::Url{url}, header, cpr::Proxies{{"http", proxy}, {"https", proxy}});
    static thread_local cpr::Response r = session.Get();
    cout << url << " " << r.url << endl;              // url SHOULD be = r.url, but r.url is not updating (i.e., URL 1)
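One thing that may be worth ruling out, independent of cpr: in C++ the initializer of a static (or static thread_local) local variable runs only the first time control passes through the declaration on a given thread. A response declared and initialized in one statement would therefore keep its first value on later calls, whereas the declare-then-assign pattern used for CURL and header above refreshes on every call. The sketch below shows the difference with a hypothetical fetch() stand-in rather than a real cpr call; all function names are made up for illustration, and whether this explains the behavior reported here is an assumption, not a confirmed diagnosis.

```cpp
#include <string>

// Hypothetical stand-in for a network call; it just echoes the URL it was given.
std::string fetch(const std::string& url) {
    return "response from " + url;
}

// Initializer of a static local runs only on the first call (per thread),
// so the URL passed on later calls is ignored.
std::string get_static_init(const std::string& url) {
    static thread_local std::string r = fetch(url);
    return r;
}

// Declaring first and assigning on every call keeps the static storage
// but refreshes the value each time.
std::string get_assign_each_call(const std::string& url) {
    static thread_local std::string r;
    r = fetch(url);
    return r;
}
```

Calling get_static_init() with URL 1 and then URL 2 returns the URL 1 result both times, while get_assign_each_call() follows the argument.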


Possible Fix

cpr::Session::SetUrl(const Url& url); takes a passed cpr::Url object and sets the private parameter 'url_' to the reference.

It sets correctly initially (that's how it reaches URL 1), but refuses to update when the same object (or an entirely new one) is passed. Even when a new session and/or cpr::Url object is created, I still encounter this behavior.

Looking into the Session.Get() code, it appears the underlying call is to curl_easy_perform(), which reads the URL from a libcurl option (curl_easy_setopt(curl, CURLOPT_URL, url.c_str())) that was set in Session::prepareCommon().

I don't know why Session.url_ is not updating; maybe it is, and something is wrong in libcurl's code. (I can't check with a debugger because this library is meant for my main program, which is written in Python, and the class member is private.)

Either a modification-check or a copy-by-value approach could be potential solutions.

Where did you get it from?

Other (specify in "Additional Context/Your Environment")

Additional Context/Your Environment

COM8 commented 5 months ago

@kareemrt thanks for reporting! Based on a quick test, this looks like a multithreading issue. Perhaps not everything is declared thread_local, or there is an issue with how we create curl objects. Could you compare the pointer held by session.GetCurlHolder() in each thread and check whether they are actually different?

For reference, the following works in a single-threaded scenario:
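A rough sketch of that kind of check, using a hypothetical FakeSession stand-in instead of cpr::Session so it compiles on its own: each thread gets its own static thread_local session, and the address of the handle it owns is compared across threads. With cpr, the compared value would presumably be the raw pointer inside the holder returned by session.GetCurlHolder(); that mapping is an assumption here, and every name in the sketch is made up for illustration.

```cpp
#include <memory>
#include <thread>

// Hypothetical stand-in for a session that owns a handle through a shared_ptr.
struct FakeSession {
    std::shared_ptr<int> holder = std::make_shared<int>(0);
    const void* handle_address() const { return holder.get(); }
};

// Returns the address of the handle owned by the calling thread's session.
const void* current_thread_handle() {
    static thread_local FakeSession session;
    return session.handle_address();
}

// Compares this thread's handle with the one a second thread sees.
// The caller's session is created first and stays alive during the check,
// so the worker's allocation cannot reuse the same address.
bool threads_have_distinct_handles() {
    const void* mine = current_thread_handle();
    const void* theirs = nullptr;
    std::thread worker([&theirs] { theirs = current_thread_handle(); });
    worker.join();
    return mine != theirs;
}
```

If the two addresses turned out to be equal, the handle would be shared between threads, which is the situation this check is meant to detect.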

TEST(SessionGetTests, GetMultipleTimes1) {
    Url url{server->GetBaseUrl() + "/hello.html"};
    Session session;
    session.SetUrl(url);
    std::string expected_text{"Hello world!"};

    Response response = session.Get();
    EXPECT_EQ(expected_text, response.text);
    EXPECT_EQ(url, response.url);
    EXPECT_EQ(std::string{"text/html"}, response.header["content-type"]);
    EXPECT_EQ(200, response.status_code);
    EXPECT_EQ(ErrorCode::OK, response.error.code);

    Url url2{server->GetBaseUrl() + "/url_post.html"};
    session.SetUrl(url2);
    session.SetPayload({{"x", "5"}});
    std::string expected_text2{
            "{\n"
            "  \"x\": 5\n"
            "}"};

    response = session.Post();
    EXPECT_EQ(expected_text2, response.text);
    EXPECT_EQ(url2, response.url);
    EXPECT_EQ(std::string{"application/json"}, response.header["content-type"]);
    EXPECT_EQ(201, response.status_code);
    EXPECT_EQ(ErrorCode::OK, response.error.code);
}