libcpr / cpr

C++ Requests: Curl for People, a spiritual port of Python Requests.
https://docs.libcpr.org/

Redirects overwrite URL #911

Open crazydef opened 1 year ago

crazydef commented 1 year ago

Description

If a request causes a redirect, the final Response object's URL is that of the redirect, not the original URL requested by the application.

Example/How to Reproduce

In this simple example:

cpr::Response r = cpr::Get(cpr::Url{"http://osocorporation.com/"});

the URL in the response object holds the string https://osocorporation.com/. Nothing particularly special in this instance, but if the server performs a more complex redirect, possibly serving custom error pages, for example, the calling application has no way of knowing what the original request was.

Possible Fix

It would be beneficial if the response retained the original URL along with any redirects.

Maybe make the response hold a vector of URLs, where the first entry is the original request, and subsequent items are the redirects?
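As a rough illustration of that idea, the history could be a small wrapper around a vector. This is only a sketch: `UrlHistory` and all of its member names are hypothetical and not part of cpr.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type, not part of cpr: records every URL visited for one request.
class UrlHistory {
  public:
    // The history always starts with the URL the application asked for.
    explicit UrlHistory(std::string original) { urls_.push_back(std::move(original)); }

    // Append the target of one redirect hop.
    void AddRedirect(std::string url) { urls_.push_back(std::move(url)); }

    // URL originally requested by the application.
    const std::string& Original() const { return urls_.front(); }

    // URL that finally produced the response (same as Original() if nothing redirected).
    const std::string& Final() const { return urls_.back(); }

    // Number of redirect hops that were followed.
    std::size_t RedirectCount() const { return urls_.size() - 1; }

    // Full chain: the original request first, then each redirect in order.
    const std::vector<std::string>& All() const { return urls_; }

  private:
    std::vector<std::string> urls_;
};
```

Because the constructor requires the original URL, the vector is never empty and `Original()` and `Final()` are always valid.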

Where did you get it from?

GitHub (branch e.g. master)

Additional Context/Your Environment

COM8 commented 1 year ago

Hi @crazydef, thanks for reporting this. I see your point there. But based on my experience using libcurl, this is not really possible (as far as I'm aware).

A quick ChatGPT query produced the following example of how it could be solved. It is a crude approach rather than a polished one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

struct MemoryStruct {
  char *memory;
  size_t size;
};

static size_t WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp) {
  size_t realsize = size * nmemb;
  struct MemoryStruct *mem = (struct MemoryStruct *)userp;

  /* keep the old pointer so the buffer is not leaked if realloc fails */
  char *ptr = realloc(mem->memory, mem->size + realsize + 1);
  if(ptr == NULL) {
    fprintf(stderr, "not enough memory (realloc returned NULL)\n");
    return 0;
  }
  mem->memory = ptr;

  memcpy(&(mem->memory[mem->size]), contents, realsize);
  mem->size += realsize;
  mem->memory[mem->size] = 0;

  return realsize;
}

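static int DebugCallback(CURL *handle, curl_infotype type, char *data, size_t size, void *userp) {
  (void)handle;
  (void)userp;
  /* data is not null-terminated, so always bound reads and prints by size */
  if(type == CURLINFO_TEXT) {
    printf("== Info: %.*s", (int)size, data);
  }
  else if(type == CURLINFO_HEADER_IN && size > 9 && curl_strnequal(data, "Location:", 9)) {
    /* response headers arrive as CURLINFO_HEADER_IN, not CURLINFO_TEXT */
    printf("Redirected to:%.*s", (int)(size - 9), data + 9);
    // You may want to store these URLs in a linked list or other data structure here.
  }
  return 0;
}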

int main(void) {
  CURL *curl_handle;
  CURLcode res;

  struct MemoryStruct chunk;

  chunk.memory = malloc(1);  /* will be grown as needed by the realloc above */
  chunk.size = 0;    /* no data at this point */

  curl_global_init(CURL_GLOBAL_ALL);

  /* init the curl session */
  curl_handle = curl_easy_init();

  /* specify URL to get */
  curl_easy_setopt(curl_handle, CURLOPT_URL, "http://example.com");

  /* send all data to this function  */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);

  /* we pass our 'chunk' struct to the callback function */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&chunk);

  /* enable verbose for easier tracing */
  curl_easy_setopt(curl_handle, CURLOPT_VERBOSE, 1L);
  curl_easy_setopt(curl_handle, CURLOPT_DEBUGFUNCTION, DebugCallback);

  /* follow redirects */
  curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);

  /* get it! */
  res = curl_easy_perform(curl_handle);

  /* check for errors */
  if(res != CURLE_OK) {
    fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
  }

  /* cleanup curl stuff */
  curl_easy_cleanup(curl_handle);

  free(chunk.memory);

  /* we're done with libcurl, so clean it up */
  curl_global_cleanup();

  return 0;
}

For this feature we need some kind of "handler" that we can register inside curl and that triggers as soon as we get redirected. I will put this on the backlog.
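One candidate for such a handler is libcurl's `CURLOPT_HEADERFUNCTION` callback, which is invoked once per received header line, including the headers of intermediate responses when redirects are followed. The parsing step such a callback would need might look like the sketch below. `ExtractLocation` is a hypothetical helper, not an existing cpr or libcurl function, and it returns the raw header value, so a relative `Location` would still have to be resolved against the current URL.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: returns the value of a Location header line, if `data` is one.
// `data` does not need to be null-terminated; only `size` bytes are read.
std::optional<std::string> ExtractLocation(const char* data, std::size_t size) {
    static const std::string kName = "location:";
    if (size < kName.size()) {
        return std::nullopt;
    }
    // Header names are case-insensitive (HTTP/2 sends them in lowercase).
    for (std::size_t i = 0; i < kName.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(data[i])) != kName[i]) {
            return std::nullopt;
        }
    }
    std::size_t begin = kName.size();
    std::size_t end = size;
    // Skip optional whitespace after the colon.
    while (begin < end && (data[begin] == ' ' || data[begin] == '\t')) {
        ++begin;
    }
    // Drop the trailing CRLF and any trailing whitespace.
    while (end > begin && (data[end - 1] == '\r' || data[end - 1] == '\n' ||
                           data[end - 1] == ' ' || data[end - 1] == '\t')) {
        --end;
    }
    return std::string(data + begin, end - begin);
}
```

A header callback could call this on every line and append each returned value to a list of visited URLs.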

crazydef commented 1 year ago

To be honest, I don't know how useful it would be to know every redirect. I just mentioned that as a possible solution. At the very least though, making a copy of the original URL and keeping that alongside the final URL would probably be sufficient for 99.99% of use cases.
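Until something like this exists in the library, the simpler variant can be approximated on the caller's side. The sketch below is an assumption-laden illustration: `Tracked` and `FetchTracked` are hypothetical names, and the fetch step is passed in as a callable so the example does not depend on any particular HTTP library.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: pairs any response type with the URL that was requested.
template <typename Response>
struct Tracked {
    std::string requested_url;  // what the application asked for
    Response response;          // whatever the HTTP layer returned
};

// Copies `url` before calling `fetch`, so the original survives any redirect.
template <typename Fetch>
auto FetchTracked(const std::string& url, Fetch&& fetch) {
    using Response = std::decay_t<decltype(fetch(url))>;
    return Tracked<Response>{url, std::forward<Fetch>(fetch)(url)};
}
```

With cpr, `fetch` could be a lambda that calls `cpr::Get(cpr::Url{u})`; the final URL would then still be available through the response's `url` member, next to `requested_url`.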