psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

POST request works in Postman/cURL, but not in Requests #5003

Closed. mukundt closed this issue 3 years ago.

mukundt commented 5 years ago

I have a POST request that works perfectly with both Postman and cURL (it returns a JSON blob of data). However, when I perform the exact same request with Python's Requests library, I get a 200 success response, but instead of my JSON blob, I get this:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>

I've used HTTP request bins to verify that the request (headers and payload) from Postman/cURL is exactly the same as the one from Python Requests.

Here is my Postman request in cURL:

curl -X POST \
  https://someurl/bla/bla \
  -H 'Content-Type: application/json' \
  -H 'Referer: https://www.host.com/bla/bla/' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0' \
  -H 'cache-control: no-cache' \
  -d '{"json1":"blabla","etc":"etc"}'

...and here is my Python code:

import requests

url = "https://someurl/bla/bla"  # same placeholder endpoint as the cURL example

payload = {
    "json1": "blabla",
    "etc": "etc",
}

headers = {
    'Host': 'www.host.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.host.com/bla/bla/',
    'Content-Type': 'application/json',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Origin': 'https://www.host.com',
}

s = requests.Session()
response_raw = s.post(url, json=payload, headers=headers)
print(response_raw)
print(response_raw.text)

I have verified that the payload and headers are correct and valid. I don't think it's a cookies or redirect issue, since I've disabled both of those in Postman/cURL and everything still works fine. I'm stymied as to how the host server is somehow able to tell the difference between two seemingly identical HTTP requests...

Any help would be much appreciated; thanks!

mukundt commented 5 years ago

This weird behavior has been reproduced. See the comment thread on Stack Overflow.

alwaysnotworkingforme commented 5 years ago

It seems like someone (if not everyone) on the Requests team made this happen on purpose. It also happened with my POST.

The only logical explanation is that Requests doesn't want to handle headers in order, so that someone can sell a product or code that detects Requests very easily. Or they just do it for the government.

qq292 commented 5 years ago

Hi. If the return status is 200, that means your Python code and the request URL have no errors. Your request is perhaps a cross-domain request; you need to set the web server's permissions for that domain, or it will trigger the browser's same-origin policy. Keyword: CORS

Aron2560 commented 5 years ago

Can you print the response headers?

print(response_raw.headers)

I suspect it has to do with Requests not being able to decode content encoded with 'br'.
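
As a quick check, something like this would rule Brotli out (the URL is the placeholder from the original report, and dropping 'br' from Accept-Encoding is only a sketch, not a confirmed fix):

import requests

url = "https://someurl/bla/bla"  # placeholder endpoint from the original report

headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/json",
    # Advertise only encodings that requests always decodes out of the box,
    # so a 'br'-encoded body can't be the reason the text looks wrong.
    "Accept-Encoding": "gzip, deflate",
}

response_raw = requests.post(url, json={"json1": "blabla", "etc": "etc"}, headers=headers)
print(response_raw.status_code)
print(response_raw.headers.get("Content-Encoding"))  # how the body was compressed
print(response_raw.headers)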

dchoub commented 5 years ago

I am having the same issue: I am able to access the GET API through Postman and the browser, but when trying with the requests module I get the error below.

Error Message: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

Python 3.7.4, requests==2.22.0

I'm using the code below:

response = requests.get('url')
print(response)

Note: I checked that Postman is not setting any extra request headers.

jbaiad commented 5 years ago

I'm encountering this issue as well with requests.__version__ == '2.22.0' and python 3.7.4. This is the corresponding minimal cURL request:

curl 'https://hearst.referrals.selectminds.com/ajax/content/landingpage_job_results?JobSearch.id=5917052&page_index=5&uid=173' \
-X POST \
--compressed \
-H 'Cookie: ORA_OTSS_SESSION_ID=073b162852b14d23b7ba84488fcec034b6b7d6896e3b77d291f35ddc54c743fb.hearst.chprapu11411.tee.taleocloud.net; JSESSIONID=B621ADEBAC95CA9D65C83B4B51220248.TC_832943_832940' \
-H 'tss-token: 3hs5eXc7xtNpNz50t8iJx2twIhPZeht/t1npR5q1CSo=' 

I've attempted all means of passing the Cookie header (e.g., passing it in as a dictionary, using sessions and their internal CookieJar objects, maintaining my own CookieJar instance, etc.), and it seems as though requests isn't passing the headers along correctly.
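
Roughly what I tried, as a sketch (cookie and token values elided here; all three variants are supposed to end up as the same Cookie header on the wire):

import requests

url = (
    "https://hearst.referrals.selectminds.com/ajax/content/landingpage_job_results"
    "?JobSearch.id=5917052&page_index=5&uid=173"
)

# 1) Raw Cookie header, exactly as curl sends it.
r1 = requests.post(url, headers={
    "Cookie": "ORA_OTSS_SESSION_ID=...; JSESSIONID=...",
    "tss-token": "...",
})

# 2) The cookies= dict, letting requests build the Cookie header itself.
r2 = requests.post(
    url,
    cookies={"ORA_OTSS_SESSION_ID": "...", "JSESSIONID": "..."},
    headers={"tss-token": "..."},
)

# 3) A Session whose internal cookie jar carries the cookies.
s = requests.Session()
s.cookies.set("ORA_OTSS_SESSION_ID", "...", domain="hearst.referrals.selectminds.com")
s.cookies.set("JSESSIONID", "...", domain="hearst.referrals.selectminds.com")
r3 = s.post(url, headers={"tss-token": "..."})

# The prepared request shows exactly what went out on the wire.
print(r1.request.headers.get("Cookie"))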

Also, hello fellow CMU '17 alum @mukundt—I think we took a few classes together 😄

remiyazigi commented 5 years ago

Any update on this? I have the same issue: my request was working previously, then it stopped, and I can still cURL the same request.

pvvanilkumar commented 4 years ago

I'm seeing this issue as well. Please let us know the status of this.

If any logs or info are needed, please don't hesitate to ask.

Azhrei commented 4 years ago

Any other advancement on this? I've got the same issue (wget, curl, and Python scripts fail, but the browser works) as described in the Stack Overflow thread linked above.

All request headers appear to be the same (the browser reports a half-dozen were used, and I copy/pasted their text values into wget and curl options, but no luck; the requests module had the same result).

Remember the ghosts in Matrix 2? "We are started to get annoyed." "Yes, we are." 😐

Aron2560 commented 4 years ago

@Azhrei Is this something you can share publicly? I would love to take on the challenge to solve it for you.

Azhrei commented 4 years ago

Wow, I'd be happy to have another set of eyeballs looking at this. (Mine are getting a bit bloodshot at this point. 😉)

The public URL is https://www.sunpass.com/ (the Florida electronic toll system) and that URL redirects to https://www.sunpass.com/en/home/index.shtml

I've tried multiple approaches; this is the Bash script I've been using to try to replicate the request the browser makes:

#!/bin/bash

curl -q \
    -v \
    -o x.html \
    -i \
    -k \
    -D x.headers \
    --cookie-jar x.cookies \
    -H "Connection: keep-alive" \
    -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
    -H "Accept-Language: en-US" \
    -H "Accept-Encoding: br,gzip,deflate" \
    -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15" \
    https://www.sunpass.com/en/home/index.shtml

exit

wget -O x.html \
    --rejected-log=/dev/stdout \
    --save-headers \
    --load-cookies x.cookies \
    --save-cookies x.cookies \
    --keep-session-cookies \
    --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
    --header="Accept-Language: en-US" \
    --header="Accept-Encoding: br,gzip,deflate" \
    -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15" \
    https://www.sunpass.com/en/home/index.shtml

I've been working with curl lately, but comment out those first lines (including the exit) and it'll test with wget instead.

I'm out of ideas. My next step is going to be to install a proxy server (Squid, probably) so that I can see everything. (I've tried using tcpdump/Wireshark, but the connection is HTTPS, so I can't see anything. I'm not sure Squid will help for that reason; can I make a TLS connection to Squid, then have Squid make a TLS connection to the destination, such that I can see the unencrypted data as it passes through the proxy?)
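
Something like this is what I have in mind for the requests side, pointing it at a local intercepting proxy; the proxy address and CA-certificate path below are placeholders I haven't tested yet:

import requests

# Hypothetical local intercepting proxy (e.g. mitmproxy listening on 8080).
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

r = requests.get(
    "https://www.sunpass.com/en/home/index.shtml",
    headers={"User-Agent": "Mozilla/5.0"},  # shortened; the script above uses the full Safari UA string
    proxies=proxies,
    # Trust the proxy's CA so the intercepted TLS connection still validates;
    # the path is a placeholder for wherever the proxy writes its certificate.
    verify="/path/to/proxy-ca-cert.pem",
)
print(r.status_code, len(r.content))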

Thanks for taking a look!

Edit: I should've included the Python code that does the same thing. I'm actually using requests_html as a front-end to requests because the page contains JavaScript. So the Python code is essentially:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(URL, headers={...}, timeout=10)
r.html.render()

URL and the headers are the same as above, in the script.

eruslan44 commented 4 years ago

Hello! Did you find a solution to this problem?

Azhrei commented 4 years ago

I have stalled on this; no progress since I made the above post. Right now, I'm manually traversing the target web site and downloading the data I need myself, but automating this task is still high on my todo list, so any progress others are making is still of great interest to me!

Aron2560 commented 4 years ago

I succeeded in getting content from their web server. BUT! They have a clever human-verification mechanism built into their website, so when I scrape the content via script, I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.

If anyone experiences an issue with a different website, I would love to help.

Attached is the page I get from their website with your command: Response.txt

Azhrei commented 4 years ago

Yep, that's the same stuff I get.

Any idea what their "robot detection" is looking for? I thought it might've been something to do with the timing of requests, but after some tests, I don't think that's it. What has stumped me is that they return that bogus data on the first request; it's not like they ping the browser and ask for it to run some JS or something...

What code do you have that returns the captcha request? Or is that the page referenced by the src attribute in the returned data...?

Azhrei commented 4 years ago

So, I followed your lead and went down the rabbit hole.

The page that the script refers to is always delivered gzip-compressed (apparently). It's ~24K that uncompresses into 169K of "hidden" JS code.

That script is executed and generates another chunk of JS that is ~108K.

Then that code (which is full of text that is written using \x hex escapes) is down to ~74K when I've replaced the hex-encoded text with straight ASCII.

This code further contains what appear to be base64-encoded strings that are decoded again... The variable and function names in this code are all obfuscated by converting them to hash strings.

This is where I'm at now. I'll report again when I've made some progress on the functional blocks of this code.
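
For anyone following along, the unescaping step looks roughly like this; the filename is a placeholder for wherever I saved the generated second-stage script, and the base64 pass is only a first guess:

import base64
import re

# Placeholder filename for the locally saved second-stage script.
with open("incapsula_stage2.js", encoding="utf-8") as f:
    js = f.read()

# Replace \xNN escapes with the characters they encode.
readable = re.sub(
    r"\\x([0-9a-fA-F]{2})",
    lambda m: chr(int(m.group(1), 16)),
    js,
)

# Attempt to base64-decode any long base64-looking string literals;
# skip anything that doesn't decode cleanly.
for literal in re.findall(r'"([A-Za-z0-9+/=]{24,})"', readable):
    try:
        print(base64.b64decode(literal, validate=True)[:60])
    except Exception:
        continue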

sethmlarson commented 4 years ago

@Azhrei It sounds like the website is not making content available unless JavaScript execution is available. Have you tried using Selenium to drive a real browser instead of an HTTP client?
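
Something along these lines, as a sketch; it assumes a local chromedriver is installed, and note that some anti-bot systems also flag headless browsers:

from selenium import webdriver

# A real Chrome instance executes the obfuscated JS and picks up the
# anti-bot cookies, which a plain HTTP client never does.
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # optional
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.sunpass.com/en/home/index.shtml")
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()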

Azhrei commented 4 years ago

Yes, that's clearly what they're attempting. No idea why, but whatever...

I mentioned above that I'm not actually using requests, but requests_html. This is a front-end module that downloads the Chromium JS engine on first use so that JS can be rendered in the page. I've used it on a couple of test URLs and it seems to work well, so I've got the JS part worked out. 🙂

Continuing my last post, I've hit a snag. When I create a page and put my "resolved" JS into it, it gets into some kind of weird loop and blows the JS stack in the browser. I'm guessing it's because I'm loading it from a file:// URL so the function that's trying to set a cookie is failing in a weird way. Clearly, there is still some work to do...

sviazovska commented 4 years ago

I had the same issue; replacing requests.post with requests.request worked for me:

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text.encode('utf8'))

kbeauregard commented 4 years ago

Yes, that's clearly what they're attempting. No idea why, but whatever...

To stop bots. Not everyone wants to be crawled.

Azhrei commented 4 years ago

It’s a ridiculously complex approach, though. Especially given that the page being discussed is the home landing page and the useful (and extensive) data is available only after authenticating.

But it’s Florida, y’know. We build our web sites the way we handle our voting processes — badly. 🙄🥺

I still haven’t figured out how it prevents bots either. If I render the page using a JS engine, how is a crawler being prevented...? No time to dig into it more right now, though...

Azhrei commented 4 years ago

I succeeded in getting content from their web server. BUT! They have a clever human-verification mechanism built into their website, so when I scrape the content via script, I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.

Attached is the page I get from their website with your command: Response.txt

Yep, that's what I see as well.

Can you describe what you figured out and/or how you were able to get the page content? As I mentioned above, my next step is going to be to proxy the site somehow, as I don't see any difference in web traffic between what my browser does and what my script does?!?

Thanks.

RandallLasini commented 4 years ago

Okay, hope this helps (yes, I am impacted by this problem as well, but don't have spare time from my regular job to help).

With regard to the test URL provided above (www.sunpass.com), here's something that might help.

This site is hosted behind Incapsula cloud WAFs, and they have a few bot-protection functions, including injecting reCAPTCHAs. They also have a bot/automation detection mechanism based on malformed cookies containing a CRLF (most browsers handle this fine; most automation tools, e.g. Python/Java/etc., break on this cookie).

Don't know if this will help in any way.
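
One way to check for that from Python would be to dump the raw Set-Cookie headers before requests parses them; this is only a sketch, and whether the malformed cookie actually shows up this way is an assumption:

import requests

r = requests.get(
    "https://www.sunpass.com/en/home/index.shtml",
    headers={"User-Agent": "Mozilla/5.0"},
    stream=True,  # keep the underlying urllib3 response available
)

# r.raw.headers is urllib3's HTTPHeaderDict; getlist() returns each
# Set-Cookie header separately, and repr() makes an embedded CRLF visible.
for raw_cookie in r.raw.headers.getlist("Set-Cookie"):
    print(repr(raw_cookie))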

Azhrei commented 4 years ago

Thanks, I’ll look into that (next week, when I’ll have some time).

I went down the rabbit hole trying to decode their silly JavaScript obfuscation and found some weird stuff, but something like a malformed cookie didn't occur to me. I'll also do some web searches on them to see what else I can find (I saw their name in the initial response, but my Google-fu must've been weak when I looked for them).

Thank you!

kousthubasqi commented 4 years ago

I had a similar issue. I got around it by using urllib3 instead.
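
Roughly like this, as a sketch; the URL and payload are the placeholders from the original report, and there's no guarantee the WAF treats it differently:

import json
import urllib3

# Plain urllib3, bypassing requests' Session machinery entirely.
http = urllib3.PoolManager()
resp = http.request(
    "POST",
    "https://someurl/bla/bla",  # placeholder endpoint from the original report
    headers={"Content-Type": "application/json"},
    body=json.dumps({"json1": "blabla", "etc": "etc"}),
)
print(resp.status)
print(resp.data.decode("utf-8"))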

Azhrei commented 4 years ago

Really? Wow, that'd be too easy, but I'll give it a shot as well. It couldn't hurt. 🤷‍♂️

phani653 commented 4 years ago

Convert the response into JSON:

resp = requests.post(....)
print(resp.json())

Premalatha-Github commented 4 years ago

I succeeded in getting content from their web server. BUT! They have a clever human-verification mechanism built into their website, so when I scrape the content via script, I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.

If anyone experiences an issue with a different website, I would love to help.

Attached is the page I get from their website with your command: Response.txt

Hi Aron,

Could you please help? I am facing the same issue.

Thanks

Aron2560 commented 4 years ago

Hi Aron,

Could you please help? I am facing the same issue.

Thanks

Show me your work, and I'll try to take it from there. What are you trying to do? (Which website?)

AlexeyFreelancer commented 4 years ago

Hi there CC: @Aron2560

I have the same issue. It can easily be tested against https://www.ozon.ru:

It works perfectly via curl:

curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36" 'https://www.ozon.ru'

Response: a long block of content

But doesn't work via code (I use Apache HttpClient fluent, java):

Request.get("https://www.ozon.ru")
    .addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36")
    .execute()
    .returnContent()
    .toString()

Response is:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=30&xinfo=2-1431547-0%200NNN%20RT%281591636163716%2088%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B12%2811%2c348807%2c0%29%20U18&incident_id=580000270002950602-5855638190294722&edet=12&cinfo=0b000000&rpinfo=0" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 580000270002950602-5855638190294722</iframe></body></html>

It looks really strange. I tried with exactly the same headers that curl sends, but it also doesn't work.


Request.get("https://www.ozon.ru")
    .addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36")
    .addHeader(HttpHeaders.ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
    .addHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate")
    .addHeader(HttpHeaders.ACCEPT_LANGUAGE, "en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7")
    .addHeader(HttpHeaders.CONNECTION, "keep-alive")
    .addHeader("Upgrade-Insecure-Requests", "1")
    .execute()
    .returnContent()
    .toString()

I'm confused and exhausted. Any ideas from you guys?

Kind regards, Alexey

Aron2560 commented 4 years ago

Hello @AlexeyFreelancer, I got mixed results from the curl command you posted. Running the very same command, I got the robot-check page on the first run and the long content on the second run. The "robot" detectors nowadays are becoming smarter by the day, and I'm not in pursuit of outsmarting them. My specialty is debugging and finding errors where something is not working as it should. In your case, everything works as it should; the only thing is that sometimes you're detected as a robot, while other times you slip through the cracks.

nateprewitt commented 3 years ago

There are several unrelated questions going on in this thread. To answer the original question: you're being sent the server's robots.txt because they've detected you're crawling and are instructing you to stop. This is not a defect in Requests.