Closed: mukundt closed this issue 3 years ago.
This weird behavior has been reproduced. See the comment thread on Stack Overflow.
It seems someone (if not everyone) on the requests team made this happen on purpose. It also happened in my post.
The only logical explanation is that requests does not want to send headers in order, so that someone can sell a product or code that will detect requests very easily. Or they just do that for the government.
Hi.
If the return status is 200, that means your Python code and the request URL have no errors.
Your request is perhaps a cross-domain request; you need to grant the web server permission for this domain, or it will trigger the browser's same-origin policy.
Keyword: CORS
Can you print the response headers? print(response_raw.headers)
I suspect it has to do with 'requests' not being able to decode content encoded in 'br'
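If that's the cause, a quick check along these lines should confirm it (a minimal sketch; the URL is a placeholder for the one you're having trouble with, and it assumes the optional Brotli package is installed):
import brotli  # pip install Brotli
import requests

# Placeholder URL; substitute the request that misbehaves for you.
response_raw = requests.get("https://example.com/", headers={"Accept-Encoding": "br"})

print(response_raw.headers.get("Content-Encoding"))

if response_raw.headers.get("Content-Encoding") == "br":
    try:
        # newer urllib3 builds decode Brotli transparently
        text = response_raw.content.decode("utf-8")
    except UnicodeDecodeError:
        # older installs leave the body compressed, so decompress it by hand
        text = brotli.decompress(response_raw.content).decode("utf-8")
    print(text[:500])
If it does turn out to be br, an easier workaround is simply not advertising br in the Accept-Encoding request header, so the server falls back to gzip.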
I am having the same issue: I can access the GET API through Postman and the browser, but when trying with the requests module I get the error below:
Error Message: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Python 3.7.4, requests==2.22.0
Using the code below:
response = requests.get('url')
print(response)
** Checked that Postman is not setting any request headers.
I'm encountering this issue as well with requests.__version__ == '2.22.0' and Python 3.7.4. This is the corresponding minimal cURL request:
curl 'https://hearst.referrals.selectminds.com/ajax/content/landingpage_job_results?JobSearch.id=5917052&page_index=5&uid=173' \
-X POST \
--compressed \
-H 'Cookie: ORA_OTSS_SESSION_ID=073b162852b14d23b7ba84488fcec034b6b7d6896e3b77d291f35ddc54c743fb.hearst.chprapu11411.tee.taleocloud.net; JSESSIONID=B621ADEBAC95CA9D65C83B4B51220248.TC_832943_832940' \
-H 'tss-token: 3hs5eXc7xtNpNz50t8iJx2twIhPZeht/t1npR5q1CSo='
I've attempted all means of passing the Cookie header (e.g., passing it in as a dictionary, using a Session and its internal CookieJar object, maintaining my own CookieJar instance, etc.), and it seems as though requests isn't passing the headers along correctly.
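Roughly, the session-based attempt looks like this (a sketch, not my exact code; the cookie and token values are the ones from the curl command above and have presumably expired by now):
import requests

url = (
    "https://hearst.referrals.selectminds.com/ajax/content/"
    "landingpage_job_results?JobSearch.id=5917052&page_index=5&uid=173"
)
headers = {"tss-token": "3hs5eXc7xtNpNz50t8iJx2twIhPZeht/t1npR5q1CSo="}
cookies = {
    "ORA_OTSS_SESSION_ID": "073b162852b14d23b7ba84488fcec034b6b7d6896e3b77d291f35ddc54c743fb.hearst.chprapu11411.tee.taleocloud.net",
    "JSESSIONID": "B621ADEBAC95CA9D65C83B4B51220248.TC_832943_832940",
}

with requests.Session() as session:
    # the same POST the curl command makes, with the cookies in the session jar
    resp = session.post(url, headers=headers, cookies=cookies)
    print(resp.status_code, resp.headers.get("Content-Type"))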
Also, hello fellow CMU '17 alum @mukundt—I think we took a few classes together 😄
Any update on that? I have the same issue: my request was working previously, then it stopped, and I can curl the same request.
I see this issue as well. Please let us know its status.
If any logs or info are needed, please don't hesitate to ask.
Any further progress on this? I've got the same issue (wget, curl, and Python scripts fail, but the browser works) as described in the Stack Overflow thread linked above.
All request headers appear to be the same (the browser reports a half-dozen were used, and I copy/pasted their text values into the wget and curl options, but no luck; the requests module had the same result).
Remember the ghosts in Matrix 2? "We are starting to get annoyed." "Yes, we are." 😐
@Azhrei Is this something you can share publicly? I would love to take on the challenge to solve it for you.
Wow, I'd be happy to have another set of eyeballs looking at this. (Mine are getting a bit bloodshot at this point. 😉)
The public URL is https://www.sunpass.com/ (the Florida electronic toll system) and that URL redirects to https://www.sunpass.com/en/home/index.shtml
I've tried multiple approaches; this is the Bash script I've been using to try to replicate the request the browser makes:
#!/bin/bash
curl -q \
-v \
-o x.html \
-i \
-k \
-D x.headers \
--cookie-jar x.cookies \
-H "Connection: keep-alive" \
-H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
-H "Accept-Language: en-US" \
-H "Accept-Encoding: br,gzip,deflate" \
-A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15" \
https://www.sunpass.com/en/home/index.shtml
exit
wget -O x.html \
--rejected-log=/dev/stdout \
--save-headers \
--load-cookies x.cookies \
--save-cookies x.cookies \
--keep-session-cookies \
--header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
--header="Accept-Language: en-US" \
--header="Accept-Encoding: br,gzip,deflate" \
-U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15" \
https://www.sunpass.com/en/home/index.shtml
I've been working with curl lately, but comment out those first lines (including the exit) and it'll test with wget instead.
I'm out of ideas. My next step is going to be to install a proxy server (Squid, probably) so that I can see everything. (I've tried using tcpdump/wireshark, but the connection is HTTPS so I can't see anything. I'm not sure Squid will help for that reason; can I make a TLS connection to Squid, then have Squid make a TLS connection to the destination, such that I can see the unencrypted data as it passes through the proxy?)
Thanks for taking a look!
Edit: I should've included the Python code that does the same thing. I'm actually using requests_html as a front-end to requests because the page contains JavaScript. So the Python code is essentially:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(URL, headers={...}, timeout=10)
r.html.render()
URL and the headers are the same as above, in the script.
Hello! Did you find a solution to this problem?
I have stalled on this; no progress since I made the above post. Right now, I'm manually traversing the target web site and downloading the data I need myself, but automating this task is still high on my todo list, so any progress others are making is still of great interest to me!
I succeeded in getting content from their web server. BUT! They have a clever human-verification step built into their website, so when I scrape the content via script I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.
If anyone experiences an issue with a different website, I would love to help.
Attached is the page I get from their website with your command: Response.txt
Yep, that's the same stuff I get.
Any idea what their "robot detection" is looking for? I thought it might've been something to do with the timing of requests, but after some tests, I don't think that's it. What has stumped me is that they return that bogus data on the first request; it's not like they ping the browser and ask for it to run some JS or something...
What code do you have that returns the captcha request? Or is that the page referenced by the src attribute in the returned data...?
So, I followed your lead and went down the rabbit hole.
The page that the script refers to is always delivered gzip-compressed (apparently). It's ~24K that uncompresses into 169K of "hidden" JS code.
That script is executed and generates another chunk of JS that is ~108K.
Then that code (which is full of text written using \x hex escapes) is down to ~74K once I've replaced the hex-encoded text with straight ASCII.
This code further contains what appears to be base64-encoded strings that are again decoded... The variable and function names in this code are all obfuscated by converting them to hash strings.
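For anyone following along, the kind of de-obfuscation pass I mean looks roughly like this (just a sketch; the file name and the base64 regex are illustrative, not taken from their actual script):
import base64
import re

# "obfuscated.js" is a stand-in for the downloaded script.
with open("obfuscated.js", encoding="utf-8") as f:
    js = f.read()

# Pass 1: replace \xNN hex escapes with the characters they encode.
js = re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), js)

# Pass 2: decode long base64-looking string literals; keep the original on failure.
def try_b64(match):
    try:
        return repr(base64.b64decode(match.group(1)).decode("utf-8"))
    except Exception:
        return match.group(0)

js = re.sub(r"'([A-Za-z0-9+/]{20,}={0,2})'", try_b64, js)

with open("deobfuscated.js", "w", encoding="utf-8") as f:
    f.write(js)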
This is where I'm at now. I'll report again when I've made some progress on the functional blocks of this code.
@Azhrei It sounds like the website is not making content available unless JavaScript execution is available. Have you tried using Selenium to drive a real browser instead of an HTTP client?
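(For reference, a minimal Python sketch of that approach, assuming Chrome and a matching chromedriver are installed; the URL is the one discussed in this thread:)
from selenium import webdriver

driver = webdriver.Chrome()  # needs a matching chromedriver on the PATH
try:
    driver.get("https://www.sunpass.com/en/home/index.shtml")
    html = driver.page_source  # the DOM after the browser has executed the page's JS
    print(html[:500])
finally:
    driver.quit()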
Yes, that's clearly what they're attempting. No idea why, but whatever...
I mentioned above that I'm not actually using requests, but requests_html. This is a front-end module that downloads the Chromium JS engine on first use so that JS can be rendered in the page. I've used it on a couple of test URLs and it seems to work well, so I've got the JS part worked out. 🙂
Continuing my last post, I've hit a snag. When I create a page and put my "resolved" JS into it, it gets into some kind of weird loop and blows the JS stack in the browser. I'm guessing it's because I'm loading it from a file:// URL, so the function that's trying to set a cookie is failing in a weird way. Clearly, there is still some work to do...
I had the same issue. What worked for me was to replace requests.post with requests.request:
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text.encode('utf8'))
Yes, that's clearly what they're attempting. No idea why, but whatever...
To stop bots; not everyone wants to be crawled.
It’s a ridiculously complex approach, though. Especially given that the page being discussed is the home landing page and the useful (and extensive) data is available only after authenticating.
But it’s Florida, y’know. We build our web sites the way we handle our voting processes — badly. 🙄🥺
I still haven’t figured out how it prevents bots either. If I render the page using a JS engine, how is a crawler being prevented...? No time to dig into it more right now, though...
I succeeded in getting content from their web server. BUT! They have a clever human-verification step built into their website, so when I scrape the content via script I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.
Attached is the page I get from their website with your command: Response.txt
Yep, that's what I see as well.
Can you describe what you figured out and/or how you were able to get the page content? As I mentioned above, my next step is going to be to proxy the site somehow, since I don't see any difference in web traffic between what my browser does and what my script does?!?
Thanks.
Okay, hope this helps (yes, I am impacted by this problem as well, but I don't have spare time from my regular job to help).
With regard to the test URL provided above (www.sunpass.com), here's something that might help.
This site is hosted behind Incapsula's cloud WAF, and they have a few bot-protection functions, including injecting reCAPTCHAs. They also have a bot/automation-detection mechanism that uses malformed cookies containing a CRLF (most browsers handle this fine; most automation tools, e.g. Python/Java/etc., break on this cookie).
Don't know if this will help in any way.
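If anyone wants to check for that, here's a sketch that dumps the raw response headers over a plain TLS socket, so a stray CR/LF inside a Set-Cookie shows up exactly as the server sent it (host and path are the ones from the script earlier in the thread):
import socket
import ssl

ctx = ssl.create_default_context()
with socket.create_connection(("www.sunpass.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.sunpass.com") as tls:
        tls.sendall(
            b"GET /en/home/index.shtml HTTP/1.1\r\n"
            b"Host: www.sunpass.com\r\n"
            b"User-Agent: Mozilla/5.0\r\n"
            b"Connection: close\r\n\r\n"
        )
        raw = b""
        while True:
            chunk = tls.recv(4096)
            if not chunk:
                break
            raw += chunk

# repr() makes any embedded \r or \n in the header block explicit.
print(repr(raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")))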
Thanks, I’ll look into that (next week, when I’ll have some time).
I went down the rabbit hole trying to decode their silly JavaScript obfuscation and found some weird stuff, but something like a malformed cookie didn't occur to me. I'll also do some web searches on them to see what else I can find (I saw their name in the initial response, but my Google-fu must've been weak when I looked for them).
Thank you!
I had a similar issue. Got around it with urllib3.
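(Roughly along these lines; just a sketch, with the URL and User-Agent taken from earlier in this thread rather than my actual target:)
import urllib3

http = urllib3.PoolManager()
resp = http.request(
    "GET",
    "https://www.sunpass.com/en/home/index.shtml",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    },
)
print(resp.status)
print(resp.data[:500])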
Really? Wow, that'd be too easy, but I'll give it a shot as well. It couldn't hurt. 🤷♂️
Convert the response into JSON:
resp = requests.post(....)
print(resp.json())
I succeeded in getting content from their web server. BUT! They have a clever human-verification step built into their website, so when I scrape the content via script I'm presented with a reCAPTCHA instead of the actual data. I don't think there is a way around this. This has nothing to do with curl, or for that matter with any scripting language or scraper; it's purely related to the website in question.
If anyone experiences an issue with a different website, I would love to help.
Attached is the page I get from their website with your command: Response.txt
Hi Aron,
Could you please help? I am facing the same issue.
Thanks
Show me your work, and I'll try to take it from there. What are you trying to do? (Which website?)
Hi there CC: @Aron2560
I have the same issue. It can be easily tested on https://www.ozon.ru:
It works perfectly via curl:
curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36" 'https://www.ozon.ru'
Response: the long page content.
But it doesn't work via code (I use Apache HttpClient Fluent, Java):
Request.get("https://www.ozon.ru")
.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36")
.execute()
.returnContent()
.toString()
Response is:
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=30&xinfo=2-1431547-0%200NNN%20RT%281591636163716%2088%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B12%2811%2c348807%2c0%29%20U18&incident_id=580000270002950602-5855638190294722&edet=12&cinfo=0b000000&rpinfo=0" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 580000270002950602-5855638190294722</iframe></body></html>
It looks really strange. I tried with exactly the same headers that curl sends, but it also doesn't work.
Request.get("https://www.ozon.ru")
.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36")
.addHeader(HttpHeaders.ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
.addHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate")
.addHeader(HttpHeaders.ACCEPT_LANGUAGE, "en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7")
.addHeader(HttpHeaders.CONNECTION, "keep-alive")
.addHeader("Upgrade-Insecure-Requests", "1")
.execute()
.returnContent()
.toString()
I'm confused and exhausted. Any ideas from you guys?
Kind regards, Alexey
Hello @AlexeyFreelancer, I got mixed results from the curl command you posted. Running the very same command, I got the "robots" page on the first run and the long content on the second run. The robot detectors nowadays are becoming smarter by the day, and I'm not in pursuit of outsmarting them. My specialty is to debug and find errors where something is not working as it should. In your case, everything works as it should; the only thing is that sometimes you're being detected as a robot, while other times you slip through the cracks.
There are several unrelated questions going on in this thread. To answer the original question: you're being sent the server's robots.txt because they've detected you're crawling and are instructing you to stop. This is not a defect in Requests.
I have a POST request that works perfectly with both Postman and cURL (it returns a JSON blob of data). However, when I perform the exact same request with Python's Requests library, I get a 200 success response, but instead of my JSON blob, I get this:
I've used HTTP request bins to verify that the request (headers and payload) from Postman/cURL is exactly the same as the one from Python Requests.
Here is my Postman request in cURL:
...and here is my Python code:
I have verified that the payload and headers are correct and valid. I don't think it's a cookie or redirect issue, since I've disabled both of those params with Postman/cURL and everything still works fine. I'm stymied as to how the host server is somehow able to tell the difference between two seemingly identical HTTP requests...
Any help would be much appreciated; thanks!