psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License
13.72k stars 977 forks

render() triggers website protections #275

Open DocToime opened 5 years ago

DocToime commented 5 years ago

Hi

I've just started working with requests-html, and render() seems to be triggering a website protection mechanism; I'm not sure why. The URL below loads a number of .js scripts containing the data I'm trying to access. In the first print(r.html.text) below, that data shows only "Loading..." because the scripts haven't run yet. After r.html.render(), the page rejects the request. Any advice on what is going wrong here, and how to work around it, would be very much appreciated. Code below:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://register.epo.org/application?number=EP16190441&lng=en&tab=federated")

print(r.html.text)

r.html.render()

print(r.html.text)

bisguzar commented 5 years ago

I can render the page without any issue/rejection. Here is my example: https://paste.ubuntu.com/p/ccRqzjBpSh/. Can you be more specific? Maybe it's related to your connection, for example your request rate.

DocToime commented 5 years ago

Thanks for looking at this. Actually, you have the same issue as me: the page you've rendered is the rejection page, not the actual page you get via a browser. You'll also see, if you print before calling render(), that you get the right web page, just without the data I'm looking for (marked as 'Loading...').

identei commented 5 years ago

This is a security feature implemented by the site you're trying to scrape in order to prevent bots. The rejection message could be based on a number of aspects that they're watching traffic for, such as user agent, speed of requests, etc.

One way around it would be to analyze the underlying network traffic on their site and hit the API directly, but respectfully. Either way, this is an issue with the site you're trying to scrape, not with this codebase.

DocToime commented 5 years ago

Hi Ecript

The strange thing, though, is that only the render() function triggers this. Presumably it isn't a request-rate issue, as it happens with just a single request. Does render() use a different user agent? Is it possible to set it explicitly?

identei commented 5 years ago

Ah, from what I can see in the documentation and source, there's no way to change it in the render() call itself; render() defaults to a Chrome user agent. Before calling render(), you can set a different user agent for the plain request via requests:

from requests_html import HTMLSession

session = HTMLSession()
headers = {'user-agent': 'my-app/0.0.1'}
r = session.get(url, headers=headers)

See http://docs.python-requests.org/en/master/user/quickstart/#custom-headers

But that doesn't solve your problem.

I would then recommend investigating the underlying network traffic that's going on when the javascript is rendered using your browser's dev tools. I've been kind of digging into that method. They're somehow identifying bots. Even just a simple requests get call triggers the denied message. This is going to be a touchy page to scrape.

In this case, looks like hitting the API url directly works, and multiple API urls are used for different parts of information on the page and then parsed out. Here's an example url: https://api.register.epo.org/v1/at/AP/EP16190441.js
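If it helps, the shape of that example URL suggests the endpoint can be composed from the application number. This is only a guess from the one URL above; the epo_api_url helper and the meaning of the "at" section code are assumptions, not a documented API:

```python
def epo_api_url(application_number: str, section: str = "at") -> str:
    # Hypothetical helper: the URL pattern is inferred from the single
    # example URL in this thread, not from any documented EPO API.
    return f"https://api.register.epo.org/v1/{section}/AP/{application_number}.js"

print(epo_api_url("EP16190441"))
```

Other parts of the page presumably use different section codes in place of "at"; the browser dev tools' network tab should show which ones.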

DocToime commented 5 years ago

Thanks Ecript. I suspect headers may be the issue with render(), since you get the same bot block with plain requests.get() unless you specify a 'real' User-Agent header.

Investigating the underlying network traffic (and then doing something about it) sounds a bit beyond my capabilities at the minute, but I'll see what I can find out. Unfortunately that example URL lacks the key status info I'm looking for (patent status), but I'll definitely look into it.

Do you think adding the ability to specify the render() user agent would be a worthwhile feature to request?

Thanks!

identei commented 5 years ago

I might be able to take a crack at it since I've already been reading over the source. I'll take a look and get back to you! Not 100% sure it will fix your issue, but it's worth a shot.

DocToime commented 5 years ago

Amazing thank you!

identei commented 5 years ago

Okay, I figured out how to change the user agent without altering the base code.

from requests_html import HTMLSession
# Sets the user agent to whatever you choose
session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"])
r = session.get(url)
r.html.render()

The "--no-sandbox" option is passed in by default, so you include it here to make sure it still gets through when you override the browser_args argument. Those arguments are passed to the Chromium session that is created when render() is called, not before: until you call render(), plain requests will still use the default requests user agent unless you change it via headers as I outlined in my previous comment.


I tested this solution with your original issue, and it fixed the problem!

I traded out the default Chromium user agent with a different one I found here: https://developers.whatismybrowser.com/useragents/explore/software_name/firefox/

I used the user agent Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 with the code from your original comment, and the JavaScript-loaded content came through. The site was evidently checking the user agent to decide whether you were a bot.
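To avoid hard-coding the flag, the user agent string can be kept in a variable and interpolated into it. A small sketch (note there are no extra quotes around the value after the `=`):

```python
# Build the browser_args list from a variable rather than a hard-coded flag.
ua = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
browser_args = ["--no-sandbox", f"--user-agent={ua}"]

print(browser_args[1])
```

The resulting list can then be passed as HTMLSession(browser_args=browser_args).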

This raises the question of whether the default user agent in the source should be changed to something more recent, but I think that depends on how many of these issues crop up. I'll leave that to Kenneth to decide.

Best of luck!

DocToime commented 5 years ago

This is fantastic, thank you! Would you be able to post the code that you got working? I'm getting a series of errors (see below) when trying to run render() with those arguments. I assume I'm doing something silly, but I've tried a number of different permutations and just can't get it to work:

from requests_html import HTMLSession

url = "https://register.epo.org/application?number=EP16190441&lng=en&tab=federated"
Testing = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"])
r = session.get(url)
r.html.render()
print(r.html.text)

Traceback (most recent call last):
  File "...test.py", line 8, in <module>
    r.html.render()
  File "...\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "C:\Program Files\Python37\Lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "...\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 537, in _async_render
    await page.close()
  File "~\AppData\Roaming\Python\Python37\site-packages\pyppeteer\page.py", line 1465, in close
    {'targetId': self._target._targetId})
pyppeteer.errors.NetworkError: Protocol error Target.closeTarget: Target closed.

identei commented 5 years ago

Yup! It's this part: "--user-agent='Testing'". You made a variable named Testing, but in my sample the flag contains the literal string 'Testing' — I literally set my user agent to "Testing". You have to replace that part of the flag with the user agent you actually want to use.

DocToime commented 5 years ago

That was my first assumption, but replacing 'Testing' with 'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1' (with or without those quotation marks, and also trying other user agents that I know work with Requests) gives the same error as the one I had above. Sorry to keep dragging this out!

session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1'"])

trekma commented 5 years ago

I can't use HTMLSession with browser_args at all; it raises __init__() got an unexpected keyword argument 'browser_args'. That error means the installed requests-html doesn't accept browser_args, most likely because the release predates that parameter, so upgrading requests-html should fix it.
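A quick way to check which release is installed (a sketch using only the standard library; note the PyPI distribution name is "requests-html" with a hyphen):

```python
from importlib import metadata  # standard library, Python 3.8+

# Look up the installed requests-html version, if any.
try:
    version = metadata.version("requests-html")
except metadata.PackageNotFoundError:
    version = None

print(version)  # e.g. "0.10.0", or None if not installed
```

If the reported version is older than the one documenting browser_args, `pip install --upgrade requests-html` is the first thing to try.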

DocToime commented 5 years ago

Thanks Trekma. That explains why I couldn't get it to work. Is there any workaround to change the default user agent as per the above?

lightiverson commented 5 years ago

Hi @LaurT . I'm facing a similar issue as yours and @Ecript 's solution worked perfectly for me! With a small addition I managed to get it working for your page as well.

By investigating the underlying traffic I found that the quotation marks around 'Testing' in session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"]) were being included in the GET request's headers. Removing the quotation marks did the trick.

The code below prints your webpage including javascript loaded content.

from requests_html import HTMLSession
session = HTMLSession(browser_args=["--no-sandbox", '--user-agent=Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1'])
r = session.get('https://register.epo.org/application?number=EP16190441&lng=en&tab=federated')
# This is necessary for your webpage in particular because it takes around 13 seconds for the page to load.
r.html.render(timeout=15)
print(r.html.text)
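To see why the quotation marks mattered: the flags are passed to Chromium programmatically, not through a shell, so shell-style quotes are not stripped and end up inside the header value itself. A minimal illustration of how the value after `=` is taken verbatim:

```python
# The value after '=' is taken verbatim; quotes are not stripped.
quoted = "--user-agent='Testing'"
unquoted = "--user-agent=Testing"

print(quoted.split("=", 1)[1])    # 'Testing'  (quotes leak into the header)
print(unquoted.split("=", 1)[1])  # Testing
```

A site comparing the header against known browser user agents would then see the stray quotes and reject the request.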