psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

Exception: Execution context was destroyed, most likely because of a navigation. #251

Open · WASHEDDEVELOPEUR opened this issue 5 years ago

WASHEDDEVELOPEUR commented 5 years ago
from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://footdistrict.com/en/quickview/index/view?pid=139669')
r.html.render()

Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    r.html.render()
  File "C:\Python36\lib\site-packages\requests_html.py", line 583, in render
    content, result, page = self.session.loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "C:\Python36\lib\asyncio\base_events.py", line 468, in run_until_complete
    return future.result()
  File "C:\Python36\lib\site-packages\requests_html.py", line 564, in _async_render
    content = await page.content()
  File "C:\Python36\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "C:\Python36\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "C:\Python36\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "C:\Python36\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "C:\Python36\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "C:\Python36\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
csawtelle commented 5 years ago

Is there currently a workaround for this?

sambling commented 5 years ago

I have this same issue and am wondering if anyone has a solution?

https://github.com/GoogleChrome/puppeteer/issues/3323

VKen commented 5 years ago

I got something working for a specific case of a webpage redirect.

At the time of writing, my software and package versions are:

Python==3.7.3
requests-html==0.10.0
pyppeteer==0.0.25

# for ipython notebook asyncio issues
tornado==4.5.3

Here's an excerpt of the sample target page content, which redirects using both JavaScript and a meta tag:

<script>url="http://example.com/somewhereelse";window.location.assign(url)</script>
<noscript><meta http-equiv="refresh" content="0; url=http://example.com/somewhereelse"></noscript>

The code I ran which errored was:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://mysite.com")
r.html.render()

The above code results in:

NetworkError: Execution context was destroyed, most likely because of a navigation.

If we look carefully at the documentation:

>>> help(r.html.render)

Help on method render in module requests_html:

render(retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False) method of requests_html.HTML instance
    Reloads the response in Chromium, and replaces HTML content
    with an updated version, with JavaScript executed.

    :param retries: The number of times to retry loading the page in Chromium.
    :param script: JavaScript to execute upon page load (optional).
    :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).
    :param scrolldown: Integer, if provided, of how many times to page down.
    :param sleep: Integer, if provided, of how many long to sleep after initial render.
    :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.
    :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.

    If ``scrolldown`` is specified, the page will scrolldown the specified
    number of times, after sleeping the specified amount of time
    (e.g. ``scrolldown=10, sleep=1``).

    If just ``sleep`` is provided, the rendering will wait *n* seconds, before
    returning.

The key thing here is the sleep parameter.

A few points to note:

  1. The above target page sample shows the meta refresh is content="0;..., which means a 0-second wait before redirecting the page.
  2. Looking at the JavaScript code, there's no wait/sleep/delay either.
  3. At current hardware and internet speeds, I don't expect headless Chromium to take more than 1 second to refresh/redirect and load the target page (unless it is a big page or there are several further redirects).

Therefore, a 1-second wait is a reasonable amount of time to give render() before it returns.

In addition, we have to use keep_page to extract a crucial piece of information, shown later.

Changing the render() call to:

r.html.render(sleep=1, keep_page=True)

allowed the code to run without issues. If it still errors (due to slow network speed, a busy CPU, etc.), try again with a higher sleep.

To find out the redirected page's URL:

>>> r.html.page.url

http://example.com/somewhereelse
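
Putting the pieces together, a minimal end-to-end sketch of the workaround looks like this (the URL is a placeholder for a page that redirects, and sleep may need to be higher on slow connections):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://mysite.com")  # placeholder for a page that redirects

# sleep=1 gives headless Chromium time to follow the redirect;
# keep_page=True keeps the pyppeteer page around so we can inspect it afterwards.
r.html.render(sleep=1, keep_page=True)

print(r.html.page.url)    # final URL after the redirect
print(r.html.html[:200])  # start of the rendered HTML of the target page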

This issue is about page redirects causing errors, and along that line of thought: although the above solution works, it's clunky to have to implement a try-except loop that retries with an increasing sleep time, as sketched below.
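
For completeness, here is a rough sketch of such a retry loop (my own sketch, not part of requests-html; it assumes render() can simply be called again on the same response after a failure, and the sleep values are arbitrary):

from requests_html import HTMLSession
from pyppeteer.errors import NetworkError

session = HTMLSession()

def render_with_backoff(url, sleeps=(1, 2, 4, 8)):
    # Retry render() with an increasing sleep until the redirect settles.
    r = session.get(url)
    for sleep in sleeps:
        try:
            r.html.render(sleep=sleep, keep_page=True)
            return r
        except NetworkError:
            # "Execution context was destroyed" usually means the redirect was
            # still in flight when the content was read -- retry with a longer sleep.
            continue
    raise RuntimeError("page did not settle after {} attempts".format(len(sleeps)))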

I'm still trying to find an equivalent of the "window.onload" event so that the sleep can be automatic, i.e. a dynamic wait for the headless browser to "ping back" when it's done, rather than the current approach of Python incrementally "polling" with longer sleeps to check whether the redirect has completed and the target URL has been reached.
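
One possible direction (an untested sketch on my part: it assumes keep_page=True leaves the pyppeteer page available as r.html.page, reuses the session's event loop, and that pyppeteer's waitForNavigation behaves like puppeteer's; if the navigation already finished, it simply times out):

from requests_html import HTMLSession
from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError

session = HTMLSession()
r = session.get("http://mysite.com")  # placeholder for a page that redirects
r.html.render(sleep=1, keep_page=True)

try:
    # Ask the browser to report the navigation itself instead of guessing a sleep.
    session.loop.run_until_complete(
        r.html.page.waitForNavigation({"waitUntil": "networkidle0", "timeout": 10000})
    )
except PyppeteerTimeoutError:
    # No further navigation within 10s -- the redirect had already settled.
    pass

print(r.html.page.url)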

I'm all ears to better methods if anyone comes up with any.