psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License
13.64k stars, 977 forks

Issue Rendering Javascript in a Thread #155

Open skamensky opened 6 years ago

skamensky commented 6 years ago

I'm having an issue calling the render function within a thread. It works perfectly for me outside of a thread but within a thread I get an error.

If this is truly a bug it should be reproducible using this snippet:

Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from threading import Thread
>>> from requests_html import HTMLSession
>>> def render_html():
...     session = HTMLSession()
...     r = session.get('http://python-requests.org/')
...     r.html.render()
...
>>> t = Thread(target=render_html)
>>> t.start()
>>> Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<stdin>", line 4, in render_html
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_html.py", line 572, in render
    self.session.browser  # Automatycally create a event loop and browser
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_html.py", line 679, in browser
    self.loop = asyncio.get_event_loop()
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\asyncio\events.py", line 694, in get_event_loop
    return get_event_loop_policy().get_event_loop()
  File "C:\Users\_REMOVED_\AppData\Local\Programs\Python\Python36\lib\asyncio\events.py", line 602, in get_event_loop
    % threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'Thread-1'.

oldani commented 6 years ago

When asyncio.get_event_loop() is called inside a thread other than the main thread, it raises this error. Do you need sessions to be unique per thread? If not, just do this:

>>> from threading import Thread
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> session.browser
>>> def render_html():
...     r = session.get('http://python-requests.org/')
...     r.html.render()
...
>>> t = Thread(target=render_html)
>>> t.start()

Otherwise, let me know and a fix could be made to allow what you want.
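The behaviour described in this reply can be reproduced without requests_html. The following is a minimal sketch using only the standard library; the `worker` function and `errors` list are names made up for the illustration.

```python
import asyncio
import threading

errors = []  # collects whatever the worker thread raises

def worker():
    try:
        # No event loop has been set for this thread, so asyncio refuses
        # to hand one out and raises RuntimeError.
        asyncio.get_event_loop()
    except RuntimeError as exc:
        errors.append(exc)

t = threading.Thread(target=worker)
t.start()
t.join()

print(errors)
```

Creating the browser (and with it the loop) in the main thread first, as in the snippet above, avoids the call to asyncio.get_event_loop() from inside the worker thread.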

skamensky commented 6 years ago

That gets rid of the error and works as far as I can tell.

How does the package know which browser tab to parse when other threads are accessing the same session instance? Am I at risk of the wrong virtual tabs/windows being parsed, since by their nature threads could be switching virtual tabs at the same time? I had this issue when I was using a single instance of a virtual Chrome browser with the selenium package.

Thanks for the tip!

oldani commented 6 years ago

Each time you call r.html.render, it creates a new browser page ("tab"), does everything you want (e.g. evaluates JS), and closes that tab, unless you want to interact with the page, in which case you pass keep_page=True to render. That behavior should keep each thread from interfering with another thread's tab.

One suggestion is to keep the number of simultaneous threads low, since each page represents a process in Chrome and a high number will consume a lot of resources.
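One way to follow the suggestion about keeping simultaneous renders low is to guard the render call with a semaphore. This is only a sketch: `fake_render` is a hypothetical stand-in for r.html.render(), and the cap of 2 and the example.com URLs are arbitrary assumptions.

```python
import threading
import time

MAX_RENDERS = 2  # assumed cap on simultaneous renders
render_slots = threading.BoundedSemaphore(MAX_RENDERS)

lock = threading.Lock()
active = 0  # renders currently in progress
peak = 0    # highest number of simultaneous renders observed

def fake_render(url):
    """Hypothetical stand-in for r.html.render(); it only sleeps."""
    global active, peak
    with render_slots:  # blocks while MAX_RENDERS renders are running
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # pretend to render the page
        with lock:
            active -= 1

threads = [
    threading.Thread(target=fake_render, args=(f"https://example.com/{i}",))
    for i in range(6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)
```

All six threads start immediately, but at most MAX_RENDERS of them are inside the guarded section at any moment.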

skamensky commented 6 years ago

I understand. So now my only question is: can we expect r.html.render to function properly if two separate threads open two tabs simultaneously and attempt to render the page in the virtual browser at the same time?

The reason I ask is because in selenium, you can only inject/execute javascript into a "tab" if the tab is active (i.e. selected) which means threads cannot inject/execute javascript into two tabs at the same moment.

eladbitton commented 6 years ago

I encountered the same problem, RuntimeError: There is no current event loop in thread 'Thread-1'. I tried @oldani's snippet in cmd and it's not working for me.

Using the latest Python (3.6.5) and the latest requests_html (0.9.0).

oldani commented 6 years ago

@eladbitton you forgot to run session.browser, look closely at the code above.

However, @skamensky, I realized there is another issue, related to the event loop, that won't allow what you want to achieve. To allow this, each thread needs to create its own event loop. That is the fix I was thinking of, even though it won't let you run many threads before running out of resources (a fix like this will run one Chromium process per thread). I suggest you wait for #146 to be merged and do this asynchronously instead of with threads.

I'm thinking of making this possible and adding a warning against doing it unless you are willing to sacrifice resources.

Commito commented 6 years ago

I have also encountered the same problem (There is no current event loop in thread 'Thread-4'.), except mine is in a Django app class. The render() function always raises an error. I've tried running render(keep_page=True) and session.browser with no success.

I'm running Django 2.0.3, Python 3.6.3, requests_html 0.9 and PyCharm Pro 2018.1. I'm using PyCharm's default virtual environment for Django.

oldani commented 6 years ago

I will add a fix for this

cfournies commented 5 years ago

I have the same error, but it only happens when I'm using it inside Django. If I run it locally, it works. Do you have any idea why?

oldani commented 5 years ago

Hi guys,

Yesterday we released v0.10.0, which now has full support for AsyncHTMLSession. You can use that session instead of the normal one and you won't have this kind of issue.

I still have to investigate the issue around Django. Can any of you give me more context on it, @cfournies @Commito?
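The asynchronous style mentioned here runs several jobs on one event loop rather than one thread per job. A minimal sketch with the standard library only: `fake_arender` is a hypothetical stand-in for fetching and rendering a page, and the URLs are placeholders.

```python
import asyncio

async def fake_arender(url):
    # Hypothetical stand-in for awaiting session.get(url) and then
    # r.html.arender(); here it only yields to the loop briefly.
    await asyncio.sleep(0.01)
    return f"rendered {url}"

async def main(urls):
    # gather() runs the coroutines concurrently on a single event loop
    # and returns their results in the same order as the inputs.
    return await asyncio.gather(*(fake_arender(u) for u in urls))

results = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
print(results)
```

Because everything runs in one thread on one loop, no per-thread event loop is needed in this style.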

Xyhlon commented 5 years ago

I got a similar error when starting multiple threads. Can you help? By the way, you are doing great work @oldani.

class Loader:
    def __init__(self, user_agent=UserAgent, proxies=None, retries=RETRIES, rest=REST, opener=None, cache=None, headers=None, fast=False):
        self.user_agent = user_agent
        self.proxies = proxies
        self.retries = retries
        self.opener = opener
        self.cache = cache
        self.headers = headers
        self.session = Session()
        self.empty = set()
        self.queue = dict()
        self.base = None
        self.htmlsession = HTMLSession()
        self.htmlsession.browser

    def ajaxload(self, url):
        r = self.htmlsession.get(url)
        r.html.render()
        pac = dict()
        pac['html'] = r.text
        pac['code'] = r.status_code
        print(r.url)

        return pac

cfournies commented 5 years ago

Hi @oldani, I can help you with the Django error; let me know what you need. The code doesn't work when it is used within the Django framework.

oldani commented 5 years ago

I think I know the key to the error here. It is the event loop policy: in these cases, we're going to have to create a new event loop per thread.
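The idea of one event loop per thread can be sketched with plain asyncio as follows. This is not the library's fix, only an illustration of the idea; `fake_render`, `worker` and the job names are hypothetical.

```python
import asyncio
import threading

results = {}

async def fake_render(name):
    # Hypothetical stand-in for the async work a render would do.
    await asyncio.sleep(0.01)
    return f"{name} done"

def worker(name):
    # Each thread creates its own loop and registers it as the current one,
    # so code that later calls asyncio.get_event_loop() in this thread works.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results[name] = loop.run_until_complete(fake_render(name))
    finally:
        loop.close()
        asyncio.set_event_loop(None)

threads = [threading.Thread(target=worker, args=(f"job-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Each thread closes its loop when it finishes, so loops are not leaked when many short-lived threads are used.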

sayoun commented 5 years ago

Hello @oldani

I have the same error using Flask: RuntimeError: There is no current event loop in thread 'Thread-2'. It happens when I use HTMLSession and then call session.browser inside a route, or when I try to use AsyncHTMLSession; both raise the error.

I'm not using threads or asyncio in my project; it's a simple Flask app with one route. Tell me if you want me to provide more logs/output/screens.
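The 'Thread-2' in the error suggests the route is being executed in a worker thread even though the application code starts none. One general pattern for that situation, sketched here with the standard library only, is to keep a single long-lived event loop in a background thread and submit coroutines to it from whichever thread handles the request. `fake_render` and `handle_request` are hypothetical names, and this has not been verified against requests_html itself.

```python
import asyncio
import threading

# One long-lived loop, run forever in a dedicated daemon thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def fake_render(url):
    # Hypothetical stand-in for the async fetch-and-render work.
    await asyncio.sleep(0.01)
    return f"rendered {url}"

def handle_request(url):
    """Hypothetical route handler body; safe to call from any thread."""
    # Schedule the coroutine on the background loop and block until it is done.
    future = asyncio.run_coroutine_threadsafe(fake_render(url), loop)
    return future.result(timeout=5)

out = []
t = threading.Thread(target=lambda: out.append(handle_request("https://example.com")))
t.start()
t.join()

loop.call_soon_threadsafe(loop.stop)  # shut the background loop down
print(out)
```

The worker thread never needs an event loop of its own, because it only hands work to the background loop and waits for the result.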

ShamanNguyen commented 5 years ago

Hello @cfournies, can you run r.html.render() in Django? I have tried many ways, but it still raises the exception There is no current event loop in thread.

jasonniebauer commented 5 years ago

I have the same issue in my Flask application.

CarreyC commented 4 years ago

I have the same issue on Flask

ShamanNguyen commented 4 years ago

@CarreyC It works when run from the command line.

434718954 commented 4 years ago

I have the same issue now that I use it in Django. When I add a loop in Django, I get an error: no signal in main thread.

NAveeN4416 commented 4 years ago

Each time you call r.html.render, it creates a new browser page ("tab"), does everything you want (e.g. evaluates JS), and closes that tab, unless you want to interact with the page, in which case you pass keep_page=True to render. That behavior should keep each thread from interfering with another thread's tab.

One suggestion is to keep the number of simultaneous threads low, since each page represents a process in Chrome and a high number will consume a lot of resources.

Can you please suggest how I can use this in the Django framework?

tingwei628 commented 4 years ago

I found this on stackoverflow.

Here is my workaround with Flask.

from requests_html import AsyncHTMLSession
import asyncio
import pyppeteer

async def get_post():
    # Create a new event loop and set it as the current loop for this thread.
    new_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(new_loop)
    session = AsyncHTMLSession()
    # Launch Chromium without signal handlers, which can only be
    # installed from the main thread.
    browser = await pyppeteer.launch({
        'ignoreHTTPSErrors': True,
        'headless': True,
        'handleSIGINT': False,
        'handleSIGTERM': False,
        'handleSIGHUP': False
    })
    session._browser = browser
    # your_query_url is a placeholder for the URL you want to fetch.
    resp_page = await session.get(your_query_url)
    await resp_page.html.arender()
    return resp_page

gelodefaultbrain commented 2 years ago

Was there a fix for this issue?

gelodefaultbrain commented 2 years ago

Hello, just wondering... was this issue fixed? Should I just reinstall the package?

abdullzz commented 2 years ago

@Têng Ûi may I know the full code on how you call this function? I still cannot make it work.

abdullzz commented 2 years ago

It's giving me this error:

RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'Launcher.killChrome' was never awaited

MrDarkness117 commented 1 year ago

@abdullzz, regarding the error you got with the workaround above (RuntimeError: Event loop is closed): it is returning a coroutine object for me instead of an HTML object. Did you possibly get that too?

MrDarkness117 commented 1 year ago

@abdullzz UPD: You probably need to call asyncio.run() on that function to get the result. Check whether you have done that.
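The difference between getting a coroutine object and getting the result can be shown with a toy coroutine. This sketch assumes nothing about requests_html; the `get_post` here is a trivial stand-in that returns a string.

```python
import asyncio
import inspect

async def get_post():
    # Trivial stand-in for the workaround's coroutine.
    await asyncio.sleep(0)
    return "page"

coro = get_post()                    # calling it only creates a coroutine object
is_coro = inspect.iscoroutine(coro)  # True: the body has not run yet
coro.close()                         # discard it to avoid a "never awaited" warning

result = asyncio.run(get_post())     # runs the coroutine and returns its value
print(is_coro, result)
```

Note that asyncio.run() takes the coroutine object, so the function has to be called with parentheses inside it.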

migonsa commented 7 months ago

await resp_page.html.arender() never returns...

Tamupiwa commented 4 months ago

The Flask workaround that tingwei628 posted above worked for me! Make sure to call asyncio.run(get_post()) to get the result instead of a coroutine object.