psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

render not working #540

Open StefanHri opened 1 year ago

StefanHri commented 1 year ago

Hi

I have the following code:

from requests_html import HTMLSession

session = HTMLSession()

url = "https://www.wirelesspowerconsortium.com/products"
r = session.get(url)

r.html.render()  # this executes the JavaScript?
print(r.html.html)

which prints:

<!DOCTYPE html>
<html>
   <head>
      <meta charset="utf-8">
      <meta http-equiv="X-UA-Compatible" content="IE=edge">
      <meta name="viewport" content="width=device-width,initial-scale=1,user-scalable=no">
      <meta name="description" content="">
      <!--[if IE]>
      <link rel="icon" href="/favicon.ico">
      <![endif]-->
      <title></title>
      <script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/vpEprwpCoBMgy-fvZET0Mz6L/recaptcha__en.js" crossorigin="anonymous" integrity="sha384-jffSm4FBmQyLvL1V8BXFUBdZCFkPLi8N+X9NGYs2YKU4uUiYzy53t/3mlwj1fdwI"></script><script defer="defer" src="/js/chunk-vendors.71783972.js"></script><script defer="defer" src="/js/app.ba4ecc78.js"></script>
      <link href="/css/app.adf1537b.css" rel="stylesheet">
      <link rel="icon" type="image/svg+xml" href="/img/icons/favicon.svg">
      <link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
      <link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
      <link rel="manifest" href="/manifest.json">
      <meta name="theme-color" content="#4DBA87">
      <meta name="apple-mobile-web-app-capable" content="no">
      <meta name="apple-mobile-web-app-status-bar-style" content="default">
      <meta name="apple-mobile-web-app-title" content="wpc-vue3">
      <link rel="apple-touch-icon" href="/img/icons/apple-touch-icon-152x152.png">
      <link rel="mask-icon" href="/img/icons/safari-pinned-tab.svg" color="#4DBA87">
      <meta name="msapplication-TileImage" content="/img/icons/msapplication-icon-144x144.png">
      <meta name="msapplication-TileColor" content="#000000">
   </head>
   <body>
      <noscript><strong>We're sorry but wpc doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript>
      <div id="app"></div>
      <script src="https://www.recaptcha.net/recaptcha/api.js?onload=vueRecaptchaApiLoaded&amp;render=explicit" async="" defer="defer"></script>
   </body>
</html>

I am interested in the content of the body, but it looks like it is not rendered correctly. What am I doing wrong?
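One way to confirm that the output above is an unrendered shell is to check whether the app's mount point (`<div id="app">` in the printed HTML) has any children. The following is a minimal sketch using only the standard library; `MountPointChecker` and `is_unrendered_shell` are made-up names for illustration and are not part of requests-html.

```python
from html.parser import HTMLParser


class MountPointChecker(HTMLParser):
    """Records whether the element with a given id has any child content."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0            # > 0 while the parser is inside the target element
        self.found = False        # True once the target element has been seen
        self.has_content = False  # True if a child tag or non-blank text was seen

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Any tag nested inside the target counts as rendered content.
            self.has_content = True
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.has_content = True


def is_unrendered_shell(html, element_id="app"):
    """Return True if the mount point exists but is still empty."""
    checker = MountPointChecker(element_id)
    checker.feed(html)
    return checker.found and not checker.has_content
```

Calling `is_unrendered_shell(r.html.html)` after `r.html.render()` and getting `True` would suggest that the page's scripts never filled in the mount point.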

vvaezian commented 1 year ago

I have the same issue with the following URL: https://www.bcliquorstores.com/product-catalogue

from requests_html import AsyncHTMLSession

url = "https://www.bcliquorstores.com/product-catalogue"

headers = {
    'Host': 'www.bcliquorstores.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.7,fa;q=0.3',
}

asession = AsyncHTMLSession()


async def fetch():
    # Fetch the page, then render its JavaScript in headless Chromium.
    r = await asession.get(url, headers=headers)
    await r.html.arender()
    return r.html.html


# run() takes coroutine functions and returns a list of their results.
res = asession.run(fetch)[0]

surister commented 1 year ago

I created a small tool to test cases like this, since issues like this are not uncommon and some of them are bound not to be the library's fault. That is the nature of crawling: sometimes, for your request to go through, you have to apply different techniques, such as User-Agent spoofing, HTTP/2, TLS spoofing, proxies, etc.

[two images attached]

You can see that the response length is identical, which means the problem is in the JS rendering part (pyppeteer).

I will keep investigating, but bear in mind that SPAs usually get their data from other APIs, so it might be interesting to see whether those APIs are publicly available 😉.
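If such an API does turn out to be public, the data can be requested directly without rendering anything. The sketch below is only an illustration. `API_URL` is a hypothetical placeholder, not an endpoint of either site mentioned in this thread. The real URL would have to be found in the browser's network tab. `fetch_json` is likewise a made-up helper name.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
API_URL = "https://example.com/api/products"


def fetch_json(url, opener=urlopen, timeout=10):
    """GET a URL and decode the response body as JSON.

    `opener` is injectable so the function can be exercised without network access.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be something like `data = fetch_json(API_URL)`. Whether the server accepts such a request without extra headers or tokens depends entirely on the site.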

ajatkj commented 1 year ago

I was having the same issue and found out that it was caused by pyppeteer using a (very) old version of Chromium. Once I upgraded the Chromium browser, things worked as expected.

  1. Delete the existing Chromium installation
  2. Install the latest version of Chromium by setting PYPPETEER_CHROMIUM_REVISION to the latest revision. You can find the latest revision here.
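The second step can be sketched in Python as follows. This assumes that pyppeteer reads `PYPPETEER_CHROMIUM_REVISION` when it is first imported, so the variable has to be set before `requests_html` is imported. `set_chromium_revision` and `render_page` are made-up helper names, and the revision number is deliberately left as a parameter because it has to be looked up.

```python
import os


def set_chromium_revision(revision: str) -> str:
    """Export the Chromium revision for pyppeteer and return the value set."""
    if not revision.isdigit():
        raise ValueError("expected a numeric Chromium revision")
    os.environ["PYPPETEER_CHROMIUM_REVISION"] = revision
    return revision


def render_page(url: str, revision: str) -> str:
    """Render a page with the requested Chromium revision (needs requests-html)."""
    set_chromium_revision(revision)
    # Import only after the variable is set (assumption: it is read at import time).
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # the first call downloads Chromium if it is not present yet
    return r.html.html
```

Setting the variable in the shell before starting Python should have the same effect, and avoids depending on import order.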

Let me know if this works.

keisanng commented 1 year ago

Same issue here