kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0
678 stars, 33 forks

Break tag presentation #27

Closed: tonal closed this issue 3 years ago

tonal commented 3 years ago

See #26

Chrome renders the page normally. URL: https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia

Content:

<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8">
...
<noscript><style amp-boilerplate>body{-webkit-animation:none;-moz-animation:none;-ms-animation:none;animation:none}>/style></noscript>

</head>
<html>
<body id="page" class="yoopage  column-right "><div class="sm-pusher"><div class="sm-content"><div class="sm-content-inner">
    <header><div class="sliderarea">
...

Code:

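from urllib.request import urlopen

from html5_parser import parse
from lxml.etree import tostring

# Fetch the page and parse it into an lxml tree.
url = 'https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia'
root = parse(urlopen(url).read())
# Serialize the tree back to markup to inspect what the parser built.
print(tostring(root, encoding='unicode', pretty_print=True))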

Output:

<html>
  <head>
<meta charset="utf-8"/>
...
<noscript><style amp-boilerplate="">body{-webkit-animation:none;-moz-animation:none;-ms-animation:none;animation:none}&gt;/style&gt;&lt;/noscript&gt;

&lt;/head&gt;
&lt;html&gt;
&lt;body id="page" class="yoopage  column-right "&gt;&lt;div class="sm-pusher"&gt;&lt;div class="sm-content"&gt;&lt;div class="sm-content-inner"&gt;
...

kovidgoyal commented 3 years ago

That will be because the page is loaded via JavaScript. mechanize does not support JavaScript.
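
For readers comparing the Content and Output snippets: the page source closes the style element with `>/style>` rather than `</style>`, and the escaped output begins exactly at that point. Under the HTML Standard, style content is raw text that runs until a literal `</style`, and a browser with scripting enabled treats `<noscript>` inside `<head>` as raw text, which may be why Chrome is unaffected. Whether this typo is the actual cause is an assumption drawn from the posted snippets, not something confirmed in the thread. The sketch below uses only the standard library, and the helper name `unclosed_raw_text_elements` is hypothetical; it simply flags raw-text elements that are opened but never closed.

```python
import re
from typing import List

# Elements whose content is raw text: it continues until a literal "</name".
RAW_TEXT_ELEMENTS = ("style", "script")


def unclosed_raw_text_elements(html: str) -> List[str]:
    """Return the names of raw-text elements that are opened but never closed."""
    lowered = html.lower()
    unclosed = []
    for name in RAW_TEXT_ELEMENTS:
        open_re = re.compile(r"<%s[\s>/]" % name)
        close_re = re.compile(r"</%s[\s>/]" % name)
        pos = 0
        while True:
            opened = open_re.search(lowered, pos)
            if opened is None:
                break
            closed = close_re.search(lowered, opened.end())
            if closed is None:
                # No closing tag follows: the rest of the document would be
                # swallowed as this element's text content.
                unclosed.append(name)
                break
            pos = closed.end()
    return unclosed


# Shortened version of the markup quoted in the issue (typo preserved).
snippet = (
    "<noscript><style amp-boilerplate>body{animation:none}>/style></noscript>\n"
    "</head>\n<html>\n<body id=\"page\">"
)
print(unclosed_raw_text_elements(snippet))  # ['style']
```

This is a plain text scan, not an HTML parser, so it can misfire on tag-like strings inside comments or attribute values. It is meant only as a quick check before filing a parser bug.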