kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

Empty page for incorrect page #26

Closed tonal closed 3 years ago

tonal commented 3 years ago

URL: https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia

Content:

<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8">
...
tart{from{visibility:hidden}to{visibility:visible}}</style>
<noscript><style amp-boilerplate>body{-webkit-animation:none;-moz-animation:none;-ms-animation:none;animation:none}>/style></noscript>

</head>
<html>
<body id="page" class="yoopage  column-right "><div class="sm-pusher"><div class="sm-content"><div class="sm-content-inner">
    <header><div class="sliderarea">
...

Code:

from html5_parser import parse
from lxml.etree import tostring
root = parse("https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia")
print(tostring(root))

Output:

<html><head/><body>https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia</body></html>

kovidgoyal commented 3 years ago

parse() does not download URLs. It treats its argument as markup, which is why the URL string ended up as the text of the body. You need to pass it the actual HTML, not a URL.
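
For anyone landing here with the same problem, a minimal sketch of fetching the page first and then parsing the downloaded bytes. The helper names fetch_html and parse_url are hypothetical and not part of html5_parser. The download uses the standard library's urllib, and the sketch assumes the server reports a usable charset in its Content-Type header (if it does not, None is passed and the parser is left to work out the encoding on its own).

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download url and return (body_bytes, charset_or_None). Hypothetical helper."""
    with urlopen(url, timeout=timeout) as resp:
        # Charset declared in the Content-Type header, if one was sent.
        charset = resp.headers.get_content_charset()
        return resp.read(), charset


def parse_url(url):
    """Fetch url, then hand the downloaded markup to html5_parser.parse()."""
    # Imported here so that fetch_html can be used without html5_parser installed.
    from html5_parser import parse

    body, charset = fetch_html(url)
    # transport_encoding passes along the encoding reported by the HTTP layer.
    return parse(body, transport_encoding=charset)


# Usage (requires network access):
# from lxml.etree import tostring
# root = parse_url("https://harvia-top.ru/catalog/elektrokamenki/elektrokamenki-harvia")
# print(tostring(root))
```

Any HTTP client would do for the download step. The only point is that parse() receives the page content rather than its address.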