python / cpython

HTMLParser stops parsing upon encountering `<style>` tag #118350

Open savchenko opened 4 months ago

savchenko commented 4 months ago

Bug report

Bug description:

An example where parsing stops after the <style color="red">:

from html.parser import HTMLParser
from io import StringIO

class HTML2text(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = StringIO()
    def handle_data(self, html):
        self.data.write(html)
    def get_data(self):
        return self.data.getvalue().strip()

html_test = '''
<!DOCTYPE html>
<head><title>Glued</title></head><body><some><style color="red">title</bar>
<h1>Spacious             </h1><a href="https://heading.net">heading.net</a>
<span>not<a href="https://www.arpa.home">my.home.arpa</a><p>        URL</p>
</body></html>
'''

parser = HTML2text()
parser.feed(html_test)
print(parser.get_data())

Changing a single character in the word "style" restores the normal functionality.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

JelleZijlstra commented 4 months ago

Isn't this because you didn't close your <style> tag? If I remember correctly, the content of a style element runs until </style> is seen, regardless of any other tag-like text inside it, because it may contain text in other languages.
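
A minimal sketch of that behaviour, assuming only the standard `html.parser.HTMLParser` API. The `StartTagCollector` class and `collect_start_tags` function are hypothetical helpers written for this illustration, not part of the original report. The sketch records which start tags the parser reports for a closed and an unclosed `<style>` element:

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag reported."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to a fresh parser and return the start tags it saw."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


closed = '<style color="red">p { color: red }</style><h1>Heading</h1>'
unclosed = '<style color="red">p { color: red }<h1>Heading</h1>'

print(collect_start_tags(closed))    # ['style', 'h1']
print(collect_start_tags(unclosed))  # ['style']
```

With the closing `</style>` present, the `<h1>` after it is reported as a tag. Without it, everything after `<style ...>` is treated as style content, so no further start tags are reported. What happens to that trailing text (dropped, or delivered to `handle_data`) may differ between Python versions, so the sketch deliberately checks only start tags.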

savchenko commented 4 months ago

@JelleZijlstra, indeed! Closing <style> allows the snippet to be parsed. However, isn't this inconsistent with the behaviour observed when parsing other tags?

For example, this broken HTML is parsed correctly:

<head><title>Rebelious<h1>Heading<a href="https://example.net">example.net
<span>not<a href="https://www.arpa.home">arpa.home<p>Paragraph<h2>and more

vadmium commented 4 months ago

The difference is that