Closed a3ee0543-42eb-4e6c-9627-ce337695840f closed 2 years ago
$ pip freeze | grep beautifulsoup4
beautifulsoup4==4.6.3
$ python
Python 3.7.0 (default, Jul 23 2018, 20:24:19)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> text = "<![hi world]>"
>>> BeautifulSoup(text, 'html.parser')
/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py:78: UserWarning: unknown status keyword 'hi ' in marked section
warnings.warn(msg)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 282, in __init__
self._feed()
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 343, in _feed
self.builder.feed(self.markup)
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py", line 247, in feed
parser.feed(markup)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 111, in feed
self.goahead(0)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_markupbase.py", line 160, in parse_marked_section
if not match:
UnboundLocalError: local variable 'match' referenced before assignment
>>>
Thanks for the report.
HTMLParser.error() was supposed to raise an exception, but the BeautifulSoup project just prints a warning here:
def error(self, msg):
warnings.warn(msg)
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L69
As a result of this, the code doesn't stop executing in the following branch:
else:
self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
if not match:
return -1
https://github.com/python/cpython/blob/3.7/Lib/_markupbase.py#L159
Note that HTMLParser.error() was removed in Python 3.5 (https://github.com/python/cpython/commit/73a4359eb0eb624c588c5d52083ea4944f9787ea#diff-1a7486df8279dbac7f20abd487947845L171) and there is an open issue about the status of _markupbase.ParserBase.error(): bpo-31844.
I also think that https://github.com/python/cpython/commit/73a4359eb0eb624c588c5d52083ea4944f9787ea#diff-1a7486df8279dbac7f20abd487947845L171 may have caused a minor regression when it was removed the error() method and its uses from the HTMLParser class. It still calls the parse_marked_section() method of _markupbase.ParserBase() which it then calls the error() method of _markupbase.ParserBase():
elif rawdata[i:i+3] == '<![':
return self.parse_marked_section(i)
https://github.com/python/cpython/blob/3.7/Lib/html/parser.py#L264
Attaching a test case for this issue since the self.error code path was not covered in the current test suite. A PR is open proposing match variable initialisation with None https://github.com/python/cpython/pull/17643 .
diff --git Lib/test/test_htmlparser.py Lib/test/test_htmlparser.py
index a2bfb39d16..a9aff11706 100644
--- Lib/test/test_htmlparser.py
+++ Lib/test/test_htmlparser.py
@@ -766,6 +766,18 @@ class AttributesTestCase(TestCaseBase):
[("href", "http://www.example.org/\">;")]),
("data", "spam"), ("endtag", "a")])
+ def test_invalid_keyword_error_exception(self):
+ class InvalidMarkupException(Exception):
+ pass
+
+ class MyHTMLParser(html.parser.HTMLParser):
+
+ def error(self, message):
+ raise InvalidMarkupException(message)
+
+ parser = MyHTMLParser()
+ with self.assertRaises(InvalidMarkupException):
+ parser.feed('<![invalid>')
if __name__ == "__main__":
unittest.main()
(Author of PR https://github.com/python/cpython/pull/17643)
Since the behavior of self.error() is determined by the subclass implementation, an Exception is not guaranteed. How should this be handled? It seems the options are:
Happy to update the PR with @xtreak's test cases.
HTMLParser is supposed to follow the HTML5 standard, and never raise an error.
For the example in the first comment ("\<![hi world]>"), the steps should be:
I agree that the error should be fixed by setting match
to None, and a test case that triggers the UnboundLocalError (before the fix) should be added as well (what provided by Karthikeyan looks good).
However, it also seems wrong that HTMLParser ends up calling self.error() through Lib/_markupbase.py ParserBase after HTMLParser.error() and all the calls to it have been removed. _markupbase.py is internal, so it should be safe to remove ParserBase.error() and the code that calls it as suggested in bpo-31844 (and possibly to merge _markupbase into html.parser too). Even if this is done and the call to self.error() is removed from ParserBase.parse_marked_section(), match
still needs to be set to None (either in the else
branch or before the if/elif
block).
_markupbase.py is internal, so it should be safe to remove ParserBase.error() and the code that calls it as suggested in bpo-31844
Should I reopen https://github.com/python/cpython/pull/8562 then?
I think so. It might be worth double-checking if BeautifulSoup (and possibly other 3rd party libs) use _markupbase.py and/or ParserBase.error(). If they do, giving them a heads up and/or going through a regular deprecation process might be good, otherwise PR 8562 can be reopened and merged after removing the call to self.error() in ParserBase.parse_marked_section().
Here's a simplified real world example that I found if it helps anyone:
import bs4
html = '''<html><head><!fill in here--> <![end--><!-- start: SSI y-xtra-topspace.shtml --><!-- --></head><body></body></html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
For reference, my Python version is 3.8.2 and my bs4 version is 4.8.1
I updated to bs4 version 4.9.0 and the same issue occurs.
I'm the maintainer of Beautiful Soup. I learned about this issue when one of my users filed a bug for it against Beautiful Soup (https://bugs.launchpad.net/beautifulsoup/+bug/1883264).
BeautifulSoupHTMLParser (my subclass of html.parser.HTMLParser) implements error() only to prevent the NotImplementedError raised by ParserBase.error(). The docstring for my error() implementation says: "this method is called only on very strange markup and our best strategy is to pretend it didn't happen and keep going."
It shouldn't affect Beautiful Soup if, in a future version of Python, HTMLParser.error is never called and ParserBase is removed.
Based on the changelog, it seems that the fix for this issue has only landed in Python 3.10 but it's still affecting versions 3.7, 3.8 and 3.9. There are a few projects such as Apache Beam that have only recently added support for Python 3.9 with no support for 3.10. Beautiful Soup is a popular tool that's affected by this bug and so my question is if backporting this bugfix to at least Python 3.9 which is still an officially supported release is on the table?
Marek, I can merge a fix for 3.9 for this. I don't think we should be removing _markupbase.ParserBase.error() in 3.9 as python/cpython#52808 is doing. So we'd need a new patch, like python/cpython#61843. However, that will need a test as well.
Would you be interested in creating the patch?
Alright, the PR is up. This is my first contribution to Python and I had to sign the Contributor Agreement so it may take 1 business day before you're able to accept it. In the meantime, can you please review and let me know if anything needs to change? I wasn't sure where / how to add an entry in the NEWS.d folder. Can you please help me? Thanks!
any question about this issue? if not, i think it is better to close it
This has been fixed in these PRs so it can be closed:
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/ezio-melotti' closed_at = None created_at =
labels = ['3.7', 'type-bug', 'library']
title = '_markupbase.py fails with UnboundLocalError on invalid keyword in marked section'
updated_at =
user = 'https://bugs.python.org/kodial'
```
bugs.python.org fields:
```python
activity =
actor = 'mareksuscak'
assignee = 'ezio.melotti'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'kodial'
dependencies = []
files = ['49070']
hgrepos = []
issue_num = 34480
keywords = ['patch']
message_count = 14.0
messages = ['323962', '323966', '358616', '359301', '359356', '359362', '359428', '366648', '366649', '366650', '371418', '416130', '416546', '416582']
nosy_count = 10.0
nosy_names = ['ezio.melotti', 'lukasz.langa', 'berker.peksag', 'corona10', 'xtreak', 'kodial', 'jkamdjou', 'leonardr', 'Pikamander2', 'mareksuscak']
pr_nums = ['17643', '8562', '32256']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34480'
versions = ['Python 3.7']
```