python / cpython

The Python programming language
https://www.python.org

HTMLParser handle_starttag replaces entity references in attribute value even without semicolon #69426

Open abad39fc-470e-449f-8316-aeb234563ab3 opened 9 years ago

abad39fc-470e-449f-8316-aeb234563ab3 commented 9 years ago
BPO 25239
Nosy @ezio-melotti
Files
  • parserentity.py: an example of the behavior described
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at =
    labels = ['type-bug', 'library']
    title = 'HTMLParser handle_starttag replaces entity references in attribute value even without semicolon'
    updated_at =
    user = 'https://bugs.python.org/frogcoder'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'frogcoder'
    dependencies = []
    files = ['40588']
    hgrepos = []
    issue_num = 25239
    keywords = []
    message_count = 2.0
    messages = ['251654', '251657']
    nosy_count = 2.0
    nosy_names = ['ezio.melotti', 'frogcoder']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25239'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']
    ```

    abad39fc-470e-449f-8316-aeb234563ab3 commented 9 years ago

    The documentation for HTMLParser.handle_starttag states: "All entity references from html.entities are replaced in the attribute values." However, the parser also replaces any text that matches an ampersand followed by an entity name, even when the terminating semicolon is missing.

    For example, <a href="go?t=buy&currency=usd">foo</a> will produce "go?t=buy¤cy=usd" as the value of the href attribute, because "curren" is the entity name for the currency sign.
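A minimal reproduction sketch of the report above, using only the standard library. The class name `AttrCollector` is a hypothetical helper, not part of the attached parserentity.py. The printed value depends on the Python version: affected versions show the currency sign in place of `&curren`, while a version that implements the HTML5 attribute rule would leave the text unchanged.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records every attribute passed to handle_starttag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already replaced
        self.attrs.extend(attrs)


parser = AttrCollector()
parser.feed('<a href="go?t=buy&currency=usd">foo</a>')
parser.close()

# Inspect what the parser reported for href
print(parser.attrs)
```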

    ezio-melotti commented 9 years ago

    This does indeed seem to be a bug. The relevant passage is at http://www.w3.org/TR/html5/syntax.html#consume-a-character-reference :

    If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
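The attribute rule quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the fix adopted in CPython: the function name `unescape_attribute` and the regular expression are hypothetical, and only named references are handled (numeric references such as `&#164;` are left untouched). It relies on the `html5` mapping in `html.entities`, which lists legacy names both with and without the trailing semicolon.

```python
import re
from html.entities import html5

# "&" followed by a maximal run of ASCII alphanumerics and an optional ";".
# Because the run is greedy, the character after an unterminated match
# can never be alphanumeric; only the "=" case needs an explicit check.
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def unescape_attribute(value):
    """Hypothetical sketch: replace named references in an attribute value."""

    def replace(match):
        name, semicolon = match.group(1), match.group(2)
        if semicolon:
            # Properly terminated reference: replace it if the name is known.
            return html5.get(name + ";", match.group(0))
        next_char = value[match.end():match.end() + 1]
        if next_char == "=":
            # Unterminated and followed by "=": leave the text unconsumed.
            return match.group(0)
        # Unterminated legacy name (e.g. "&copy ") followed by something else.
        return html5.get(name, match.group(0))

    return _NAMED_REF.sub(replace, value)


print(unescape_attribute("go?t=buy&currency=usd"))
print(unescape_attribute("go?t=buy&amp;currency=usd"))
```

With this sketch, `&currency=usd` is left alone because the alphanumeric run `currency` is not itself an entity name and is followed by `=`, while `&amp;` is still decoded to `&`.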

    Off the top of my head, this paragraph is not implemented in HTMLParser (and it should be). Also note that <a href="go?t=buy&currency=usd">foo</a> is not valid HTML; the & should have been escaped as &amp;.
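For authors generating such links, the escaping mentioned above can be done with `html.escape` from the standard library. A short sketch, where the variable names are illustrative only:

```python
import html

url = "go?t=buy&currency=usd"

# Escape "&" (and quotes) before placing the URL in an attribute value
escaped = html.escape(url, quote=True)
link = '<a href="{}">foo</a>'.format(escaped)
print(link)

# Decoding the escaped value recovers the original URL
print(html.unescape(escaped) == url)
```

Once the ampersand is written as `&amp;`, the attribute value no longer contains an unterminated `&curren`, so the ambiguity does not arise.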