Open wahlm opened 2 years ago
Good find! It seems that the Python HTML library maps both `&cent` and `&cent;` to ¢ (https://github.com/python/cpython/blob/3.10/Lib/html/entities.py#L522-L523), which it apparently does for some entities according to the docs: https://docs.python.org/3/library/html.entities.html#html.entities.html5. So not really a bug, but definitely undesired behavior. Needs checking whether the Python `HTMLParser` can be configured to ignore entities that do not close with a semicolon.
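A minimal sketch to confirm the mapping, using only the standard library `html.entities.html5` table and `html.unescape`. The query string is shortened from the URL in the report. Note that `html.unescape` is not necessarily the exact code path the importer uses; it only illustrates how the unterminated `&cent` prefix is matched.

```python
import html
from html.entities import html5

# The table contains the entity both with and without the trailing semicolon.
has_legacy = "cent" in html5
has_terminated = "cent;" in html5
print(has_legacy, has_terminated)

# html.unescape falls back to the longest known prefix, so "&center"
# is read as "&cent" followed by the literal text "er".
query = "?z=13&center=51.098779%2C14.340334"
decoded = html.unescape(query)
print(decoded)
```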
Edit: This is actually in the HTML5 spec: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
`HTMLParser` has a `convert_charrefs` setting to disable unescaping character data, which for example is applied here: https://github.com/python/cpython/blob/224cd0c3bf2e24f7731fb0b6f31b3839641307d7/Lib/html/parser.py#L161-L164
However the setting is ignored when parsing attribute values 😕: https://github.com/python/cpython/blob/224cd0c3bf2e24f7731fb0b6f31b3839641307d7/Lib/html/parser.py#L324-L325
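A small sketch to check this locally. `HrefCollector` is a hypothetical helper class written for this illustration, and the result depends on the Python version: with the parser code linked above, the collected value is expected to contain `¢er` even though `convert_charrefs=False` is passed, while other releases may behave differently.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: records every href attribute value it sees."""

    def __init__(self):
        # convert_charrefs=False only affects character data, not attributes.
        super().__init__(convert_charrefs=False)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":  # attribute names are lowercased by the parser
                self.hrefs.append(value)


collector = HrefCollector()
collector.feed(
    '<A HREF="https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334">route</A>'
)
collector.close()
print(collector.hrefs[0])
```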
Looking further into this, there is a related bug report that indicates that in attribute values, character entities not terminated with a semicolon should not be parsed if they are followed by `=` or an ASCII alphanumeric character.
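The rule can be sketched as a standalone function. `unescape_attribute` is a hypothetical helper, not part of the standard library and not the change proposed upstream; it assumes the attribute value is available in raw form and only relies on `html.unescape` and the `html.entities.html5` table.

```python
import html
import re
from html.entities import html5

# Numeric references, or a named reference made of ASCII letters and digits,
# each with an optional trailing semicolon.
_CHARREF = re.compile(r"&(#[0-9]+;?|#[xX][0-9a-fA-F]+;?|[A-Za-z][A-Za-z0-9]*;?)")


def unescape_attribute(value: str) -> str:
    """Hypothetical helper: decode references using the attribute-value rule."""

    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric references are decoded as usual.
            return html.unescape(match.group(0))
        if ref.endswith(";"):
            # Properly terminated: decode only if the full name is known.
            return html5.get(ref, match.group(0))
        # Unterminated: the regex consumed the whole alphanumeric run, so a
        # shorter legacy prefix would be followed by an alphanumeric character
        # and must stay untouched. Decode only an exact legacy name that is
        # not followed by "=".
        next_char = value[match.end():match.end() + 1]
        if ref in html5 and next_char != "=":
            return html5[ref]
        return match.group(0)

    return _CHARREF.sub(replace, value)


url = "https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334"
print(unescape_attribute(url))
```

With this rule the `&center` in the URL is left alone, while `&amp;` and a trailing `&cent` are still decoded.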
Note: The `&` in the href attribute of the BookmarkHtmlTag should be escaped as `&amp;`. So this is most likely a bug in the software generating the HTML export.
I forgot to mention that the source for the import was an HTML export from Firefox. I checked again, and for every URL (found in an `<A HREF="url">` tag) containing an ampersand, no escaping to `&amp;` is done. Whether this is valid and still allowed is a matter of controversy, and the standards seem to be a bit vague on this point.
Anyway, for any ampersand found at a different position the escaping is done.
So this is the way it is, and because of the header `<!DOCTYPE NETSCAPE-Bookmark-file-1>` this file format may define its own rules.
Tried to get this fixed in the Python `HTMLParser` some time ago, but it seems there wasn't enough interest: https://github.com/python/cpython/pull/95215
The only other option I can think of is trying a different parser and checking whether it has the desired behavior and similar performance.
Just stumbled over some unwanted modification of special URLs during HTML import. The routing functionality of openstreetmap produces URLs in the form of "https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334&loc=..." Unfortunately the `&cent` is converted to a "cent" character (¢) by the HTML parser/importer. I do not have a fix for this yet, but here is a test to show the problem:
and here is the fault:
As this looks like some kind of bug in HTML decoding (at least the `;` at the end is missing!), this error might occur for other strings forming HTML entities and not only for `&cent`.