Unwanted modification of special URLs on import

wahlm commented 2 years ago

Just fell over some unwanted modification of special URLs during HTML import. The routing functionality of openstreetmap produces URLs in the form of "https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334&loc=..." Unfortunately the &cent is converted to a "cent" character by the HTML parser/importer. I did not have a fix for this, yet, but here is a test to show the problem:

    def test_import_strange(self):
        html_tags = [
            BookmarkHtmlTag(href='https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334',
                            title='Example title', description='Example description',
                            add_date='1', tags='example-tag'),
        ]
        import_html = self.render_html(tags=html_tags)
        result = import_netscape_html(import_html, self.get_or_create_test_user())

        # Check result
        self.assertEqual(result.total, 1)
        self.assertEqual(result.success, 1)
        self.assertEqual(result.failed, 0)

        # Check bookmarks
        bookmarks = Bookmark.objects.all()
        self.assertEqual(len(bookmarks), 1)
        url_imported = bookmarks.first().url
        url_original = html_tags[0].href
        self.assertEqual(url_original, url_imported, "Imported URL was modified")
        self.assertBookmarksImported(html_tags)

and here is the fault:

AssertionError: 'https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334' != 'https://routing.openstreetmap.de/?z=13¢er=51.098779%2C14.340334'
- https://routing.openstreetmap.de/?z=13&center=51.098779%2C14.340334
?                                       ^^^^^
+ https://routing.openstreetmap.de/?z=13¢er=51.098779%2C14.340334
?                                       ^
 : Imported URL was modified

As this looks like some kind of bug in HTML decoding (at least the ; at the end is missing!), this error might occur for other strings forming HTML entities and not only for "&cent".

sissbruecker commented 2 years ago

Good find! It seems that the Python HTML library maps both ¢ and &cent (https://github.com/python/cpython/blob/3.10/Lib/html/entities.py#L522-L523), which they apparently do for some entities according to the docs: https://docs.python.org/3/library/html.entities.html#html.entities.html5. So not really a bug, but definitely undesired behavior. Needs checking if the Python HTMLParser can be configured to ignore entities that do not close with a semicolon.

Edit: This is actually in the HTML5 spec: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

sissbruecker commented 2 years ago

HTMLParser has a convert_charrefs setting to disable unescaping character data, which for example is applied here: https://github.com/python/cpython/blob/224cd0c3bf2e24f7731fb0b6f31b3839641307d7/Lib/html/parser.py#L161-L164

However the setting is ignored when parsing attribute values 😕: https://github.com/python/cpython/blob/224cd0c3bf2e24f7731fb0b6f31b3839641307d7/Lib/html/parser.py#L324-L325

sissbruecker commented 2 years ago

Looking further into this, there is a related bug report that indicates that in attribute values, character entities not terminated with a semicolon should not be parsed, if they are followed by = or an ASCII character.

AlexanderS commented 1 year ago

Note: The & in the href attribute of the BookmarkHtmlTag should be escaped with &. So this is most likely a bug in the software generating the HTML export.

wahlm commented 1 year ago

I missed to mention that the source for the import was a HTML export from Firefox. I checked again and for every URL (found in a <A HREF="url"> tag containing an ampersand, no escaping to & is done. Whether this is valid and still allowed is controversly discussed and the standards seem to be a bit vague at this point.

Anyway, for any ampersand found at a different position the escaping is done.

So this is the way it is and because of the header <!DOCTYPE NETSCAPE-Bookmark-file-1> this file format may define its own rules.

sissbruecker commented 1 year ago

Tried to get this fixed in the Python HTMLParser some time ago, but it seems there wasn't enough interest: https://github.com/python/cpython/pull/95215

The only other option I can think of is trying a different parser, and check if it had the desired behavior, and similar performance.

sissbruecker / linkding

Unwanted modification of special URLs on import #291