jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

`cssSelector` doesn't handle combining characters correctly #1984

Open samshutchins opened 1 year ago

samshutchins commented 1 year ago
    @Test
    void combiningCharactersInIdentifier()
    {
        final String html = """
            <html>
            <head>
            <meta charset="utf-8">
            </head>

            <body>
            <img class="e\u0301" src="/corner.jpg">
            </body>

            </html>""";

        final Document document = Jsoup.parse(html);
        final Elements images = document.getElementsByTag("img");

        final Element img = images.get(0);
        final String cssSelector = img.cssSelector();

        assertEquals("html > body > img.e\u0301", cssSelector);
    }

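The example above uses a combining character (U+0301, the combining acute accent) after a plain "e" to create an é. Emoji make heavy use of similar multi-character sequences: 👨‍👨‍👧‍👧 is made up of 11 Java chars, four surrogate pairs joined by three zero-width joiners (\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67).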
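To make the counts concrete, here is a minimal sketch using only the JDK. The class and method names are made up for illustration and are not part of jsoup. It shows how the two strings from this report look when measured in UTF-16 code units versus Unicode code points.

```java
import java.text.Normalizer;

// Illustrative only: names here are hypothetical and not part of jsoup.
final class CombiningCharacterCounts {
    // Family emoji from the report: four emoji joined by three zero-width joiners (U+200D).
    static final String FAMILY =
        "\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67";

    // Latin small letter e followed by U+0301 COMBINING ACUTE ACCENT.
    static final String E_ACUTE = "e\u0301";

    // Number of UTF-16 code units (Java chars).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; each surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // NFC normalization composes base letter + combining mark where a precomposed form exists.
    static String composed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(FAMILY));            // 11
        System.out.println(codePoints(FAMILY));           // 7
        System.out.println(codeUnits(E_ACUTE));           // 2
        System.out.println(codeUnits(composed(E_ACUTE))); // 1
    }
}
```

Any escaping routine that walks a string one char (or even one code point) at a time sees these as several separate items, which is the situation this report is about.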

I have seen emoji used as CSS class names in the wild, and I think the character-escaping code is doing the wrong thing when cssSelector() is called. It looks like it escapes every character individually, which breaks these combining sequences.

jhy commented 1 year ago

Current jsoup: html > body > img.e\́
Chrome: body > p.e\\u0301

I don't think it's incorrect to emit it as a run of characters, and the selector does work in jsoup. We could improve this by escaping the combining character as a \u escape, as Chrome does.
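As a rough sketch of what escaping the combining character by code point could look like, the following is a hypothetical helper, not jsoup's implementation and not necessarily Chrome's exact output format. It assumes the CSS hex escape form (backslash, hex code point, trailing space) rather than the \u notation quoted above, and it only treats Unicode mark categories as "combining".

```java
// Hypothetical helper for illustration; not part of jsoup's API.
final class CssIdentEscapeSketch {
    // Emits Unicode combining marks as CSS hex escapes and leaves everything else untouched.
    // Assumption: escaping of other special characters is handled elsewhere.
    static String escapeMarks(String ident) {
        StringBuilder out = new StringBuilder();
        ident.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean mark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
            if (mark) {
                // CSS escape: backslash, hex digits, then a space that terminates the escape.
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The class name from the report: "e" followed by U+0301.
        System.out.println("img." + escapeMarks("e\u0301"));
    }
}
```

Note that the zero-width joiner (U+200D) used in emoji sequences is a format character, not a mark, so this sketch would pass emoji class names through unchanged.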