jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

`SelectorParseException` when calling `Element#cssSelector()` #1966

Closed remi-sf closed 1 week ago

remi-sf commented 1 year ago

Hi,

My team has run into this crash when blindly calling `Element#cssSelector()` on elements.

The stack trace is:

org.jsoup.select.Selector$SelectorParseException: Could not parse query 'ul.sp-c-sport-flyout__inner.gs-u-mb\': unexpected token at '\'

    at org.jsoup.select.QueryParser.findElements(QueryParser.java:226)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:74)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:45)
    at org.jsoup.select.QueryParser.combinator(QueryParser.java:90)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:60)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:45)
    at org.jsoup.select.Selector.select(Selector.java:98)
    at org.jsoup.nodes.Element.select(Element.java:418)
    at org.jsoup.nodes.Element.cssSelector(Element.java:858)

To reproduce this, run the following test case:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Reproduces the SelectorParseException on jsoup 1.15.4.
void test() {
    // The class list of the <ul> includes "gs-u-mb+", which contains a '+' character.
    final String html = "<ul class=\"sp-c-sport-flyout__inner gs-u-mb+ gs-u-display-none@m qa-flyout-primary\"><li class=\"sp-c-sport-flyout__item \" role=\"presentation\"><a class=\"sp-c-sport-flyout__link qa-flyout-primary-item sp-nav-click-stat\" role=\"menuitem\" data-stat-name=\"primary-nav-v2-mobile\" data-stat-title=\"Home\" data-stat-link=\"/sport\" href=\"/sport\">Home</a></li></ul>";
    final Document document = Jsoup.parse(html);
    // This call throws the exception shown above.
    document.getElementsByTag("ul").get(0).cssSelector();
}

The class `gs-u-mb+` is causing the crash, and removing it from the HTML avoids it. I suppose the `+` character is invalid in a CSS class name? If so, this might not really be a bug, and we'll just have to handle the runtime exception in our application.
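For context, `+` is the adjacent-sibling combinator in CSS selector syntax, so a class name that contains it has to be backslash-escaped before it can appear in a selector string. Below is a minimal sketch of that escaping in plain Java. The `CssIdentifierEscaper` class and its `escape` method are hypothetical names made up for illustration. They are not part of jsoup and are not how jsoup resolved this issue. The sketch also simplifies the CSS rules: it ignores identifiers that start with a digit and control characters, both of which need hex escapes.

```java
// Hypothetical helper, not part of jsoup: a simplified take on CSS identifier escaping.
final class CssIdentifierEscaper {
    private CssIdentifierEscaper() {}

    /**
     * Prefixes every character that is not a letter, digit, '-', '_' or
     * non-ASCII with a backslash, so that it is read as part of the
     * identifier rather than as selector syntax.
     * Simplification: leading digits and control characters are not handled.
     */
    static String escape(String identifier) {
        StringBuilder out = new StringBuilder(identifier.length() + 4);
        for (int i = 0; i < identifier.length(); i++) {
            char c = identifier.charAt(i);
            boolean plain = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '_'
                    || c >= 0x80;
            if (!plain) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Class names taken from the HTML in the test case above.
        String[] classNames = {"sp-c-sport-flyout__inner", "gs-u-mb+", "gs-u-display-none@m"};
        for (String className : classNames) {
            System.out.println("." + escape(className));
        }
    }
}
```

With the class names from the test case, this prints `.sp-c-sport-flyout__inner`, `.gs-u-mb\+` and `.gs-u-display-none\@m`. Whether a given jsoup version accepts such escaped selectors is a separate question. The query in the error message above ends in a backslash, which suggests that in 1.15.4 the parser split the query at the `+` even though it had been escaped.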

The HTML comes from the web page in the attached archive: Transfer news live & West Ham in Europa Conference League final - Live - BBC Sport.html.zip

(Reproduced in jsoup 1.15.4)

erfansn commented 7 months ago

I agree. I hit a similar issue when parsing "td:first-child" in our testing environment, but in production everything works fine!

jhy commented 1 week ago

Thanks; this was fixed along with #2146.