rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

List of selectors throwing "Bad CSS Selectors". #77

Open FanaticPythoner opened 1 year ago

FanaticPythoner commented 1 year ago

Here is a list of CSS selectors that should work but instead throw "Bad CSS Selectors". I'm constructing selectolax.parser.HTMLParser with only the HTML to parse, and then calling HTMLParser.css().

.comment-list>li .comment-list>li ol.children .comment-list>li ol.children li .comment-list>li ol.children li:last-child .comment-list>li:not(:last-child) .comment-list>ol .figure-post>*:first-child .figure-post>*:last-child .figure-postmedia>a .figure-postmedia>a img .figure-postmedia>a img .figure-postcontent>*:first-child .figure-postcontent>*:last-child [data-arts-theme-text=light] .widget_nav_menu ul.menu>li a, .arts-elementor-theme-light .widget_nav_menu ul.menu>li a .header_sticky.bg-dark-1, .header_sticky.bg-dark-2, .header_sticky.bg-dark-3, .header_sticky.bg-dark-4, .header_sticky .menu>li>a .input-float__input_focused+.input-floatlabel, .input-floatinput_not-empty+.input-floatlabel .input-floatinput_focused+.input-floatlabel .trp-language-switcher>div>a .menu>li .menu>li:not(:last-child) .menu>li a .menu .menu-item-has-children>a~ul .menu .sub-menu>li .menu .sub-menu>li a .menu .sub-menu>li a .menu-overlay>li .menu-overlay>li>a .menu-overlay .sub-menu>li .menu-overlay .sub-menu>li>a .modal-footer>:not(:first-child) .modal-footer>:not(:last-child) .postcontent>*:first-child, .postcomments>*:first-child, .section-contentheading>*:first-child, .section-contenttext>*:first-child .postcontent>*:last-child, .postcomments>*:last-child, .section-contentheading>*:last-child, .section-contenttext>*:last-child .postcontent ul:not(.wp-block-gallery) li>span, .postcomments ul:not(.wp-block-gallery) li>span, .section-contentheading ul:not(.wp-block-gallery) li>span, .section-contenttext ul:not(.wp-block-gallery) li>span .postcontent ol:not(.comment-list) li>span, .postcomments ol:not(.comment-list) li>span, .section-contentheading ol:not(.comment-list) li>span, .section-contenttext ol:not(.comment-list) li>span .postcontent>ul, .comment-content>ul, .section-contentheading>ul, .section-contenttext>ul .section-masthead[data-arts-os-animation]:not([data-arts-os-animation=animated])>* .section-mastheadmeta-item>* 
[data-arts-theme-text=light]:not([data-arts-header-overlay-theme-text=dark]) .split-text:not(.js-split-text) .has-drop-cap>div:first-child, .arts-elementor-theme-light .split-text:not(.js-split-text) .has-drop-cap>div:first-child [data-arts-theme-text=light]:not([data-arts-header-overlay-theme-text=dark]) .input-floatinput_focused+.input-floatlabel, .arts-elementor-theme-light .input-floatinput_focused+.input-floatlabel .split-text:not(.js-split-text) .has-drop-cap>div:first-child .split-text:not(.js-split-text) .has-drop-cap>div:first-child:after .pt-small.offset_bottom .section-offsetcontent, .pt-small.offset_bottom>.elementor-container .pt-medium.offset_bottom .section-offsetcontent, .pt-medium.offset_bottom>.elementor-container .pt-large.offset_bottom .section-offsetcontent, .pt-large.offset_bottom>.elementor-container .pb-small.offset_top .section-offsetcontent, .pb-small.offset_top>.elementor-container .pb-medium.offset_top .section-offsetcontent, .pb-medium.offset_top>.elementor-container .pb-large.offset_top .section-offsetcontent, .pb-large.offset_top>.elementor-container .widget_nav_menu ul.menu>li .widget_nav_menu ul.menu>li a .widget_nav_menu ul.menu>li a:after, .widget_nav_menu ul.menu>li a:before .widget_nav_menu ul.menu>li a .widget_nav_menu ul.menu>li.menu-item-has-children .widget_nav_menu ul.menu>li.menu-item-has-children a:after .widget_nav_menu ul.sub-menu>li .widget_nav_menu ul.sub-menu>li>a .widget_nav_menu ul.sub-menu>li>a .widget_rss ul>li .widget_rss ul>li:last-child .widget_icl_lang_sel_widget .wpml-ls-legacy-dropdown a, .widget_icl_lang_sel_widget .wpml-ls-legacy-dropdown a, .widget_icl_lang_sel_widget .wpml-ls-legacy-dropdown .wpml-ls-current-language>a .widget_text .textwidget>p

rushter commented 1 year ago

Please put spaces around >, as recommended in the CSS specification.

lexborisov commented 1 year ago

@FanaticPythoner

Please see "white-space in selectors". That said, in the latest lexbor version, combinators work correctly without whitespace.

FanaticPythoner commented 1 year ago

I understand that. However, I strongly believe the library should still handle these cases, since far from everyone follows best practices. Moreover, people like me who obtain these selectors by parsing the CSS files of a given web page have no control over whether the author of those files followed good practices.

FanaticPythoner commented 1 year ago

@rushter @lexborisov

FanaticPythoner commented 1 year ago

In the meantime, for those reading this thread, you can do something like this in your code before calling HTMLParser.css():

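import re

# selector_str holds the selector you are about to pass to HTMLParser.css().
# These are the combinators that are rejected when written without spaces.
chars_to_patch = ['>', '+', '~']

for c in chars_to_patch:
    # Normalize every occurrence to exactly one space on each side,
    # even when only some occurrences in the selector are already spaced.
    # Caveat: this also rewrites these characters inside attribute
    # selectors such as [attr~=value], which breaks them.
    selector_str = re.sub(r'\s*' + re.escape(c) + r'\s*', ' ' + c + ' ', selector_str)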
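
A plain string replacement like the one above also touches '>', '+' and '~' where they are not combinators, for example in [rel~=x] or inside a quoted attribute value. The following is a minimal sketch of a more careful variant, not part of selectolax: `space_combinators` is a hypothetical helper name, and it assumes that anything inside [...], (...) or quotes should be left untouched. That means combinators nested inside :not(...) are not spaced, and backslash escapes are not handled.

```python
def space_combinators(selector: str) -> str:
    """Put single spaces around '>', '+' and '~' outside [...], (...) and quotes."""
    out = []
    depth = 0      # nesting depth of [...] and (...)
    quote = None   # the quote character we are currently inside, if any
    for ch in selector:
        if quote:
            # Inside a quoted string: copy verbatim until the closing quote.
            out.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            out.append(ch)
        elif ch in "[(":
            depth += 1
            out.append(ch)
        elif ch in "])":
            depth = max(depth - 1, 0)
            out.append(ch)
        elif ch in ">+~" and depth == 0:
            # Drop whitespace already emitted before the combinator,
            # then emit the combinator with one space on each side.
            while out and out[-1] == " ":
                out.pop()
            out.append(" " + ch + " ")
        elif ch.isspace() and out and out[-1].endswith(" "):
            continue  # collapse whitespace runs, including after a combinator
        else:
            out.append(" " if ch.isspace() else ch)
    return "".join(out).strip()


print(space_combinators(".menu .menu-item-has-children>a~ul"))
# .menu .menu-item-has-children > a ~ ul
print(space_combinators("a[rel~=x]+b"))
# a[rel~=x] + b
```

The result can then be passed to HTMLParser.css() in place of the original selector string.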

rushter commented 1 year ago

> I understand that. However, I strongly believe the library should still handle these cases, since far from everyone follows best practices. Moreover, people like me who obtain these selectors by parsing the CSS files of a given web page have no control over whether the author of those files followed good practices.

I can't fix it on my side. Adding whitespace with an approach similar to yours is risky, so this needs to be fixed in the Modest engine itself. Is there anything stopping you from using the lexbor backend instead? It's an improved version of the parser, but it's not 100% compatible with Modest.
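
For readers who have not tried the lexbor backend, switching is mostly a matter of importing a different parser class. A minimal sketch, assuming selectolax is installed: `select_texts` is a hypothetical helper name, and the selector is written with spaces so the example does not depend on how a particular lexbor version treats unspaced combinators.

```python
try:
    # The lexbor-backed parser ships in the same package as the Modest one.
    from selectolax.lexbor import LexborHTMLParser
except ImportError:
    LexborHTMLParser = None  # selectolax is not available in this environment


def select_texts(html: str, selector: str) -> list:
    """Return the text of every node matching `selector`, using the lexbor backend."""
    if LexborHTMLParser is None:
        raise RuntimeError("selectolax is not installed")
    tree = LexborHTMLParser(html)
    return [node.text() for node in tree.css(selector)]


if LexborHTMLParser is not None:
    html = '<ul class="menu"><li><a href="#">Home</a></li></ul>'
    print(select_texts(html, ".menu > li > a"))
```

Because the two backends are not fully compatible, it is worth re-checking any code that relies on Modest-specific behavior after switching.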