scrapy / cssselect

CSS Selectors for Python
https://cssselect.readthedocs.io/
Other
290 stars 61 forks source link

ID selector syntax #24

Open SimonSapin opened 11 years ago

SimonSapin commented 11 years ago

The spec’ed syntax for ID selectors is # followed by an identifier, not any hash token. See the "type" flag on hash tokens in css-syntax.

Gallaecio commented 5 years ago

I’m not sure if this is a feature request or a bug :sweat_smile:

SimonSapin commented 5 years ago

This is a bug report. The tokenizer code does not match the Selectors specification:

https://drafts.csswg.org/selectors/#id-selectors

An ID selector consists of a “number sign” (U+0023, #) immediately followed by the ID value, which must be a CSS identifier.

The syntax for identifiers has some restrictions. For example it cannot start wit a digit, so 37signals is not a valid identifier.

A # character followed by an identifier is parsed in CSS 2 as a HASH token, in CSS Syntax Level 3 as a <hash-token>. (The two definitions should be equivalent, just written in a different style.) But the hash token is less restrictive than that (because it is also used in e.g. color: #00FF55;).

Some inputs like #37signals do tokenize as a hash token, but are not made of # followed by a valid identifiers (again because of starting with a digit, in this example). So they should not parse as ID selectors. CSS Syntax’s tokenizer algorithm sets a type flag for exactly this purpose.

In Python terms, css_to_xpath("#37signals") should raise a SelectorSyntaxError.

Gallaecio commented 3 years ago

I’ve been looking into this, and I see that a lot of people has gotten bitten by this when trying to style elements with such IDs, which are valid HTML IDs, but for some reason not valid CSS IDs.

Instead of adjusting our code to the standard, I wonder if it would not be better to keep support for this syntax, and instead indicate this explicitly as a feature of cssselect in https://cssselect.readthedocs.io/en/latest/#supported-selectors. Otherwise, this change can break existing selectors, and require a harder-to-read syntax for these cases.

Unless there’s a good reason not to allow this syntax, i.e. a selector that should work 1 way and because of supporting this syntax it works unexpectedly, I really think it would be best to keep the implementation as it is.

@annbgn I’m sorry to suggest this when you have already worked on a patch.