kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License

Question on naming conventions of ids starting with a number. #79

Closed hipstermojo closed 4 years ago

hipstermojo commented 4 years ago

It seems that Kuchiki returns an Err from select_first if the id begins with a number. For example, if the HTML contains something like this:

<p id="1">Some foo content</p>

This would be accessed by calling:

let p_node = node_ref.select_first("p#1").unwrap();

However, this just returns an Err. Is this a bug in the way the CSS selector is parsed, or does the CSS spec require ids to start with an alphabetic character?

hipstermojo commented 4 years ago

I'll just go ahead and answer my own question. The HTML5 spec places almost no restrictions on id values (they must be non-empty and contain no whitespace), whereas the HTML4 spec requires them to begin with a letter. So I guess the CSS selector parser follows the older standard.

jdm commented 4 years ago

https://drafts.csswg.org/selectors-3/#id-selectors uses https://www.w3.org/TR/CSS21/syndata.html#value-def-identifier, which says "they cannot start with a digit, two hyphens, or a hyphen followed by a digit".
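
As a rough illustration of the quoted rule, here is a minimal sketch in plain Rust. The function name has_forbidden_start is made up for this example and is not part of kuchiki or any CSS crate. It checks only the three forbidden starts quoted above, not the full identifier grammar.

```rust
/// Hypothetical helper (not a kuchiki API): returns true when `s` starts in
/// one of the three ways the quoted CSS 2.1 wording forbids for an unescaped
/// identifier: a digit, two hyphens, or a hyphen followed by a digit.
fn has_forbidden_start(s: &str) -> bool {
    let mut chars = s.chars();
    // Look at the first two characters only; the rule is about the start.
    match (chars.next(), chars.next()) {
        (Some(c), _) if c.is_ascii_digit() => true,
        (Some('-'), Some('-')) => true,
        (Some('-'), Some(c)) if c.is_ascii_digit() => true,
        _ => false,
    }
}
```

Under this sketch, has_forbidden_start("1") is true, which matches the observation that "p#1" is rejected, while has_forbidden_start("a1") is false.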

SimonSapin commented 4 years ago

The syntax or restrictions on HTML id attributes are entirely separate from those of CSS ID selectors.

The former can have HTML escape sequences, so the ID value for <p id="&quot;"> is one double-quote character.

The latter is # followed by a CSS identifier, which indeed cannot start with an ASCII digit. However, it is possible to write a CSS identifier that represents a value that starts with a digit, by escaping it.

CSS escape sequences are a backslash followed by either the character to be escaped, or by a sequence of hexadecimal digits representing the Unicode code point (followed by an optional space, to separate following digits that are not meant to be part of the escape sequence). To resolve ambiguities, hex digits can only be escaped as their code point value.

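TL;DR: to select <p id="1">, use the selector p#\31, since U+0031 is the ASCII digit 1. In Rust source the backslash has to survive the string literal, so write select_first("p#\\31") or select_first(r"p#\31").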
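
If the id is only known at runtime, the escape can be built programmatically. The following is a minimal sketch under stated assumptions, not something kuchiki provides: escape_leading_digit is a hypothetical helper name, and it handles only a leading ASCII digit. Other characters that need escaping in a CSS identifier (spaces, punctuation, a hyphen followed by a digit) are left untouched.

```rust
/// Hypothetical helper (not a kuchiki API): if `id` starts with an ASCII
/// digit, rewrite that digit as a CSS hex escape of its code point.
/// Everything else is returned unchanged.
fn escape_leading_digit(id: &str) -> String {
    let mut chars = id.chars();
    match chars.next() {
        Some(first) if first.is_ascii_digit() => {
            // Remainder of the id after the first character.
            let rest = chars.as_str();
            // A single space ends the hex escape, so that following
            // characters are not read as more hex digits. It is skipped
            // when nothing follows.
            let sep = if rest.is_empty() { "" } else { " " };
            format!("\\{:x}{}{}", first as u32, sep, rest)
        }
        _ => id.to_string(),
    }
}
```

With this sketch, escape_leading_digit("1") yields \31, so the selector could be assembled as format!("p#{}", escape_leading_digit("1")) and passed by reference to select_first.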

More details: https://mathiasbynens.be/notes/css-escapes

hipstermojo commented 4 years ago

Oh wow! Thanks for clearing this up for me. While it's probably an edge case scenario, I'll be sure to keep this in mind.