Open LeaVerou opened 2 years ago
Ideas and related work:
- It's hard to read. Even if you only want to specify a few specific characters, you need to look up their codepoints.
For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:
unicode-range: "&";
- Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone
For this there has already been discussion in https://github.com/w3c/csswg-drafts/issues/4573; it just needs spec edits.
- There is no exclusion syntax (all characters in the font minus these); the range needs to be tediously constructed by starting from U+0000 and ending at U+FFFF, breaking as needed in between.
Perhaps we need some kind of not operator. This can even be just a keyword (not, exclude?) in front of the existing value syntax:
<urange> = <urange> | not <urange>
Which would allow things like:
unicode-range: greek, not japanese, not U+A5;
Or a minus operator and a keyword for all characters?
unicode-range: greek except "π";
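To illustrate how tedious the current workaround is, here is a minimal Python sketch that builds the complement ranges by hand for a set of excluded codepoints. The function names are hypothetical, and the upper bound is assumed to be U+10FFFF (the full Unicode range) rather than U+FFFF.

```python
MAX_CODEPOINT = 0x10FFFF  # assumption: cover all of Unicode, not just the BMP


def complement_ranges(excluded):
    """Return inclusive (start, end) ranges covering every codepoint not in `excluded`."""
    ranges = []
    start = 0
    for cp in sorted(set(excluded)):
        if cp > start:
            ranges.append((start, cp - 1))
        start = cp + 1
    if start <= MAX_CODEPOINT:
        ranges.append((start, MAX_CODEPOINT))
    return ranges


def to_unicode_range(ranges):
    """Format ranges using the existing U+ syntax of the unicode-range descriptor."""
    parts = []
    for lo, hi in ranges:
        parts.append(f"U+{lo:X}" if lo == hi else f"U+{lo:X}-{hi:X}")
    return ", ".join(parts)


# Everything except U+A5 (yen sign) and U+3C0 (Greek small letter pi).
declaration = to_unicode_range(complement_ranges([0xA5, 0x3C0]))
print(f"unicode-range: {declaration};")
# prints: unicode-range: U+0-A4, U+A6-3BF, U+3C1-10FFFF;
```

Every excluded codepoint splits one range into two, which is what a not keyword would spare authors from writing out.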
Yeah, I think [ not? [ <urange> | <string> | <script-keyword> ] ]# is pretty reasonable, with strings being equivalent to a range that covers all the codepoints of the string. All positive ranges would be added, then all the negative ranges would be subtracted; I don't think there's a real need to subtract from a particular range.
From @tabatkins :
All positive ranges would be added, then all the negative ranges would be subtracted
Presumably if there are no positive ranges, then the starting point would be all characters rather than none.
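The resolution order discussed here (add all positive terms, subtract all negative terms, and start from all characters when there are no positive terms) can be sketched in Python as below. This is only an illustration of the proposed semantics, not spec text: the `resolve` function, the `(negated, value)` term shape, and the `GREEK` stand-in (the Greek and Coptic block, U+0370 to U+03FF) are all assumptions made for the example.

```python
ALL_CODEPOINTS = range(0x110000)  # U+0000 through U+10FFFF

# Hypothetical stand-in for a `greek` keyword: the Greek and Coptic block.
GREEK = range(0x370, 0x400)


def as_codepoints(value):
    """A string contributes the union of its codepoints; anything else is an iterable of ints."""
    if isinstance(value, str):
        return {ord(ch) for ch in value}
    return set(value)


def resolve(terms):
    """Resolve a list of (negated, value) terms into a set of codepoints."""
    positives = [as_codepoints(v) for negated, v in terms if not negated]
    negatives = [as_codepoints(v) for negated, v in terms if negated]

    # With no positive terms, start from every codepoint rather than none.
    result = set().union(*positives) if positives else set(ALL_CODEPOINTS)
    for excluded in negatives:
        result -= excluded
    return result


# Roughly: unicode-range: greek, not "π";
greek_without_pi = resolve([(False, GREEK), (True, "π")])

# Roughly: unicode-range: not U+A5;
all_but_yen = resolve([(True, [0xA5])])
```

Because subtraction happens after all additions, the order in which terms are written does not change the result.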
From @LeaVerou :
unicode-range: greek, not japanese, not U+A5;
Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)
I was just trying to show syntax, but I agree that is a poor example. They definitely don't intersect! The fact that I can't easily think of examples that do intersect probably proves that @tabatkins is right and we don't need to subtract from a particular range.
You might want to start by looking at what Unicode and ICU have done in this space. For example, the UnicodeSet class in ICU4J is similar to the kinds of "range selection" you're describing here--one can add characters according to various Unicode properties, classes, and scripts to build up ranges, invert ranges, etc.
I think the descriptions in the thread above need to be tighter. Are greek and japanese supposed to be script names, e.g. equivalent to ISO 15924 codes like Grek and Jpan? Or are they meant to describe specific character sets, such as the el (Greek) and ja locale exemplary sets in CLDR (such as this one)? These kinds of sets definitely do intersect in various ways (and most languages use at least some of the "common" script--think punctuation). I'll also call out that Unicode runs all the way to U+10FFFF.
I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...
For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid: unicode-range: "&";
It may also be useful (I haven't thought it through completely) to allow ranges separated by hyphens, like:
unicode-range: "&¡-§©"
which would include the characters &¡¢£¤¥¦§©. You'd need a way of escaping the - character though.
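As a rough illustration of the hyphen idea above, the following Python sketch expands such a string into a set of codepoints. The rules are assumptions for the example only: `a-b` means an inclusive range in codepoint order, and a backslash makes the next character literal (so `\-` is a plain hyphen). Nothing here is proposed syntax.

```python
def expand_char_class(spec):
    """Expand a string such as "&¡-§©" into a set of codepoints (hypothetical rules)."""
    # First pass: resolve backslash escapes into (character, was_escaped) tokens.
    tokens = []
    i = 0
    while i < len(spec):
        if spec[i] == "\\" and i + 1 < len(spec):
            tokens.append((spec[i + 1], True))
            i += 2
        else:
            tokens.append((spec[i], False))
            i += 1

    # Second pass: an unescaped hyphen between two characters denotes a range.
    result = set()
    i = 0
    while i < len(tokens):
        ch = tokens[i][0]
        if i + 2 < len(tokens) and tokens[i + 1] == ("-", False):
            lo, hi = ord(ch), ord(tokens[i + 2][0])
            if lo > hi:
                raise ValueError(f"range out of order: {ch}-{tokens[i + 2][0]}")
            result.update(range(lo, hi + 1))
            i += 3
        else:
            result.add(ord(ch))
            i += 1
    return result


expanded = "".join(chr(cp) for cp in sorted(expand_char_class("&¡-§©")))
print(expanded)  # prints: &¡¢£¤¥¦§©
```

Note that the range is defined by codepoint order in the stored string, which is exactly what becomes hard to see once right-to-left characters are involved.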
(Two other cautions about situations where it may be better to stick with code point numbers:
[1] Using characters instead of code point values may cause some difficulty when specifying RTL character sets. For example in
unicode-range: "ذ-خ", "ى", "a-z", "ب-ت";
the underlying order is not what you see (although it could be worse).
[2] You'll probably still want to use code point values for combining characters and invisible characters, and especially for formatting characters such as RLI/LRI etc which will again make the declaration look odd and hard to edit. )
Right, those issues are precisely why I don't think we want to allow string-based ranges, at least not with that syntax. A range(start, end) function could potentially work, if needed. (Tho since all the syntactic characters inside the parens are non-directional it still ends up being very confusingly visibly reordered if viewed in a web-based editor.)
I think ranges are useful, but obviously the token that indicates this is a range would need to be outside the string. E.g. a function like @tabatkins described, or even <string> to <string> or <string> - <string>.
the token that indicates this is a range would need to be outside the string
Not necessarily. On the (probably rare) occasion where - has to be specified as a character it could be escaped (like in regex expressions). In fact, this whole thing sounds very like establishing a regex expression, so perhaps that offers an alternative approach to the syntax?
That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.
I think that makes it harder to read what the range is. I love regex, but it's not exactly known for its readability 😀
That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.
Not sure I follow. If anything it seems to me that doing ranges with syntax outside the string makes this easier.
@aphillips
I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...
That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.
Yeah, if we'd designed it today it would have sucked a whole lot less. That syntax can drink; that syntax has graduated college; that syntax can rent a car without an additional surcharge.
or even <string> to <string>
Playing with it a bit myself, unfortunately I think we'd be well-served by using a separator token with strong LTR directionality like to.
If you're trying to denote a range from U+062E (خ) to U+0630 (ذ), you get the following results with a weak directionality vs strong directionality separator:
range("خ" to "ذ")
range("خ", "ذ")
The above two strings are exactly identical save for the separator used, but the bidi algorithm makes the second look like it's in the wrong order.
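The difference between the two separators can be checked with Python's standard `unicodedata` module, which reports each character's Unicode bidirectional class. This is a small sketch, not a bidi implementation: the helper names are hypothetical, and it only shows which characters between the two Arabic letters are strong (L) versus weak or neutral (CS, WS, ON).

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_strong_ltr(separator):
    """True if the separator contains at least one strong left-to-right character."""
    return any(unicodedata.bidirectional(ch) == "L" for ch in separator)


start, end = "\u062E", "\u0630"  # U+062E and U+0630, both bidi class AL

for separator in (" to ", ", "):
    source = f'range("{start}"{separator}"{end}")'
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in source)
    print(repr(separator), bidi_classes(separator), has_strong_ltr(separator))
    print(codepoints)  # logical (stored) order, regardless of how it is displayed
```

With ", " everything between the two Arabic letters is weak or neutral, so the bidi algorithm can treat the whole span as one right-to-left run and display the endpoints swapped. The strong L characters in "to" break that run, which is why the first form displays in the expected order.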
My abject apologies, once again, for the unicode-range syntax.
"Put it in for now, Chris, until we come up with something better" -- Håkon Wium Lie, spring 1997
On the other hand, at least it wasn't the worst syntax proposed. Feast your eyes on the hex-encoded BMP bitmap:
unicode-range: 0x02037FBC4571000003100C000000100010000300BDF74300000000000
That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.
I'll remind Martin and Misha that they missed one. 😆
"Put it in for now, Chris, until we come up with something better" -- Håkon Wium Lie, spring 1997
Well hey, turns out it's had a pretty good run, now let's all just come up with something better! 😁
Right now unicode-range accepts everything in terms of codepoints. For example:
This has several problems: