w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-fonts-5] Make `unicode-range` syntax suck less #7921

Open LeaVerou opened 1 year ago

LeaVerou commented 1 year ago

Right now unicode-range accepts everything in terms of codepoints. For example:

/* yen, kanji, hiragana, katakana */
unicode-range: U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F;
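
For readers less familiar with the existing syntax, here is a minimal Python sketch of how the tokens in the example above map to inclusive code point ranges. The helper names `parse_urange` and `parse_descriptor` are made up for illustration, and the validation is deliberately simpler than what the spec requires.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # last code point in Unicode

# "U+" followed by hex digits with optional trailing "?" wildcards,
# optionally followed by "-" and a second hex number.
_TOKEN = re.compile(r"[Uu]\+([0-9A-Fa-f]*\?*)(?:-([0-9A-Fa-f]{1,6}))?")


def parse_urange(token):
    """Return the inclusive (start, end) pair for one unicode-range token."""
    match = _TOKEN.fullmatch(token.strip())
    if not match or not 1 <= len(match.group(1)) <= 6:
        raise ValueError(f"not a unicode-range token: {token!r}")
    first, last = match.group(1), match.group(2)
    if last is not None:
        # Interval form, e.g. U+4E00-9FFF; wildcards are not allowed here.
        if "?" in first:
            raise ValueError(f"wildcard in interval: {token!r}")
        start, end = int(first, 16), int(last, 16)
    else:
        # Single code point or wildcard form: "?" stands for any hex digit,
        # so U+30?? covers U+3000 through U+30FF.
        start = int(first.replace("?", "0"), 16)
        end = int(first.replace("?", "F"), 16)
    if start > end or end > MAX_CODE_POINT:
        raise ValueError(f"range out of order or out of bounds: {token!r}")
    return start, end


def parse_descriptor(value):
    """Parse a comma-separated unicode-range value into a list of ranges."""
    return [parse_urange(token) for token in value.split(",")]
```

Calling `parse_descriptor("U+A5, U+4E00-9FFF, U+30??, U+FF00-FF9F")` yields four `(start, end)` pairs, with the wildcard token expanded to `(0x3000, 0x30FF)`.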

This has several problems:

  1. It's hard to read, even if you want to specify specific characters, you need to find their codepoints
  2. Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone
  3. There is no exclusion syntax (all characters in the font minus these), the range needs to be tediously constructed by starting from U+0000 and ending at U+FFFF breaking as needed in between.

LeaVerou commented 1 year ago

Ideas and related work:

  1. It's hard to read, even if you want to specify specific characters, you need to find their codepoints

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";
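
As a rough illustration of the "union of all codepoints in the string" idea, the Python sketch below collapses a string into the ranges it would stand for. The function names are hypothetical, and this models only the proposed semantics, not any shipped behavior.

```python
def string_to_ranges(text):
    """Collapse the code points of `text` into sorted inclusive ranges."""
    points = sorted({ord(ch) for ch in text})  # duplicates and order are ignored
    ranges = []
    for cp in points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the previous range when code points are adjacent.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def format_ranges(ranges):
    """Render ranges in today's U+ notation."""
    parts = []
    for start, end in ranges:
        parts.append(f"U+{start:X}" if start == end else f"U+{start:X}-{end:X}")
    return ", ".join(parts)
```

Under these assumptions, `"&"` corresponds to `U+26`, and a string such as `"cab&a"` corresponds to `U+26, U+61-63`.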

  2. Broader character classes (e.g. Japanese letter, Emoji, Digit) need to be specified manually, which is error-prone

For this there has already been discussion in https://github.com/w3c/csswg-drafts/issues/4573 and just needs spec edits.

  3. There is no exclusion syntax (all characters in the font minus these), the range needs to be tediously constructed by starting from U+0000 and ending at U+FFFF breaking as needed in between.

Perhaps we need some kind of not operator. This can even be just a keyword (not, exclude?) in front of existing value syntax:

<urange> = <urange> | not <urange>

Which would allow things like:

unicode-range: greek, not japanese, not U+A5;

Or a minus operator and a keyword for all characters?

unicode-range: greek except "π";

tabatkins commented 1 year ago

Yeah, I think [ not? [ <urange> | <string> | <script-keyword> ] ]# is pretty reasonable, with strings being equivalent to a range that covers all the codepoints of the string. All positive ranges would be added, then all the negative ranges would be subtracted; I don't think there's a real need to subtract from a particular range.

dbaron commented 1 year ago

From @tabatkins :

All positive ranges would be added, then all the negative ranges would be subtracted

Presumably if there are no positive ranges, then the starting point would be all characters rather than none.
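
The evaluation order discussed here (add all positive ranges, subtract all negative ones, and start from every code point when nothing positive is listed) might be sketched in Python as follows. The `(negated, start, end)` tuple representation and the function names are assumptions made for the example; nothing here is specified anywhere.

```python
MAX_CODE_POINT = 0x10FFFF  # Unicode runs to U+10FFFF, not U+FFFF


def merge(ranges):
    """Sort inclusive ranges and merge the ones that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(ranges, holes):
    """Remove every hole from the given ranges."""
    result = merge(ranges)
    for hole_start, hole_end in holes:
        kept = []
        for start, end in result:
            if hole_end < start or hole_start > end:
                kept.append((start, end))  # no overlap, keep untouched
                continue
            if start < hole_start:
                kept.append((start, hole_start - 1))  # part left of the hole
            if hole_end < end:
                kept.append((hole_end + 1, end))  # part right of the hole
        result = kept
    return result


def resolve(items):
    """items: iterable of (negated, start, end) with inclusive bounds."""
    positives = [(s, e) for negated, s, e in items if not negated]
    negatives = [(s, e) for negated, s, e in items if negated]
    if not positives:
        # Only exclusions were given: start from all characters.
        positives = [(0, MAX_CODE_POINT)]
    return subtract(positives, negatives)
```

For instance, if a `greek` keyword were taken to mean the Greek and Coptic block U+0370-03FF (an assumption, since the keyword is not defined in this thread), then `greek, not "π"` would be `resolve([(False, 0x370, 0x3FF), (True, 0x3C0, 0x3C0)])`, and `not U+A5` alone would be `resolve([(True, 0xA5, 0xA5)])`. Because additions happen before subtractions, the order of items in the list does not matter.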

From @LeaVerou :

unicode-range: greek, not japanese, not U+A5;

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)

LeaVerou commented 1 year ago

Were you expecting that unicode-range: greek, not japanese would do something different from just unicode-range: greek? If so, what? (I'm assuming that the greek and japanese character ranges don't intersect... though maybe that's an incorrect assumption.)

I was just trying to show syntax, but I agree that is a poor example. They definitely don't intersect! The fact that I can't easily think of examples that do intersect probably proves that @tabatkins is right and we don't need to subtract from a particular range.

aphillips commented 1 year ago

You might want to start by looking at what Unicode and ICU have done in this space. For example, the UnicodeSet class in ICU4J is similar to the kinds of "range selection" you're describing here--one can add characters according to various Unicode properties, classes, and scripts to build up ranges, invert ranges, etc.

I think the descriptions in the thread above need to be tighter. Are greek and japanese supposed to be script names, e.g. equivalent to ISO15924 codes like Grek and Jpan? Or are they meant to describe specific character sets, such as the el (Greek) and ja locale exemplary sets in CLDR (such as this one)? These kinds of sets definitely do intersect in various ways (and most languages use at least some of the "common" script--think punctuation). I'll also call out that Unicode runs all the way to U+10FFFF.

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

r12a commented 1 year ago

For this, I'd propose allowing <string> as well, and treating it as specifying the set that is derived from the union of all codepoints in the string. E.g. this would be valid:

unicode-range: "&";

It may also be useful (I haven't thought it through completely) to allow ranges separated by hyphens, like:

unicode-range: "&¡-§©"

which would include the characters &¡¢£¤¥¦§©. You'd need a way of escaping the - character though.
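
A small Python sketch of that in-string range idea, assuming a backslash is the escape character (the comment above leaves the escape mechanism open, so this is only one possibility). The function name `expand_char_class` is hypothetical.

```python
def expand_char_class(text):
    """Expand 'x-y' ranges inside a string into a set of characters.

    A backslash makes the following character literal, so "\\-" is a
    plain hyphen rather than a range operator.
    """
    # First pass: split into (character, was_escaped) tokens.
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            tokens.append((text[i + 1], True))
            i += 2
        else:
            tokens.append((text[i], False))
            i += 1

    # Second pass: an unescaped hyphen between two tokens forms a range.
    chars = set()
    i = 0
    while i < len(tokens):
        first = tokens[i][0]
        if i + 2 < len(tokens) and tokens[i + 1] == ("-", False):
            last = tokens[i + 2][0]
            if ord(last) < ord(first):
                raise ValueError(f"descending range: {first!r}-{last!r}")
            chars.update(chr(cp) for cp in range(ord(first), ord(last) + 1))
            i += 3
        else:
            chars.add(first)
            i += 1
    return chars
```

With this reading, `"&¡-§©"` expands to the nine characters listed above, while `"a\-z"` is just the three characters `a`, `-`, and `z`.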

(Two other cautions about situations where it may be better to stick with code point numbers:

[1] Using characters instead of code point values may cause some difficulty when specifying RTL character sets. For example in

unicode-range: "ذ-خ", "ى", "a-z", "ب-ت";

the underlying order is not what you see (although it could be worse).

[2] You'll probably still want to use code point values for combining characters and invisible characters, and especially for formatting characters such as RLI/LRI etc., which will again make the declaration look odd and hard to edit.)

tabatkins commented 1 year ago

Right, those issues are precisely why I don't think we want to allow string-based ranges, at least not with that syntax. A range(start, end) function could potentially work, if needed. (Tho since all the syntactic characters inside the parens are non-directional it still ends up being very confusingly visibly reordered if viewed in a web-based editor.)

LeaVerou commented 1 year ago

I think ranges are useful, but obviously the token that indicates this is a range would need to be outside the string: e.g. a function like the one @tabatkins described, or even <string> to <string> or <string> - <string>.

r12a commented 1 year ago

the token that indicates this is a range would need to be outside the string

Not necessarily. On the (probably rare) occasion where - has to be specified as a character it could be escaped (like in regular expressions). In fact, this whole thing sounds very much like defining a regular expression, so perhaps that offers an alternative approach to the syntax?

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

LeaVerou commented 1 year ago

I think that makes it harder to read what the range is. I love regex, but it's not exactly known for its readability 😀

That would also allow mixing of characters and code point values, eg. if a range you specify starts with a visible character but ends with an invisible one.

Not sure I follow. If anything it seems to me that doing ranges with syntax outside the string makes this easier.
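
To make that point concrete, here is a hedged Python sketch in which the range syntax lives outside the strings, so each endpoint can independently be a quoted character or a `U+` value. The names `endpoint` and `char_range` are hypothetical, and the quoting rules are far simpler than real CSS string parsing.

```python
import re


def endpoint(value):
    """Accept either a double-quoted single code point or a U+XXXX value."""
    match = re.fullmatch(r"[Uu]\+([0-9A-Fa-f]{1,6})", value)
    if match:
        return int(match.group(1), 16)
    if len(value) == 3 and value[0] == '"' and value[2] == '"':
        return ord(value[1])
    raise ValueError(f"not a valid range endpoint: {value!r}")


def char_range(start, end):
    """Return the inclusive (start, end) code point pair for two endpoints."""
    low, high = endpoint(start), endpoint(end)
    if low > high:
        raise ValueError("range endpoints are in descending order")
    return low, high
```

A range that starts at a visible character and ends at an invisible one could then be written as `char_range('"a"', 'U+200F')`, with no escaping needed inside the string.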

svgeesus commented 2 months ago

@aphillips

I also just had a look at the text mentioned in #4573 located here. Have we (I18N) reviewed this yet?? I think (if I had been the reviewer) I would have proposed issues against the variable width U+ syntax and some of the wording...

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

tabatkins commented 2 months ago

Yeah, if we'd designed it today it would have sucked a whole lot less. That syntax can drink; that syntax has graduated college; that syntax can rent a car without an additional surcharge.

tabatkins commented 2 months ago

or even <string> to <string>

Playing with it a bit myself, unfortunately I think we'd be well-served by using a separator token with strong LTR directionality like to.

If you're trying to denote a range from U+062E (خ) to U+0630 (ذ), you get the following results with a weak directionality vs strong directionality separator:

range("خ" to "ذ")

range("خ", "ذ")

The above two strings are exactly identical save for the separator used, but the bidi algorithm makes the second look like it's in the wrong order.
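
One way to see why the separator matters is to look at the Unicode bidirectional class of each character sitting between the two Arabic letters. The Python sketch below uses the standard `unicodedata` module; the helper name `classes_between` is made up, and the strings use escapes so the source itself is not visually reordered.

```python
import unicodedata

KHAH = "\u062e"  # ARABIC LETTER KHAH
THAL = "\u0630"  # ARABIC LETTER THAL

with_to = f'range("{KHAH}" to "{THAL}")'
with_comma = f'range("{KHAH}", "{THAL}")'


def classes_between(text, first, second):
    """Bidi classes of the characters strictly between `first` and `second`."""
    start = text.index(first) + 1
    end = text.index(second)
    return [unicodedata.bidirectional(ch) for ch in text[start:end]]


print(classes_between(with_to, KHAH, THAL))     # includes the strong class "L"
print(classes_between(with_comma, KHAH, THAL))  # only neutral and weak classes
```

Both letters have class `AL` (strong right-to-left). With the comma, everything between them is neutral or weak (`ON`, `CS`, `WS`), so the bidi algorithm can treat the whole span as one right-to-left run and display it reversed. The letters of `to` have the strong class `L`, which splits the span into separate runs and keeps the two endpoints in their logical order.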

svgeesus commented 2 months ago

My abject apologies, once again for the unicode-range syntax.

"Put it in for now, Chris, until we come up with something better" -- Håkon Wium Lie, spring 1997

On the other hand, at least it wasn't the worst syntax proposed. Feast your eyes on the hex-encoded BMP bitmap:

unicode-range: 0x02037FBC4571000003100C000000100010000300BDF74300000000000

aphillips commented 2 months ago

That descriptor (and most of the spec text) is from CSS2, in 1998 by the way.

I'll remind Martin and Misha that they missed one. 😆