Provide a shortcut for typing character markup

r12a commented 1 year ago

Is your feature request related to a problem? Please describe. The i18n WG is developing recommendations for referring to one or more characters in markup (see https://w3c.github.io/bp-i18n-specdev/#char_ref_template).

The most basic template for the expanded markup is:

<span class="codepoint" translate="no"><bdi lang="xx">&#xXXXX;</bdi><span class="uname">U+XXXX UNICODE_CHARACTER_NAME</span></span>

This is not complicated, but it's a bit lengthy and fiddly for authors to type in full, especially if a sequence of characters is involved. We'd therefore like to propose a macro that can be used with respec docs to automatically create the full markup from a more concise base.

Describe the solution you'd like We propose the following expansions, where

the textContent can be a code point value, eg. 00E9, or a sequence of space-separated values, eg. 0928 093F;
the textContent can be a character, eg. é, or a sequence of characters, eg. नि
the lang attribute is strongly recommended, and has a BCP47 language code as its value
there is no limit on the number of values provided
hex and character values can't be mixed – the former can be requested using class="hx", and the latter using class="ch"
the character name(s) are automatically inserted by respec

Examples: [1]

<span class="hx" lang="fr">00E9</span>

OR

<span class="ch" lang="fr">é</span>

--->

<span class="codepoint" translate="no"><bdi lang="fr">&#x00E9;</bdi><span class="uname">U+00E9 LATIN SMALL LETTER E WITH ACUTE</span></span>

[2]

<span class="hx" lang="hi">0928 093F</span>

OR

<span class="ch" lang="hi">नि</span>

--->

<span class="codepoint" translate="no"><bdi lang="hi">&#x0928;&#x093F;</bdi><span class="uname">U+0928 DEVANAGARI LETTER NA</span> + <span class="uname">U+093F DEVANAGARI VOWEL SIGN I</span></span>
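For reference, the two expansions above can be sketched as a pure string transform. This is only an illustration, not ReSpec code: the `NAMES` table, `toHexList` and `expandShorthand` are hypothetical names, and a real implementation would take the character names from the Unicode Character Database rather than a hard-coded map.

```javascript
// Hypothetical name table covering just the examples in this issue.
const NAMES = new Map([
  ["00E9", "LATIN SMALL LETTER E WITH ACUTE"],
  ["0928", "DEVANAGARI LETTER NA"],
  ["093F", "DEVANAGARI VOWEL SIGN I"],
]);

// Normalise the shorthand text into upper-case hex strings of at least 4 digits.
function toHexList(kind, text) {
  const trimmed = text.trim();
  if (kind === "hx") {
    // Space-separated hex values, e.g. "0928 093F".
    return trimmed
      .split(/\s+/)
      .filter(Boolean)
      .map(h => h.toUpperCase().padStart(4, "0"));
  }
  // "ch": spreading a string iterates by code point, not by UTF-16 code unit.
  return [...trimmed].map(c =>
    c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Build the expanded markup for one shorthand element.
function expandShorthand({ kind, lang, text }, names = NAMES) {
  const hexes = toHexList(kind, text);
  const chars = hexes.map(h => `&#x${h};`).join("");
  const unames = hexes
    .map(h => `<span class="uname">U+${h} ${names.get(h) ?? "UNKNOWN"}</span>`)
    .join(" + ");
  return `<span class="codepoint" translate="no"><bdi lang="${lang}">${chars}</bdi>${unames}</span>`;
}

console.log(expandShorthand({ kind: "hx", lang: "hi", text: "0928 093F" }));
```

With this sketch, the `hx` and `ch` forms of the same character sequence produce identical output, which is the behaviour the examples describe.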

It may also be useful to have a way of indicating that no bdi element is wanted (although much of the time an image would be useful as a replacement). Maybe something like:

<span class="hx nobdi" lang="en">00A0</span>

For invisible characters or tricky-to-display characters (such as certain combining marks), a more complete solution would allow for an image in the expanded markup. For example:

<span class="codepoint" translate="no"><img src="mypath/2003.png" alt="&#x2003;"><span class="uname">U+2003: EM SPACE</span></span>

If it's possible to standardise or accept user input wrt the image location, this could be achieved with a shorthand such as the following, where an additional class name of img or svg is used.

<span class="hx img" lang="ja">2003</span>

(Btw, I can provide a set of images for invisible characters, eg. U+2003 .)

Additional context Note that there is intentionally no space between </bdi> and the <span class="uname"> that follows it. The gap will be provided by styling (which avoids problems with variable space widths and makes it possible to reduce the gap or change it at scale if needed).

r12a commented 1 year ago

The nobdi class name may be better as nochar, or some such.

r12a commented 1 year ago

Btw, i have code that could be adapted to make this work.

sidvishnoi commented 1 year ago

Hi @r12a. Having some example code would be good if you've done such a thing before. I'm not sure if there's some API we can use, or whether we need to maintain a list of all the characters (to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE; we can perhaps use Intl.Segmenter to split chars as needed, though there's no Firefox support for it). If we need a list, I'm not sure if it'd make sense to bundle it for all users or even how that list would be maintained. It may be better suited as a plugin.

r12a commented 1 year ago

@sidvishnoi The code i use for my own pages may help.

Note, however, that my code has some differences, built in to the way i use it. These include:

i can get the language from the context, rather than a parameter
i know where to look for the images - i have my own set - that will need a different solution
i use dedicated character databases to retrieve the name (spreadsheetRows) - you'll need to use a list derived from the Unicode database (and updated for each Unicode release) - i have one of these at https://github.com/r12a/shared/blob/gh-pages/code/all-names.js - probably best to do this conversion on the server though, given the size of the file (even if it's compacted)

But there's probably a good deal of the algorithm that's useful.

Search for the expandCharMarkup function at https://github.com/r12a/scripts/blob/gh-pages/common29/functions.js

to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE; we can perhaps use Intl.Segmenter to split chars as needed

Not quite sure what you mean here. If you simply want to get a list of the characters in the textContent, that's easy, use the ... operator (eg. charlist = [... charMarkup[i].textContent])
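To make the difference between the two approaches concrete, here is a small sketch. The spread operator splits a string by code point, while Intl.Segmenter (where the runtime provides it) splits by grapheme cluster. The helper names `codePoints` and `graphemes` are made up for this example, and the fallback to code points is an assumption about what one might want when Intl.Segmenter is missing.

```javascript
// Split by code point: the string iterator keeps surrogate pairs together
// but separates a base character from its combining marks.
function codePoints(text) {
  return [...text];
}

// Split by grapheme cluster where Intl.Segmenter is available;
// otherwise fall back to code points (an assumption made for this sketch).
function graphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return codePoints(text);
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), s => s.segment);
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT is two code points.
console.log(codePoints("e\u0301").length);
console.log(graphemes("e\u0301").length);
```

For generating one Unicode name per character, the code point split is the one that matters, since each name belongs to a single code point.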

Don't know whether that helps. Let me know.

r12a commented 1 year ago

Note that i just changed one of the links in the previous comment.

sidvishnoi commented 1 year ago

We likely want to go with the approach we use in core/xref here, as the database is quite large (1.4MB) to be included in ReSpec main bundle. i.e., we'll have an endpoint at respec.org and fetch details from there.

I'll try to find some time this month. PRs welcome to respec-web-services as well as ReSpec - having either will help us move forward.

r12a commented 11 months ago

Hello @sidvishnoi . The i18n WG is asking me whether we are able to make progress on this. The full markup is now described in https://www.w3.org/StyleSheets/TR/2021/README.html#unicode-codepoints

sidvishnoi commented 11 months ago

@r12a Not at the moment from my side, sorry. I'll have to get free from my daily job (or time-off) to focus on this. Happy to review any pull requests related to this though.

r12a commented 8 months ago

@sidvishnoi i'm just checking whether you are likely to have time to look at this again? Cheers.

sidvishnoi commented 8 months ago

@r12a not at the moment unfortunately. Maybe in March as my current work contract ends then.

r12a commented 6 months ago

@sidvishnoi ping ?

sidvishnoi commented 6 months ago

I'll try after next Thursday... Sorry to keep you hanging. Would definitely appreciate a PR from the community, even if partial though.

r12a commented 6 months ago

Thanks @sidvishnoi . I can't really create any PR, but there's a link to my code, fwiw, above. I also have a (new) list of class names that i use to manage the output at https://r12a.github.io/scripts/template/xx.html#template_codepoints – this may well be far more than we need for W3C docs, but i point to it for what value it may have. I think the key thing is to be able to go from x or XXX to the full syntax. hth

sidvishnoi commented 5 months ago

Here's the plan:

Create a backend API at respec.org/unicode/names.
- For first pass, I'll simply copy-paste https://github.com/r12a/shared/blob/gh-pages/code/all-names.js as data source.
- Later, we'd need to parse https://unicode.org/Public/UNIDATA/UnicodeData.txt to get the latest mapping (@r12a I assume you've written a parser for this file?). File format for reference: https://www.unicode.org/Public/5.1.0/ucd/UCD.html#UnicodeData.txt
- API:
```
// request
{
  query: [
    { codepoint: "0928" },
    { /* ...more */ },
  ],
  options: { /* TBD */ }
}

// response
{
  data: [
    {
      query: { codepoint: "0928" },
      result: { name: "DEVANAGARI LETTER NA" }, // or null if not found
    },
    { /* ...more */ },
  ],
  metadata: { unicodeVersion, lastParsedAt },
}
```
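A rough sketch of how the backend side of this could look, assuming the data source is text in the UnicodeData.txt format (semicolon-separated fields, with the code point first and the name second). The function names `parseUnicodeData` and `answerQuery` are hypothetical, the sample lines are inlined so the sketch runs offline, and the metadata values are left as placeholders.

```javascript
// A few lines in the UnicodeData.txt format, inlined for the example.
const SAMPLE = [
  "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
  "0928;DEVANAGARI LETTER NA;Lo;0;L;;;;;N;;;;;",
  "093F;DEVANAGARI VOWEL SIGN I;Mc;0;L;;;;;N;;;;;",
].join("\n");

// Build a code point -> name map from UnicodeData.txt-style text.
function parseUnicodeData(text) {
  const names = new Map();
  for (const line of text.split("\n")) {
    const [codepoint, name] = line.split(";");
    // Skip blank lines and bracketed pseudo-names such as <control>
    // or the First/Last range markers.
    if (!name || name.startsWith("<")) continue;
    names.set(codepoint, name);
  }
  return names;
}

// Answer a request of the shape shown above with a response of the same shape.
function answerQuery(request, names) {
  return {
    data: request.query.map(query => {
      const key = query.codepoint.toUpperCase().padStart(4, "0");
      const name = names.get(key);
      return { query, result: name ? { name } : null };
    }),
    // Placeholders: real values would come from the parsed data file.
    metadata: { unicodeVersion: null, lastParsedAt: null },
  };
}

const names = parseUnicodeData(SAMPLE);
console.log(JSON.stringify(answerQuery({ query: [{ codepoint: "0928" }] }, names)));
```

Characters whose names are derived by rule (ranges such as CJK ideographs) are not handled here and would need separate treatment.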
In ReSpec:
1. Get code points with [...textContent.trim()].map(e => e.codePointAt(0).toString(16))
2. De-dupe queries, bulk POST request to above endpoint (similar to xref, use IndexedDB as a cache too), and
3. map elements to names, expanding the shorthand.
4. The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.
 - Might be a better idea to use a custom element to avoid adding so many (global) classes? @marcoscaceres e.g. <respec-unicode hex="HEX" img></respec-unicode> or <respec-unicode hex>HEX</respec-unicode> (it'll replace itself with the right markup). We can then maybe even publish it (in future) as a separate script, without needing to include it in ReSpec core (added benefits like popups with details).
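Steps 1 to 3 on the ReSpec side could be sketched roughly as follows, leaving out the network call and the IndexedDB cache. `buildRequest` and `indexNames` are hypothetical helpers. Note that `toString(16)` alone yields lower-case, unpadded hex (e.g. "928"), so the sketch pads and upper-cases the value to match the "0928" form used in the API example.

```javascript
// Convert one character to the 4+ digit upper-case hex form, e.g. "0928".
function toHex(char) {
  return char.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Steps 1 and 2: collect the code points of every shorthand text and
// de-duplicate them into a single bulk request body.
function buildRequest(texts) {
  const unique = new Set();
  for (const text of texts) {
    for (const char of text.trim()) unique.add(toHex(char));
  }
  return { query: [...unique].map(codepoint => ({ codepoint })), options: {} };
}

// Step 3: turn the response into a lookup table for expanding the shorthand.
function indexNames(response) {
  const names = new Map();
  for (const { query, result } of response.data) {
    names.set(query.codepoint, result ? result.name : null);
  }
  return names;
}

// Two elements sharing U+0928 produce only two queries.
console.log(JSON.stringify(buildRequest(["\u0928\u093F", "\u0928"])));
```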

Images

I think we'd want W3C to host them.
- We can return the URL in above response too. Can cache images indefinitely by adding a hash in filename.
I think I won't add support for images in first pass.
I wonder if we'd need to support images for graphemes @r12a? Like returning नि as image instead of individual code point images.

`respecConfig`

I don't think any is needed, but we can probably allow overriding image URLs, something like:

respecConfig.unicode = {
  images: (codePointAsNumberOrGraphemeAsString) => URL
}

My plan is to implement it this Sunday. I've done enough reading to get started now :)

r12a commented 5 months ago

Later, we'd need to parse https://unicode.org/Public/UNIDATA/UnicodeData.txt to get the latest mapping (@r12a I assume you've written a parser for this file?).

I have, indeed, and i update the file as soon as each new Unicode release occurs (it's needed for my own tools, such as UniView). We can decide how to manage updates later. There are a couple of choices.

The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.

I originally used char and hex class names, but it slowed down the content authoring, so i switched to ch and hx. That makes it much faster to type the code (esp. in DreamWeaver, where i just type span.ch[tab] to get <span class="ch">|</span>). So i recommend keeping the shorter forms.

I'm not sure why you suggest nobdi. I can't think of a situation where you'd not want to have bdi. (It's harmless when not needed.)

Not sure what you mean by supporting graphemes, but it's absolutely important to allow a sequence of characters rather than just a single character – eg. abc should work. Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.

Might be better idea to use a custom element

It's a lot of typing and it's not portable (For example, I'm likely to want to copy (a lot of) stuff between my own stuff and the i18n lreq docs), so i prefer the span & classname approach.

Images. I think we'd want W3C to host them. I think I won't add support for images in first pass.

They are very useful, though – especially when talking about invisible or ambiguous Unicode characters, so i'd encourage you to support them out of the gate if you can.

Of course, I understand that my own setup is a lot simpler and more efficient than what we'd implement for respec. Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally. (I don't mind if people copy images from my set on GH, but i wouldn't expect documents to pull images from that location.)

That said, i think it is important for people to be able to include images in the document, rather than only characters – especially for ambiguous or invisible characters. (The i18n WG already does this in some of their documents.)

hope that helps

r12a commented 5 months ago

Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code (albeit my function could do with rewriting to simplify it, but the logic is there). (See https://r12a.github.io/scripts/template/xx.html#template_codepoints)

r12a commented 5 months ago

Oh, and if nobdi means 'show only the Unicode name', perhaps it would be better named as 'nameonly' or some such. Most people don't know what bdi is, let alone know that it will appear in the resulting code.

sidvishnoi commented 5 months ago

where i just type span.ch[tab]...

Agreed. This is a strong argument for using classes over custom element.

Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.

This is my concern. Consider नि. Will we need to return an image as नि, or as separate characters? Do we have images for all such combined characters somewhere? Also, if नि, then would we need to return images for full words such as नियुक्ति too? I guess it would make sense to use a webfont in that case - but then would ReSpec need to add these webfonts too?

Support for images for control/invisible characters is reasonable. That's something we can definitely support out of the box in the first pass.

Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code

With all these classes and special features, I wonder if it would make sense for ReSpec to support it. How about we provide a backend API (via respec.org), and then i18n specs can use a custom preProcess or postProcess plugin to handle this specific logic?

My concern with all these classes is that they're tied to the Unicode expansion plugin, but being classes they're "too global". This is why I was looking to encapsulate it with a custom element. But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as the element gets processed.
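A sketch of that scoping idea: modifier classes are only honoured when a base class is present, and both are stripped once the element has been handled. It works on a plain class string so it runs outside a browser. The function name and the class lists are assumptions for this sketch, taken from the names proposed earlier in this thread (ch, hx, img, svg, nobdi).

```javascript
// Base classes that mark an element as shorthand, and the modifiers that
// only have meaning alongside one of them (names as proposed in this issue).
const BASES = ["ch", "hx"];
const MODIFIERS = ["img", "svg", "nobdi"];

// Split a class attribute into the shorthand kind, its modifiers, and the
// unrelated classes that should survive on the expanded element.
function parseShorthandClasses(className) {
  const tokens = className.split(/\s+/).filter(Boolean);
  const kind = tokens.find(t => BASES.includes(t)) ?? null;
  if (kind === null) {
    // No base class: leave every class alone, even ones named like modifiers.
    return { kind: null, modifiers: [], remaining: tokens };
  }
  const modifiers = tokens.filter(t => MODIFIERS.includes(t));
  const remaining = tokens.filter(
    t => !BASES.includes(t) && !MODIFIERS.includes(t)
  );
  return { kind, modifiers, remaining };
}

console.log(parseShorthandClasses("hx img note"));
console.log(parseShorthandClasses("img note"));
```

In the second call, `img` is left untouched because no base class is present, which is what keeps an unrelated `class="img"` elsewhere in a spec from being affected.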

Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally.

I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too - I'm hoping we can make a page at W3C or Unicode servers (or even respec.org in worst case) to support that. Do note that tools like w3c/spec-prod could download these remotely referenced images before publishing to /TR, so hosting them anywhere shouldn't be a problem.

r12a commented 5 months ago

But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as element gets processed.

Yes, that's what i would expect to happen (both points).

I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too

There are a few problems here with hosting images:

you need to source the images. I have a set of just over 71,000 images for my own documents, but those don't cover Chinese, Korean, or Tangut blocks (which contain many tens of thousands more characters).
each year when the Unicode Standard is updated it's necessary to create new images for the new characters, but also usually there are additional changes to existing reference character shapes which also need to be updated. That's a lot of work, and we probably don't want images already used for a published spec/document to change by default during these updates anyway.
most people won't use the vast majority of the available images anyway. They'll only need a few per document (though they might use them multiple times).
for my own stuff, i have to have webfonts anyway, so i only use the images for particular cases (mostly invisible / ambiguous, but sometimes for combining marks if the font isn't great – i'm working with many long tail languages). This gives me sufficient control over the rendering that i don't usually need graphics for character sequences. But for W3C specs, i think there will be more interest in using images rather than webfonts for showing the characters. And people will likely want to be able to use images for some sequences, such as नि etc. So being able to create your own images and reference them using a simple syntax seems the best option to me (for the W3C use case).

would we need to return images for full words such as नियुक्ति too?

In principle, yes, but bear in mind that this is really aimed at single characters or small numbers of characters. Otherwise the following Unicode names grow very long. When i want to show full words i will typically create a figure or another mechanism which hides the character names but allows you to discover them, if needed.

Do we have images for all such combined characters somewhere?

No. That would be a vast collection. I'm proposing that the WG creates or sources just the images it needs, but that the respec authoring would allow them to easily show those images with attached Unicode names.

sidvishnoi commented 5 months ago

Seems like a respecConfig.unicode.images function would be the best way to support images then. We can pass the codepoint, the full text (of the shorthand element), as well as a reference to that element as parameters. The WG/spec can store images at a convenient location, with file names chosen so that the images function stays simple.
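As a sketch of what such a callback might look like from the spec author's side: the `images/` path, the `INVISIBLE` set, the convention of returning null for "no image", and the `renderCharacter` helper are all assumptions for illustration, not an agreed API. The image markup is modelled on the expanded example in the original description.

```javascript
// Hypothetical set of code points the WG has drawn images for.
const INVISIBLE = new Set([0x00a0, 0x2003, 0x200b]);

// Hypothetical config: return an image URL, or null to fall back to text.
const respecConfig = {
  unicode: {
    // text and element are unused here, but are part of the proposed signature.
    images(codepoint, text, element) {
      if (!INVISIBLE.has(codepoint)) return null;
      const hex = codepoint.toString(16).toUpperCase().padStart(4, "0");
      return `images/${hex}.png`;
    },
  },
};

// Render one character, preferring an image when the callback supplies one.
function renderCharacter(codepoint, name, config) {
  const hex = codepoint.toString(16).toUpperCase().padStart(4, "0");
  const text = String.fromCodePoint(codepoint);
  const url = config.unicode?.images?.(codepoint, text, null) ?? null;
  const glyph = url
    ? `<img src="${url}" alt="&#x${hex};">`
    : `<bdi>&#x${hex};</bdi>`;
  return `<span class="codepoint" translate="no">${glyph}<span class="uname">U+${hex} ${name}</span></span>`;
}

console.log(renderCharacter(0x2003, "EM SPACE", respecConfig));
```

Naming the files after the hex code point keeps the callback to a couple of lines, which is the kind of simplicity the file-naming suggestion is after.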

aphillips commented 3 months ago

@sidvishnoi The I18N WG has been waiting on this feature for ~1 year now. I was actioned with writing to follow up on when this might make an appearance or how we can help with implementation. Can you ping back with an estimate?

sidvishnoi commented 3 months ago

Hi, sorry for the long wait. I've started work on it just today finally.

I think we should be able to get a working version within 2-3 weeks, as I try to find time from my regular job - please remember I'm a volunteer here.

I'm not sure about the images part yet, but the invisible characters ones should be there in initial version.

how we can help with implementation

Whom can I ask for reviewing the code to ensure it meets the expectations?
I'd love it if someone sent some PRs, even if partial.
Sending some test cases would be very helpful. I'd of course use the ones mentioned in original description.
A contribution to ReSpec's Open Collective would be appreciated to support maintainer efforts.

aphillips commented 3 months ago

@sidvishnoi Thank you so much! We appreciate your working on this--we understand about the volunteer aspect.

@r12a, @xfq and I can help with CRs.

aphillips commented 1 month ago

Just a ping @sidvishnoi. Any progress? Any way we can help?

sidvishnoi commented 1 month ago

Hi @aphillips, there's indeed some progress this time, but I couldn't reach completion, as you can see in the linked PRs: https://github.com/speced/respec/pull/4748 and https://github.com/speced/respec-web-services/pull/424.

@xfq Can you provide a set of images? I'll add those when I resume work on this, hopefully in one of the coming two weekends.

If someone can add more test cases to https://github.com/speced/respec/pull/4748, that would help encourage me and also set expectations on requirements necessary for an initial release.

speced / respec

Provide a shortcut for typing character markup #4462
