Open r12a opened 1 year ago
The nobdi class name may be better as nochar, or some such.
Btw, i have code that could be adapted to make this work.
Hi @r12a. Having some example code would be good if you've done such a thing before. I'm not sure if there's some API we can use or whether we need to maintain a list of all the characters (to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE; we can perhaps use Intl.Segmenter to split chars as needed, though there's no Firefox support for it). If we need a list, I'm not sure if it'd make sense to bundle it for all users or even how that list would be maintained. It may be better suited as a plugin.
@sidvishnoi The code i use for my own pages may help.
Note, however, that my code has some differences, built in to the way i use it. These include:
spreadsheetRows) - you'll need to use a list derived from the Unicode database (and updated for each Unicode release) - i have one of these at https://github.com/r12a/shared/blob/gh-pages/code/all-names.js - probably best to do this conversion on the server though, given the size of the file (even if it's compacted)
But there's probably a good deal of the algorithm that's useful.
Search for the expandCharMarkup function at https://github.com/r12a/scripts/blob/gh-pages/common29/functions.js
to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE; we can perhaps use Intl.Segmenter to split chars as needed
Not quite sure what you mean here. If you simply want to get a list of the characters in the textContent, that's easy, use the ... operator (eg. charlist = [... charMarkup[i].textContent])
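A minimal sketch of what the spread operator does here, for reference. The helper name toCodePointList is made up for illustration; only the spread itself comes from the comment above. It shows that spreading a string yields one entry per code point, whereas split("") yields one entry per UTF-16 code unit.

```javascript
// Hypothetical helper name; only the spread operator comes from the thread.
function toCodePointList(text) {
  // String iteration (and therefore spread) walks code points,
  // not UTF-16 code units.
  return [...text];
}

// "a", an emoji outside the BMP, and a precomposed e with acute
const sample = "a\u{1F600}\u00E9";
const byCodePoint = toCodePointList(sample);
const byCodeUnit = sample.split("");

console.log(byCodePoint.length); // 3
console.log(byCodeUnit.length); // 4, the emoji is a surrogate pair
```

Note that this splits by code point only; a base character followed by a combining mark still comes out as two entries.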
Don't know whether that helps. Let me know.
Note that i just changed one of the links in the previous comment.
We likely want to go with the approach we use in core/xref here, as the database is too large (1.4MB) to be included in the ReSpec main bundle, i.e., we'll have an endpoint at respec.org and fetch details from there.
I'll try to find some time this month. PRs welcome to respec-web-services as well as ReSpec - having either will help us move forward.
Hello @sidvishnoi . The i18n WG is asking me whether we are able to make progress on this. The full markup is now described in https://www.w3.org/StyleSheets/TR/2021/README.html#unicode-codepoints
@r12a Not at the moment from my side, sorry. I'll have to get free from my daily job (or time-off) to focus on this. Happy to review any pull requests related to this though.
@sidvishnoi i'm just checking whether you are likely to have time to look at this again? Cheers.
@r12a not at the moment unfortunately. Maybe in March as my current work contract ends then.
@sidvishnoi ping ?
I'll try after next Thursday... Sorry to keep you hanging. Would definitely appreciate a PR from community, even if partial though.
Thanks @sidvishnoi . I can't really create any PR, but there's a link to my code, fwiw, above. I also have a (new) list of class names that i use to manage the output at https://r12a.github.io/scripts/template/xx.html#template_codepoints – this may well be far more than we need for W3C docs, but i point to it for what value it may have. I think the key thing is to be able to go from <span class="ch">x</span> or <span class="hx">XXX</span> to the full syntax. hth
Here's the plan:
Create a backend API at respec.org/unicode/names.
API:
// request
{
  query: [
    { codepoint: "0928" },
    { /* ...more */ },
  ],
  options: { /* TBD */ }
}
// response
{
  data: [
    {
      query: { codepoint: "0928" },
      result: { name: "DEVANAGARI LETTER NA" }, // or null if not found
    },
    { /* ...more */ },
  ],
  metadata: { unicodeVersion, lastParsedAt },
}
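To make the request/response shape above concrete, here is a rough client-side sketch. It is an assumption-laden illustration, not the actual ReSpec code: the function names are hypothetical, the use of POST with a JSON body is a guess, and the deployed endpoint may differ from the plan.

```javascript
// Assumed endpoint, taken from the plan above; the deployed service may differ.
const UNICODE_NAMES_ENDPOINT = "https://respec.org/unicode/names";

// Build a request body in the proposed shape: one query entry per code point.
function buildNamesRequest(hexCodepoints) {
  return {
    query: hexCodepoints.map(codepoint => ({ codepoint })),
    options: {}, // options are still TBD in the plan
  };
}

// Turn a response in the proposed shape into a Map of code point -> name.
// Code points that were not found map to null.
function indexNamesResponse(response) {
  const names = new Map();
  for (const { query, result } of response.data) {
    names.set(query.codepoint, result ? result.name : null);
  }
  return names;
}

// Hypothetical wrapper. POST + JSON is an assumption, not part of the plan.
async function fetchUnicodeNames(hexCodepoints, fetchImpl = fetch) {
  const res = await fetchImpl(UNICODE_NAMES_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildNamesRequest(hexCodepoints)),
  });
  if (!res.ok) throw new Error(`Unicode name lookup failed: ${res.status}`);
  return indexNamesResponse(await res.json());
}
```

Batching all code points of a document into a single request keeps this in line with the core/xref approach mentioned earlier.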
[...textContent.trim()].map(e => e.codePointAt(0).toString(16))
The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.
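The one-liner above gives lowercase, unpadded hex. A small sketch of the same conversion that produces the conventional four-digit uppercase form (as used in "0928" in the API example) might look like this. The function name is hypothetical.

```javascript
// Hypothetical helper: text -> array of uppercase hex code points,
// zero-padded to at least four digits.
function toHexCodePoints(text) {
  return [...text.trim()].map(ch =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(toHexCodePoints(" \u00E9 ")); // [ '00E9' ]
console.log(toHexCodePoints("\u0928\u093F")); // [ '0928', '093F' ]
console.log(toHexCodePoints("\u{1F600}")); // [ '1F600' ]
```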
Might be better idea to use a custom element: <respec-unicode hex="HEX" img></respec-unicode> or <respec-unicode hex>HEX</respec-unicode> (it'll replace itself with the right markup). We can then maybe even publish it (in future) as a separate script, without needing to include it in ReSpec core (added benefits like popups with details).
नि as image instead of individual code point images.
respecConfig
I don't think any is needed, but we can probably allow overriding image URLs, something like:
respecConfig.unicode = {
  images: (codePointAsNumberOrGraphemeAsString) => URL
}
My plan is to implement it this Sunday. I've done enough reading to get started now :)
Later, we'd need to parse https://unicode.org/Public/UNIDATA/UnicodeData.txt to get the latest mapping (@r12a I assume you've written a parser for this file?).
I have, indeed, and i update the file as soon as each new Unicode release occurs (it's needed for my own tools, such as UniView). We can decide how to manage updates later. There are a couple of choices.
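For readers unfamiliar with the file, a minimal parsing sketch follows. UnicodeData.txt is semicolon-separated, with the code point in the first field and the character name in the second. The function name is hypothetical, and this is not r12a's parser; it simply skips placeholder entries such as <control> and the First/Last range markers, whose names would need separate handling.

```javascript
// Parse UnicodeData.txt content into a Map of hex code point -> name.
function parseUnicodeData(text) {
  const names = new Map();
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    const [codepoint, name] = line.split(";");
    // "<control>" and "<..., First>" / "<..., Last>" are placeholders,
    // not real character names, so this sketch leaves them out.
    if (!name || name.startsWith("<")) continue;
    names.set(codepoint, name);
  }
  return names;
}

const sampleData = [
  "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
  "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
  "0928;DEVANAGARI LETTER NA;Lo;0;L;;;;;N;;;;;",
].join("\n");

const parsedNames = parseUnicodeData(sampleData);
console.log(parsedNames.get("0928")); // DEVANAGARI LETTER NA
console.log(parsedNames.has("0000")); // false
```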
The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.
I originally used char and hex class names, but it slowed down the content authoring, so i switched to ch and hx. That makes it much faster to type the code (esp. in DreamWeaver, where i just type span.ch[tab] to get <span class="ch">|</span>). So i recommend keeping the shorter forms.
I'm not sure why you suggest nobdi. I can't think of a situation where you'd not want to have bdi. (It's harmless when not needed.)
Not sure what you mean by supporting graphemes, but it's absolutely important to allow a sequence of characters rather than just a single character – eg. <span class="ch">abc</span> should work. Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.
Might be better idea to use a custom element
It's a lot of typing and it's not portable (For example, I'm likely to want to copy (a lot of) stuff between my own stuff and the i18n lreq docs), so i prefer the span & classname approach.
Images. I think we'd want W3C to host them. I think I won't add support for images in first pass.
They are very useful, though – especially when talking about invisible or ambiguous Unicode characters, so i'd encourage you to support them out of the gate if you can. Of course I understand that my own setup is a lot simpler and more efficient than what we'd implement for respec. Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally. (I don't mind if people copy images from my set on GH, but i wouldn't expect documents to pull images from that location.)
That said, i think it is important for people to be able to include images in the document, rather than only characters – especially for ambiguous or invisible characters. (The i18n WG already does this in some of their documents.)
hope that helps
Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code (albeit my function could do with rewriting to simplify it, but the logic is there). (See https://r12a.github.io/scripts/template/xx.html#template_codepoints)
Oh, and if nobdi means 'show only the Unicode name', perhaps it would be better named as 'nameonly' or some such. Most people don't know what bdi is, let alone know that it will appear in the resulting code.
where i just type span.ch[tab] ...
Agreed. This is a strong argument for using classes over custom element.
Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.
This is my concern. Consider <span class="ch img" lang="hi">नि</span>. Will we need to return an image as नि, or as separate characters? Do we have images for all such combined characters somewhere? Also, if नि, then would we need to return images for full words such as नियुक्ति too? I guess it would make sense to use a webfont in that case - but then would ReSpec need to add these webfonts too?
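To illustrate why नि is awkward here: it is two code points but is rendered, and segmented, as one unit. The sketch below uses Intl.Segmenter to show the difference; the function name is hypothetical, and Intl.Segmenter availability depends on the runtime.

```javascript
// Count code points versus grapheme clusters for a piece of text.
function countUnits(text, lang) {
  const segmenter = new Intl.Segmenter(lang, { granularity: "grapheme" });
  return {
    codePoints: [...text].length,
    graphemes: [...segmenter.segment(text)].length,
  };
}

// DEVANAGARI LETTER NA + DEVANAGARI VOWEL SIGN I
const ni = "\u0928\u093F";
console.log(countUnits(ni, "hi")); // { codePoints: 2, graphemes: 1 }
```

So a per-code-point image lookup would need two images for नि, while a per-grapheme lookup would need one image for the combination.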
Support for images for control/invisible characters is reasonable. That's something we can definitely support out of box in first pass.
Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code
With all these classes and special features, I wonder if it would make sense for ReSpec to support it. How about we provide a backend API (via respec.org), and then i18n specs can use a custom preProcess or postProcess plugin to handle this specific logic?
My concern with all these classes is that they're tied to the Unicode expansion plugin, but being classes they're "too global". This is why I was looking to encapsulate it with a custom element. But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as the element gets processed.
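A rough sketch of that scoping idea, working on a plain array of class names rather than real DOM elements. Everything here is illustrative: the function name and the modifier list are assumptions, and hx is used for the hex form (as r12a recommends) where the comment above says .hex.

```javascript
// Assumed class lists, for illustration only.
const BASE_CLASSES = ["ch", "hx"];
const MODIFIER_CLASSES = ["img", "split", "circle", "coda", "init", "nobdi"];

// Modifier classes only count when a base class is present. Returns the
// classes to keep on the element once the shorthand has been processed.
function readShorthandClasses(classNames) {
  const base = BASE_CLASSES.find(c => classNames.includes(c));
  if (!base) return null; // not a shorthand element; leave its classes alone
  const modifiers = classNames.filter(c => MODIFIER_CLASSES.includes(c));
  const remaining = classNames.filter(
    c => c !== base && !MODIFIER_CLASSES.includes(c)
  );
  return { base, modifiers, remaining };
}

console.log(readShorthandClasses(["note", "img"])); // null
console.log(readShorthandClasses(["ch", "img", "note"]));
// { base: 'ch', modifiers: [ 'img' ], remaining: [ 'note' ] }
```

With this, an unrelated element that happens to carry class="img" is never touched.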
Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally.
I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too - I'm hoping we can make a page at W3C or Unicode servers (or even respec.org in worst case) to support that.
Do note that tools like w3c/spec-prod could download these remotely referenced images before publishing to /TR, so hosting them anywhere shouldn't be a problem.
But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as element gets processed.
Yes, that's what i would expect to happen (both points).
I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too
There are a few problems here with hosting images:
would we need to return images for full words such as नियुक्ति too?
In principle, yes, but bear in mind that this is really aimed at single characters or small numbers of characters. Otherwise the following Unicode names grow very long. When i want to show full words i will typically create a figure or another mechanism which hides the character names but allows you to discover them, if needed.
Do we have images for all such combined characters somewhere?
No. That would be a vast collection. I'm proposing that the WG creates or sources just the images it needs, but that the respec authoring would allow them to easily show those images with attached Unicode names.
Seems like a respecConfig.unicode.images function would be the best way to support images then. We can pass the codepoint, the full text (of the shorthand element) as well as a reference to that element as parameters. The WG/spec can store images at a convenient location with file names chosen so that the images function stays simple.
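As a sketch of how simple such a function could be, assuming the parameters are passed in the order just described (hex code point, full text, element) and that the WG stores images under a local images/unicode/ directory named by code point. Both the signature and the path are assumptions, not an implemented ReSpec option.

```javascript
const respecConfig = {
  unicode: {
    // Hypothetical signature: (hex code point, full text of the
    // shorthand element, the element itself) -> image URL.
    images(codepoint, text, element) {
      // Assumed local layout: images/unicode/200B.png, etc.
      return `images/unicode/${codepoint.toUpperCase()}.png`;
    },
  },
};

console.log(respecConfig.unicode.images("200b", "\u200B", null));
// images/unicode/200B.png
```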
@sidvishnoi The I18N WG has been waiting on this feature for ~1 year now. I was actioned with writing to follow up on when this might make an appearance or how we can help with implementation. Can you ping back with an estimate?
Hi, sorry for the long wait. I've started work on it just today finally.
I think we should be able to get a working version within 2-3 weeks, as I try to find time from my regular job - please remember I'm a volunteer here.
I'm not sure about the images part yet, but the invisible characters ones should be there in initial version.
how we can help with implementation
@sidvishnoi Thank you so much! We appreciate your working on this--we understand about the volunteer aspect.
@r12a, @xfq and I can help with CRs.
Just a ping @sidvishnoi. Any progress? Any way we can help?
Hi @aphillips, there's indeed some progress this time, but I couldn't reach completion, as you can see in linked PRs: https://github.com/speced/respec/pull/4748 and https://github.com/speced/respec-web-services/pull/424.
@xfq Can you provide a set of images? I'll add those when I resume work on this, hopefully in one of the coming two weekends.
If someone can add more test cases to https://github.com/speced/respec/pull/4748, that would help encourage me and also set expectations on requirements necessary for an initial release.
Is your feature request related to a problem? Please describe.
The i18n WG is developing recommendations for referring to one or more characters in markup (see https://w3c.github.io/bp-i18n-specdev/#char_ref_template).
The most basic template for the expanded markup is:
This is not complicated, but it's a bit lengthy and fiddly for authors to type in full, especially if a sequence of characters is involved. We'd therefore like to propose a macro that can be used with respec docs to automatically create the full markup from a more concise base.
Describe the solution you'd like
We propose the following expansions, where
the lang attribute is strongly recommended, and has a BCP47 language code as its value
class="hx", and the latter using class="ch"
Examples: [1]
OR
--->
[2]
OR
--->
It may also be useful to have a way of indicating that no bdi element is wanted (although much of the time an image would be useful as a replacement). Maybe something like:
For invisible characters or tricky to display characters (such as certain combining marks), a more complete solution would allow for an image in the expanded markup. For example:
If it's possible to standardise or accept user input wrt the image location, this could be achieved with a shorthand such as the following, where an additional class name of img or svg is used.
(Btw, I can provide a set of images for invisible characters.)
Additional context
Note that there is intentionally no space between </bdi><span>. The gap will be provided by styling (which avoids problems with variable space widths and makes it possible to reduce the gap or change it at scale if needed).
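A string-based sketch of the kind of expansion being requested. The output shape (an outer span with class codepoint, a bdi carrying the lang, then a span with class uname and no space in between) is an assumption based on the i18n template linked above, as is joining multiple names with " + ". The function names and the small name table are made up for illustration.

```javascript
// Hypothetical name table; real names would come from the Unicode
// database or a lookup service.
const NAMES = new Map([
  ["0928", "DEVANAGARI LETTER NA"],
  ["093F", "DEVANAGARI VOWEL SIGN I"],
]);

function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Expand the text of a class="ch" shorthand into the assumed full markup.
function expandCharacters(text, lang, names = NAMES) {
  const unames = [...text].map(ch => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    const name = names.get(hex);
    return name ? `U+${hex} ${name}` : `U+${hex}`;
  });
  const langAttr = lang ? ` lang="${lang}"` : "";
  // No space between </bdi> and <span>: the gap is left to styling.
  return (
    `<span class="codepoint" translate="no">` +
    `<bdi${langAttr}>${escapeHtml(text)}</bdi>` +
    `<span class="uname">${unames.join(" + ")}</span></span>`
  );
}

console.log(expandCharacters("\u0928\u093F", "hi"));
```

For the class="hx" form, the hex values would first be turned into text with String.fromCodePoint before the same expansion is applied.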