w3c / adapt

Semantics to describe user personalization preferences.
https://w3c.github.io/adapt/

Could we build symbolic annotations with existing Web standards? #240

Open DuncanMacWeb opened 1 year ago

DuncanMacWeb commented 1 year ago

I have come across this draft specification only very recently, and I can see that a lot of effort has gone into it. I’ve read through it and have some thoughts around its design, which I’ve put into the form of a proposal for your consideration.

I have approached this by looking at the requirements for symbolic text in the WAI-Adapt specification, and thinking about how these requirements could be met using existing standards. (I have also reviewed the other issues in this repo, as well as the WAI-Adapt and WAI-Personalization mailing lists to see if the topics below have been raised previously, and have not found specific prior discussion related to these ideas.)

I hope the proposal below is helpful, and am keen to hear your thoughts.

Summary

This is a proposal to use existing standards to mark up symbolic content, including ruby text, the `zbl` standard language code for Blissymbols, and the Unicode Universal Character Set.

There is a standard language code for Bliss (`zbl`) and a proposal to encode Blissymbols in the Universal Character Set. Since the symbol values referenced by the WAI-Adapt: Symbols module are the same as Blissymbolics Communication International’s Authorized Vocabulary, and since that contains Bliss-words, this is a proposal to treat these runs of Blissymbols like content in any other language marked up using the Universal Character Set, so that it is accessible and displayable just like any other Web content.

In addition, this specification can restrict the Bliss vocabulary it supports for marking up symbolic content to Bliss-words from the Authorized Vocabulary only, so that these Bliss-words can be used as keys (IDs) to refer to the concepts described by the Authorized Vocabulary.

I propose to make use of the Web’s existing standards for representing content in other languages, and in particular for marking up semantic annotations on individual words or phrases using Ruby text (`<ruby>` and `<rt>`) with the `lang="zbl"` attribute. This approach would allow Web authors to produce content annotated with Blissymbols from BCI’s Authorized Vocabulary, which could then be transformed into the user’s preferred symbol set either by the web page itself, or by the user agent or a browser extension.

The key prerequisite for this is to standardise Blissymbols’ code points in the Universal Character Set through Unicode’s processes.

Proposal

Introduction

This proposal would revise the WAI-Adapt: Symbols Module to make it more widely usable, particularly by users of existing browsers and devices, by building on existing standards. Specifically, the revision would:

- make it possible for users to read content built on it with existing browsers and devices
- build on existing standards, including Unicode, in order to represent Blissymbols as a language alternative
- keep space for custom user agents and browser extensions to improve the user experience around displaying symbolic content and customising the symbol set by applying specific transformations to Bliss content

Prerequisites

The BCP 47 zbl subtag, representing Blissymbols

Content in Blissymbols can already be described using the zbl subtag from the BCP 47 standard, so that we can tag Bliss content using HTML’s lang="zbl" attribute.

Proposal for a subtag to represent BCI’s Authorized Vocabulary (✅ done)

Tagging symbolic content with the ~proposed~ variant subtag `bciav` (`lang="zbl-bciav"`) would allow user agents and browser extensions confidently to identify symbolic content intended to conform to BCI’s Authorized Vocabulary.

~It is proposed to standardise a [variant subtag](https://www.w3.org/International/articles/language-tags/#variants) through the BCP 47 process, to represent Bliss-text which conforms specifically to the controlled vocabulary described in the W3C Alternative and Augmented Communication (AAC) Symbol Registry, which is the same as [Blissymbolics Communication International’s Authorized Vocabulary](https://www.blissymbolics.org/index.php/symbol-files).~ It is proposed to modify the Symbol Registry to refer to each item in the BCI’s Authorized Vocabulary using the Unicode code points corresponding to its Bliss-text, which will be standardised in the Universal Character Set (described below).

~This proposal suggests that~ The string `bciav` ~could be chosen~ [has been standardised](https://mailarchive.ietf.org/arch/msg/ietf-languages/cFA_dwu_nTR6UHo6wJ9GBbnFhH0/) as the BCP 47 variant subtag to represent content which adheres to BCI’s Authorized Vocabulary. ~This would depend on the outcome of the subtag registration process, which consists of sending a suitable [registration template](https://www.iana.org/assignments/lang-subtags-templates/lang-subtags-templates.xhtml) along with a supporting description/explanation and references, to the [ietf-languages mailing list](https://www.ietf.org/mailman/listinfo/ietf-languages) for review.~

Thus, though the example markup in this proposal uses the `zbl` language tag to describe content written in Bliss, ~if `bciav` were registered as a variant subtag,~ this could be replaced with `zbl-bciav` to indicate Blissymbols content conformant with the vocabulary listed in BCI’s Authorized Vocabulary (the same as the W3C AAC Symbol Registry).

This would allow user agents and browser extensions designed to display symbolic content using alternative symbol sets to identify with confidence Blissymbolics containing concepts from the Authorized Vocabulary which could then be transformed into the user’s preferred symbol set. ~(Please note that as the proposed `bciav` subtag is not yet registered in the BCP 47 Subtag Registry, so `zbl-bciav` would not yet be a valid BCP 47 language tag, until `bciav` is so registered.)~

Blissymbols in the Unicode Universal Character Set

This proposal depends on encoding Blissymbols in Unicode’s Universal Character Set. Once Bliss is in Unicode, the W3C AAC Symbol Registry could be keyed by the Unicode code point sequence for each Bliss-concept. This would enable Bliss content to be included in web pages like normal content, and let developers create special ways of inputting Blissymbols ([input method editors](https://www.w3.org/TR/ime-api/#background) and keyboard layouts) like they would do for any other language.

There is an advanced proposal to encode Blissymbols, [N5130 (2020) (pdf)](https://www.unicode.org/L2/L2020/20140-n5130-blissymbols.pdf) and [N5228 (2023) (pdf)](https://www.unicode.org/L2/L2023/23138-n5228-blissymbols.pdf), in the Unicode Universal Character Set (UCS) in the Supplementary Multilingual Plane (Plane 1) [in the range U+16200–167FF](https://www.unicode.org/roadmaps/smp/). See: ScriptSource, [Unicode Status (Blissymbols)](https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=ulc2tfhavw)

If Blissymbols were encoded in this way, the [“Symbol” column in the W3C AAC Symbol Registry](https://www.w3.org/TR/aac-registry/#symbol-registry) could be modified to provide the Unicode code sequence for each concept as the key, instead of the current ID. This would provide several further advantages which would tend to make it easier to input content using symbols:

- it would enable alternative input method editors (IMEs) to be developed using each operating system’s appropriate API. IMEs facilitate composing characters in complex writing systems. Such IMEs could include ones designed to input Bliss-concepts in the Authorized Vocabulary through users:
  - inputting keywords (in English or their native spoken language) to search the vocabulary list
  - selecting symbols in their preferred symbol set, which would be transformed into the correct Blissymbols by the IME
- it would enable conventional keyboard layouts for Bliss (for example, as in [this Unicode proposal for a Bliss keyboard layout (PDF)](http://www.unicode.org/L2/L2020/20271-n5149-blissymbols-kbd.pdf)) to be included in computer operating systems, to enable authors to input content in Bliss directly from their keyboards.
- it would let typographers include Blissymbols in universal font sets, such as the [Noto family](https://notofonts.github.io/), which aim to provide coverage of every writing system in the Universal Character Set.

WAI-Adapt: Symbols Module could then specify that web authors write out symbolic content in Bliss from the BCI Authorized Vocabulary using the Unicode code points for Blissymbols.
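
As a rough illustration of what keying by code point sequence could look like in practice, here is a minimal Python sketch. It assumes the tentative roadmap range quoted above (U+16200–167FF), which is not an assigned Unicode block and may change; the function names are hypothetical and not part of any specification.

```python
# Tentative roadmap range quoted in the proposal; NOT an assigned Unicode block.
PROPOSED_BLISS_RANGE = range(0x16200, 0x167FF + 1)


def in_proposed_bliss_range(text: str) -> bool:
    """True if text is non-empty and every code point lies in the tentative range."""
    return bool(text) and all(ord(ch) in PROPOSED_BLISS_RANGE for ch in text)


def code_point_key(text: str) -> str:
    """Format a string as a 'U+XXXX U+YYYY' sequence, usable as a lookup key."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Placeholder code points chosen only because they fall inside the range.
sample = "\U00016200\U00016201"
print(in_proposed_bliss_range(sample), code_point_key(sample))
```

A tool could use a check like this to decide whether a run of text is a candidate Bliss-word before looking it up in a registry.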

Markup

We can use existing markup standards to mark up symbolic content. Ruby markup (`<ruby>` and `<rt>`) is designed to let us associate individual words with annotations, which can be marked up in a different language (including Blissymbols) using the `lang` attribute. Specialised user agents and browser extensions would be able to display these annotations in the user’s preferred symbol set.

In order to make symbolic markup accessible to users of existing devices and browsers, and to make this specification more widely applicable, I propose that this specification make use of _existing markup standards_, the `lang` attribute with `zbl` to signify Blissymbols, and of the Unicode Universal Character Set instead of numeric indices, to represent the WAI-Adapt symbolic vocabulary. (For related guidance on multilingual markup, see W3C, [Internationalization Best Practices: Specifying Language in XHTML & HTML Content; Using attributes to declare language](https://www.w3.org/International/geo/html-tech/tech-lang.html#ri20030510.102829377).)

This will allow users to access content marked up with Blissymbols using their existing browsers and devices, and could make it easier for publishers to integrate Blissymbols into their existing internationalization and localization systems. Browser extensions and specialised user agents could then transform the Blissymbols into other symbol sets, just as in the current proposal. However, in case users are not viewing the page with a browser or browser extension designed to display the symbolic content, authors would be able to provide an interface allowing users to show or hide the symbolic content and, if they wished, to customise the symbol set shown.

Inline symbols

Ruby markup for annotations of words and short phrases
[Ruby markup](https://www.w3.org/International/articles/ruby/markup) is designed to allow web authors to mark up annotations to be laid out with the main content (above, alongside or interleaved), as a gloss or alternative representation of the main content:

[Rendered example: “How to Make a Good Cup of Tea”, with a placeholder ruby annotation over each of “How”, “to Make”, “a”, “Good”, “Cup” and “of Tea”]

> Ruby is a small-sized, supplementary text attached to a character or a group of characters in the main text. A run of ruby text... indicates the reading or the meaning of those characters
>
> W3C Working Group Note, [Requirements for Japanese Text Layout](https://www.w3.org/TR/jlreq/#usage_of_ruby)

> Ruby may be used ... because the intended readers of the text are still learning the language and are not expected to always know the ... meaning of a term; ... ruby may be used to show the meaning... of a possibly-unfamiliar... word.
>
> Wikipedia, [Ruby character: Uses](https://en.wikipedia.org/wiki/Ruby_character#Uses)

Ruby annotations can be applied to individual words or word combinations. Ruby text is most often used in East Asian (CJK) languages, but can in fact be used with any language. The `<rt>` “ruby text” (annotation) element can accept a `lang` attribute to indicate that the annotation is in a particular language different from the surrounding text.
[Figure: Examples of Latin characters used either in base text or ruby text for Western words. Source: W3C Working Group Note, Requirements for Japanese Text Layout]
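We can use this to mark up content in Bliss using Blissymbols’ Unicode code points, once standardised. This example shows how a run of text could be marked up with the equivalent Bliss-words (placeholders for illustration):

```html
<ruby>
  How<rt lang="zbl">How</rt> to Make<rt lang="zbl">to Make</rt> a<rt lang="zbl">a</rt>
  Good<rt lang="zbl">Good</rt> Cup<rt lang="zbl">Cup</rt> of Tea<rt lang="zbl">of Tea</rt>
</ruby>
```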


The Ruby text could be hidden by default, if the author preferred, using CSS or by applying the `hidden` attribute, so that it could be revealed by a user action or browser extension. Browser extensions or user agents could then substitute symbols from other symbol sets for the Blissymbols. @r12a might be able to comment more on this.
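
To make the "browser extension or user agent" step more concrete, here is a minimal Python sketch of how a tool might collect the Bliss annotations from ruby markup before substituting another symbol set. It uses only the standard library `html.parser`; the class name is hypothetical, the annotation strings `B1` and `B2` are placeholders rather than real Blissymbols, and the sketch assumes explicit `</rt>` end tags (the standard-library parser does not infer omitted ones).

```python
from html.parser import HTMLParser


class RubyAnnotationCollector(HTMLParser):
    """Collect (base text, annotation) pairs where the <rt> is tagged lang="zbl" or "zbl-*"."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs = []            # list of (base, annotation) tuples
        self._in_ruby = False
        self._in_rt = False
        self._rt_is_bliss = False
        self._base = []
        self._annotation = []

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby = True
            self._base = []
        elif tag == "rt" and self._in_ruby:
            lang = (dict(attrs).get("lang") or "").lower()
            self._in_rt = True
            self._rt_is_bliss = lang == "zbl" or lang.startswith("zbl-")
            self._annotation = []

    def handle_endtag(self, tag):
        if tag == "rt" and self._in_rt:
            if self._rt_is_bliss:
                self.pairs.append(
                    ("".join(self._base).strip(), "".join(self._annotation).strip())
                )
            self._in_rt = False
            self._base = []        # the next base segment starts after this annotation
        elif tag == "ruby":
            self._in_ruby = False

    def handle_data(self, data):
        if self._in_rt:
            self._annotation.append(data)
        elif self._in_ruby:
            self._base.append(data)


collector = RubyAnnotationCollector()
collector.feed('<ruby>Cup<rt lang="zbl">B1</rt> of Tea<rt lang="zbl-bciav">B2</rt></ruby>')
print(collector.pairs)
```

Each collected annotation could then be looked up in whatever mapping the tool holds for the user's preferred symbol set.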
Alternative markup for inline Blissymbols
If the author feels that word-by-word annotation (for which Ruby markup is well-suited) is not desired, they may instead wish to add symbolic alternatives to their principal content at the level of the sentence or block (paragraph). Equivalent content in two languages can be laid out in an interleaved fashion by placing the same content side by side with appropriate `lang` attributes, e.g. for English and Bliss (the Bliss text is a placeholder):

```html
<span lang="en">How to Make a Good Cup of Tea</span>
<span lang="zbl">How to Make a Good Cup of Tea</span>
```


An example of this technique can be found in the [W3C Requirements for Japanese Text Layout](https://www.w3.org/TR/jlreq/).
Providing the user with an interface to choose

The W3C Requirements for Japanese Text Layout also includes an example of an interface allowing the user to choose dynamically between English, Japanese, or both languages side-by-side, and uses CSS to show or hide interleaved English and Japanese text. Edit: To be clear, this is just to indicate how Japanese and English content can be presented side-by-side; the mechanism uses JavaScript, and is unrelated to ruby text.

This technique could be applied if the author used either of the two inline markup approaches above — to show or hide Blissymbols as ruby markup or side-by-side text.

Whole-page symbolic equivalent

Inter-page links for whole-page Bliss content

We can use the `<link>` and `<a>` tags to link to equivalent content in other languages, including Blissymbols. The Web already provides a standard mechanism to indicate an alternate language version of a page.

If a web author wishes to publish symbolic equivalents for a whole page, they can publish the same content using Blissymbols text, encoded using Unicode’s Universal Character Set, at a different URL and link to it in a machine-readable fashion, so that the link can be surfaced and read by automated tools like custom user agents and crawlers. The following form indicates that the linked `href` provides equivalent content in the indicated language, using the language code for Blissymbols:

```html
<link rel="alternate" hreflang="zbl" href="…">
```

The same link can be offered as part of the page content so that a user can choose to navigate to it:

```html
<a rel="alternate" hreflang="zbl" href="…">Symbols</a>
```

These forms could be helpful to web authors who:

- already publish their content in multiple languages, and would prefer to provide a symbolic equivalent through Blissymbols in the same way as they do for other languages
- wish to provide all of their content (on a given web page) in a symbolic equivalent, rather than only certain fragments

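
As an illustration of how an automated tool might surface such a link, here is a minimal Python sketch using the standard library `html.parser`. The class name and the sample URLs are hypothetical; the sketch simply looks for `link` or `a` elements with `rel="alternate"` and an `hreflang` of `zbl` or `zbl-*`.

```python
from html.parser import HTMLParser


class BlissAlternateFinder(HTMLParser):
    """Collect href values of alternate-language links that point at Bliss content."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        hreflang = (attributes.get("hreflang") or "").lower()
        is_bliss = hreflang == "zbl" or hreflang.startswith("zbl-")
        if "alternate" in rels and is_bliss and attributes.get("href"):
            self.hrefs.append(attributes["href"])


finder = BlissAlternateFinder()
finder.feed(
    '<link rel="alternate" hreflang="zbl" href="/zbl/tea">'
    '<a rel="alternate" hreflang="fr" href="/fr/tea">Français</a>'
)
print(finder.hrefs)
```
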
r12a commented 1 year ago

> Ruby markup (`<ruby>` and `<rt>`) is designed to let us associate individual words with annotations, which can be marked up in a different language (including Blissymbols) using the lang attribute.

> Ruby text is most used in East Asian (CJK) languages, but can in fact be used with any languages.

Ruby markup and styling has been carefully crafted to cater specifically for CJKM requirements, and is not designed as a general glossing mechanism. The term used for markup ('ruby') describes an annotation method that is primarily (though not exclusively) used for phonetic annotation, rather than semantic annotations, and is found in a specific context. I therefore do not recommend using it for this purpose, but would see its use as a hack/workaround rather than a proper semantic approach.

> The W3C Requirements for Japanese Text Layout also provides an example of an interface allowing the user to choose dynamically between English, Japanese, or both languages side-by-side, and uses CSS to show or hide interleaved English and Japanese text.

Just to be clear, JLReq doesn't use ruby for this (nor do CLReq and TLReq). (I designed the mechanism.) It uses JavaScript.

DuncanMacWeb commented 1 year ago

Thanks, @r12a. I would emphasise that the use of ruby text is not absolutely of the essence of this proposal; the essence is to encode Blissymbolics as runs of Unicode code points in the HTML along with the rest of the content, and use the lang attribute to allow user-agents, etc, to distinguish it from text in the principal language. So if there is a more suitable way of marking up the Bliss content (that you might be aware of) — particularly in a way which also semantically marks it up as an alternate rendering of its corresponding main-language text — then that could potentially be substituted for ruby text. However I’d ask what would be needed to enable the existing ruby markup model to be extended to include this use case, unless there is a compelling reason to avoid that.

> Ruby markup and styling has been carefully crafted to cater specifically for CJKM requirements, and is not designed as a general glossing mechanism.

After raising this issue I also came across previous comments of yours to a similar effect, apologies for missing those:

> please don't assume you can use ruby markup for this ! there may be similarities if you're annotating stuff, but ruby is semantically CJKM specific markup

I understand that much work has gone into making ruby annotations work well in CJKM contexts; so far, ruby text is normally used in a CJK and related linguistic context. But what would enable its being extended to other languages? The W3C docs themselves provide examples of English text being annotated with Japanese or Chinese, and vice versa; I believe this is because English is a common foreign language for language learners whose native languages are related to the CJK group. However, if the second language being used were Bliss (or Arabic, Spanish or Zulu, etc.), then that could be used in place of the English text in those contexts.

This would be an extension of the existing ruby use pattern, but if Bliss content could be used as the base text with CJK content as the ruby text, or vice versa, then why could this technique not be applied to other language pairs? What is to prevent ruby text from being used as a general typographic technique to mark up or typeset text in one language with glyph- or word-level annotations in the same language or another one, where the languages of the base and ruby texts are arbitrary?

One argument might be that the very name “ruby” comes from British typesetting, in which “ruby” denoted runs of very small text. However this would relate to formatting concerns (the domain of CSS); as specified in HTML, ruby is an annotation mechanism, and default rendering can vary depending on the language. For example, it could be decided that a sensible default rendering for ruby text in Bliss, marked up with lang="zbl" — at a purely presentational level — would be inline, at the same size or perhaps even a little larger than the surrounding text, if that makes the most sense.

I appreciate your input, and would just like to understand what reasons might prevent ruby from being used as a general glossing mechanism. Happy to consider alternatives!

r12a commented 1 year ago

> I would emphasise that the use of ruby text is not absolutely of the essence of this proposal

Understood.

> what would be needed to enable the existing ruby markup model to be extended to include this use case, unless there is a compelling reason to avoid that

We have been struggling to get full implementation of the ruby markup model (and therefore some of the styling too) for several years, and we still have a small way to go and the hope that this might soon happen. I would definitely not want to imperil the movement we currently have by going back to the drawing board about aspects of how ruby could work. I think that at this point we just need to get it done asap, without distractions. Sorry.

I wonder whether this would be useful as an alternative to consider: https://r12a.github.io/blog/201708.html#20190304 The examples in that post are probably far more complicated than you'd need, but it should work just as well with just a couple of lines.

DuncanMacWeb commented 1 year ago

I understand what you are saying; it would not be desirable to imperil that movement.

This proposal is looking for a markup model which lets us represent the symbolic alternative as semantically associated with the principal text as an alternative, equivalent reading of it encoded with Unicode code points. That is why ruby seemed to fit the bill, since it can be (and, as I understand, is) used to mark up semantic equivalent text, and is also machine-readable.

Thus, if you are sure that ruby markup isn’t appropriate because it is so specific to East Asian typography, then perhaps this suggests that a new element (or attribute) might be needed. The goal would be to identify an element or attribute that would let user agents/extensions identify symbolic content, while still allowing that symbolic equivalent to be parsed as content and displayed by legacy user agents which don’t understand the markup, and hidden using the `hidden` attribute or CSS if the author so chooses. Because of this, semantically-neutral `<span>`s are unlikely to meet this requirement, unless we can find an attribute which can indicate that its element provides alternative text for its parent.

Suggestion: <alt>

Might there be a place for an <alt> element to indicate alternative content for its parent? For example:

```html
<!DOCTYPE html>
<html lang="en">
  <body>
    <h1>
      Page
      <alt lang="zbl">
        <!-- Bliss for "Page" -->
      </alt>
      Title
      <alt lang="zbl">
        <!-- Bliss for "Title" -->
      </alt>
    </h1>
  </body>
</html>
```

This could be defined semantically to represent an alternative to the preceding sibling content between the start of the parent element, or the previous <alt> tag, and the start of the current <alt> tag.

Suggestion: aria-symbollabel

In certain cases, particularly where text content is already represented by an attribute such as alt, an attribute would fit this scenario better than an element. Since WAI-ARIA already provides a mechanism to represent a Braille alternative, aria-braillelabel, and symbolics as AAC are an accessibility tool, a new ARIA attribute might be considered.

Enter aria-symbollabel, whose content, as above, could be specified as a series of Unicode code points in the Blissymbols range (to be standardised).


Note: I’m aware from the Technology Comparison Summary that an additional ARIA attribute has been considered as one of the alternative technology choices, but I haven’t found any explanations as to why the technologies mentioned there were rejected.

lwolberg commented 1 year ago

Thank you for the detailed description. The topic was discussed at the 1st May 2023 Adapt teleconference, minutes can be found here, https://www.w3.org/2023/05/01-adapt-minutes.html#t04.

We appreciate the work and thought that is going into the Unicode representation of Bliss. We are not sure that this proposal would result in the type of 'primary key' that we require for the registry. (Note that the registry requires such a unique identifier for each row, ensuring that each symbol has a unique referent.)

russellgalvin commented 11 months ago

Thank you for taking the time to put forward your proposal regarding using sequences of the currently proposed Blissymbol Unicode code points instead of the currently used ID as the unique identifier for concepts in WAI-Adapt. We have thought long and hard on this and although we definitely see the value in using an existing widely used standard such as Unicode, we have come to the conclusion that in this situation our users' requirements would not be fulfilled by such an approach. Recall that our use case is to allow web content creators to author using the BCI value as an attribute in web content that can allow users to obtain the appropriate corresponding symbol for a concept (expressed in text) taken from their preferred symbol-set. The Unicode value simply doesn't function as a sufficiently robust database key to support this content annotation scenario. There are two main problems that we foresee with such an approach:

1) The unique key for a concept must remain the same for all time and not change if future developments result in the graphical representation of a concept changing. This can occur, for example, if technological developments result in the pictographic representation of something such as "clock" or "telephone" changing from rotary, analogue devices to digital. Since the Unicode code points map to specific characters and those characters are sequenced to graphically represent a concept, if it is decided that the representation of the concept needs to be changed, the associated sequence of codes will also change. This is not acceptable for a number that is to be used as an immutable unique ID. It is possible that for certain concepts represented by a single code point the glyph can be changed without changing the code but in general, for concepts that are represented by multiple character sequences, this is not the case.

2) As new concepts are added to the vocabulary, a new ID can be assigned to the concept without having actually developed the graphical representation for it. This is not the case with Unicode code sequences of Blissymbols which are tied to the associated character glyphs. If Unicode code sequences were used as the sole ID for concepts, all WAI-Adapt implementors would have to wait until the Blissymbol graphic for a concept is developed before they could begin to use the concept ID. This would clearly not be an ideal situation.

Both of these issues are due to the fact that the Unicode code points are directly coupled to the graphical representation of the concept whereas the ID we require must be independent of the representation.
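
The two concerns above can be sketched with a pair of toy lookup tables in Python. Everything here is hypothetical: the concept IDs, the glosses and the code point sequences are invented placeholders, and the sketch only illustrates the argument as stated, not how Unicode or the registry would actually evolve.

```python
from typing import Optional

OLD_SEQ = "\U00016200\U00016201"  # placeholder sequence for the original symbol
NEW_SEQ = "\U00016200\U00016202"  # placeholder sequence after a redesign


def lookup(registry: dict, key: str) -> Optional[str]:
    """Return the gloss for a key, or None if the key is unknown."""
    row = registry.get(key)
    return row["gloss"] if row else None


# The same concept, keyed two different ways.
by_id = {"concept-1": {"gloss": "telephone", "bliss": OLD_SEQ}}
by_sequence = {OLD_SEQ: {"gloss": "telephone"}}

# Concern 1: the symbol is redesigned, so its character sequence changes.
by_id["concept-1"]["bliss"] = NEW_SEQ            # the ID itself is untouched
by_sequence[NEW_SEQ] = by_sequence.pop(OLD_SEQ)  # the key itself has to change

# Content authored before the redesign still carries the old keys.
id_result = lookup(by_id, "concept-1")   # still resolves
seq_result = lookup(by_sequence, OLD_SEQ)  # no longer resolves

# Concern 2: an ID can be issued before any symbol has been drawn.
by_id["concept-2"] = {"gloss": "new concept", "bliss": None}

print(id_result, seq_result, lookup(by_id, "concept-2"))
```

A sequence-keyed registry could soften the first concern by keeping old sequences as aliases, at the cost of maintaining that alias table.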

I would, however, like to point out that we do intend to use the Unicode representation of Blissymbols in our example registry when they become available. Anyone who wishes to use these code sequences as an alternate ID will be fully able to do so.

Russell Galvin, on behalf of WAI-Adapt

DuncanMacWeb commented 10 months ago

@lwolberg @russellgalvin Thank you for your detailed comments. I agree that these two motivations do have merit, and appreciate your consideration of the proposal.

(As an aside, I have edited the issue summary to document that the bciav variant subtag, denoting the BCI Authorized Vocabulary, has now been standardised.)

annevk commented 9 months ago

I'm not sure the reasons Russell helpfully lays out above are sufficient to close this issue. Unicode has a lot of experience managing code points, including dealing with subtle changes over time, and if they are already in the process of standardizing these I don't think we want to depend on a second registry.

And having to wait a little longer for new code points to enter the space is perhaps not ideal, but I don't think that outweighs the severe cost of several registries for downstream standards to build upon.

I'd therefore like to request this issue to be reopened.

zcorpan commented 9 months ago

I agree with @annevk. To me Unicode seems like the correct place.

andjc commented 9 months ago

Implied in @russellgalvin's description, though, is that the concept is non-textual in nature and tangential to any text or text encoding.

annevk commented 9 months ago

Emoji are non-textual in nature too. And Unicode is not a text encoding. It's a registry. It also defines how that maps to encodings, such as UTF-8, but that's not the primary purpose. Ultimately if we're going to build other standards on top of this that would have to be done on top of Unicode. We don't want multiple registries at the basis of web standards.
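
The distinction between a registered code point and its encodings can be seen with a short Python sketch using the standard library. The choice of the ambulance emoji is only an example; the name and code point come from the Unicode Character Database, and the byte sequences are what the UTF-8 and UTF-16 encoding forms produce for that one code point.

```python
import unicodedata

ch = "\U0001F691"  # one registered code point

name = unicodedata.name(ch)          # the character's registered name
code_point = f"U+{ord(ch):04X}"      # its identity in the registry
utf8_bytes = ch.encode("utf-8")      # one possible byte encoding
utf16_bytes = ch.encode("utf-16-be")  # another encoding of the same code point

print(name, code_point, utf8_bytes.hex(" "), utf16_bytes.hex(" "))
```

The code point stays the same whichever encoding carries it, which is the sense in which it could act as a key.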

nigelmegitt commented 9 months ago

I asked "why not Unicode?" at TPAC also, and the answer given does not seem to be clear in this thread, but convinced me that the best solution may indeed be more complex.

The argument was that there is not a one to one mapping between words and symbols across symbol sets. Different symbol sets express concepts in different ways, with different symbols that may or may not combine together. To offer a choice of symbol set representations to the user, the page author would need to translate their source text separately into each symbol set the user might want. That seems onerous to the page author, and limiting for users: what if the page author didn't choose their preferred symbol set?

I'm not offering a solution here, just restating the argument as I heard it.

DuncanMacWeb commented 9 months ago

Thanks, @annevk @zcorpan. I’m happy to reopen this so that a Unicode-based solution can be considered more fully.

The essential requirement which the AAC registry meets is to provide a canonical list of concepts, which can be combined as authors and users require, and represented graphically using a variety of different symbol sets. The specification envisages that this graphical representation would be done by specialist user-agents, but it could equally be done by browser extensions or a JavaScript library loaded by the web page.

In other words, Bliss symbols are not an absolutely necessary component; rather, what is necessary is the canonical list of concepts (and mapping of each to a key, ID or code point). It makes sense that the Bliss Authorized Vocabulary has been chosen, as I understand that it provides the most comprehensive set of symbols currently available in the AAC space. However, its value to the specification is its list of concepts, not the particular Bliss-symbols to which they correspond.

> the page author would need to translate their source text separately into each symbol set the user might want

The objective of the keys in the AAC Registry is to avoid this need to translate source text into different symbol sets. If alternative codes are chosen, such as Unicode code point sequences, then those codes would be mapped to the corresponding concepts. It is then for implementors — user agents, browser extensions and software libraries — to render these in the most appropriate way for the user. A key advantage of using Unicode code points as keys is that almost all existing software that enables users to consume Internet content is designed to render Unicode, using whatever fonts are available. This would enable use of a much wider set of software and user agents to consume symbolic content (particularly if the symbolic content were marked up as inline text, as noted above in the proposal).

We can map out the options for defining AAC Registry keys as Bliss AV IDs and as Unicode code points:

```mermaid
graph TB
    need(["Need to represent
concepts defined by
Bliss Authorized Vocabulary"])
    mapDirectly["Map to Bliss AV IDs directly"]
    need --> mapDirectly
    need -- "Unicode can map
codepoints to meanings"
    --> action["Map AAC Registry
to Unicode (somehow)"]
    action --> blissAction["Map to Unicode Bliss strings"]
    action --> emojiAction["Map to emoji?"]
    action --> hybridAction["Hybrid approach?"]
    hybridAction --> hybridDescription(("Map to single-character Bliss
+ relevant emoji
+ remaining concepts
in new code block"))
    action --> newblockAction["Encode Bliss AV in new
Unicode block?"]
```

Considering further how Unicode could be used to achieve the goals of the specification, I can see several possibilities.

  1. Map directly to Unicode Bliss strings

This has already been discussed.

  2. Map to emoji

Unicode already encodes some real-world concepts: obvious examples are the regional indicator symbols (pairs of which spell out country codes and are rendered as flags) and the many emoji that have particular meanings, like 🥦 “broccoli”.

Comparing the latest Bliss AV ID to gloss map and the Unicode Full Emoji List, there appears to be substantial overlap between the two lists, particularly for nouns that refer to objects:

Table mapping Bliss AV IDs and glosses to potentially matching emoji code sequences. Covers only Bliss-words in the basic A-C range (12369 to 13620).

| Bliss AV ID | Bliss gloss | Potentially matching emoji |
| ----------- | ----------- | -------------------------- |
| 12369 | ambulance | U+1F691 ambulance |
| 12395 | ant | U+1F41C ant |
| 12405 | apple | U+1F34E red apple<br>U+1F34F green apple |
| 12577 | arm | _U+1F4AA flexed biceps_ |
| 12581 | art | U+1F3A8 artist palette |
| 12592 | atom | U+269B atom symbol |
| 12600 | avocado | U+1F951 avocado |
| 12614 | bacon | U+1F953 bacon |
| 12621 | ball | _U+26BD soccer ball_<br>_U+26BE baseball_<br>_U+1F94E softball_<br>...<br>_U+1F52E crystal ball_<br>_U+1FAA9 mirror ball_ |
| 12622 | balloon | U+1F388 balloon |
| 12623 | banana | U+1F34C banana |
| 12626 | bank | U+1F3E6 bank |
| 12638 | battery | U+1F50B battery |
| 12642 | bean | U+1FAD8 beans |
| 12643 | bear | U+1F43B bear<br>U+1F43B U+200D U+2744 U+FE0F polar bear<br>U+1F9F8 teddy bear |
| 12644 | beard | _U+1F9D4 person: beard_ |
| 12646 | beaver | U+1F9AB beaver |
| 12649 | bed | U+1F6CF bed |
| 12653 | beer | _U+1F37A beer mug_ |
| 12655 | beetle | U+1FAB2 beetle |
| 12662 | bell | U+1F514 bell |
| 12666 | berry | _U+1F353 strawberry_ |
| 12834 | bicycle | U+1F6B2 bicycle |
| 12837 | bird | U+1F426 bird |
| 12846 | birthday | _U+1F382 birthday cake_ |
| 12861 | blood | _U+1FA78 drop of blood_ |
| 12873 | bone | U+1F9B4 bone |
| 12875 | book | _U+1F4D5 closed book_<br>_U+1F4D6 open book_<br>...<br>_U+1F4DA books_ |
| 12876 | boot | _U+1F97E hiking boot_<br>_U+1F462 woman’s boot_ |
| 12893 | brain | U+1F9E0 brain |
| 12905 | broccoli | U+1F966 broccoli |
| 12908 | broom | U+1F9F9 broom |
| 12912 | brush | _U+1F58C paintbrush_<br>_U+1FAA5 toothbrush_ |
| 12916 | bubble | _U+1FAE7 bubbles_ |
| 13095 | butter | U+1F9C8 butter |
| 13109 | calendar | U+1F4C5 calendar |
| 13110 | camel | U+1F42A camel<br>_U+1F42B two-hump camel_ |
| 13111 | camera | U+1F4F7 camera |
| 13112 | camp | _U+1F3D5 camping_ |
| 13116 | Canada | U+1F1E8 U+1F1E6 flag: Canada |
| 13117 | candle | U+1F56F candle |
| 13157 | cheese | _U+1F9C0 cheese wedge_ |
| 13170 | chipmunk | U+1F43F chipmunk |
| 13181 | Christmas | _U+1F384 Christmas tree_ |
| 13346 | cigarette | U+1F6AC cigarette |
| 13367 | cloud | U+2601 cloud |
| 13372 | coconut | U+1F965 coconut |
| 13374 | coin | U+1FA99 coin |
| 13375 | cold | U+1F976 cold face |
| 13384 | comet | U+2604 comet |
| 13413 | corn | _U+1F33D ear of corn_ |
| 13430 | cow | U+1F404 cow |
| 13620 | cucumber | U+1F952 cucumber |
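To make the shape of such a mapping concrete, here is a small sketch that turns a few of the table's `U+XXXX` sequences into actual strings. The `fromUPlus` helper and the `candidates` structure are illustrative assumptions, not part of any specification; the IDs and code points are copied from the table.

```javascript
// Convert "U+XXXX U+YYYY" notation (as used in the table) into a string.
function fromUPlus(notation) {
  const codePoints = notation
    .trim()
    .split(/\s+/)
    .map((token) => parseInt(token.replace(/^U\+/i, ""), 16));
  return String.fromCodePoint(...codePoints);
}

// A few rows copied from the table; the structure itself is illustrative.
// Each Bliss AV ID maps to one or more candidate emoji sequences.
const candidates = new Map([
  [12369, ["U+1F691"]], // ambulance
  [12405, ["U+1F34E", "U+1F34F"]], // apple: red, green
  [12643, ["U+1F43B", "U+1F43B U+200D U+2744 U+FE0F", "U+1F9F8"]], // bear
  [13116, ["U+1F1E8 U+1F1E6"]], // Canada (flag)
]);

for (const [blissId, sequences] of candidates) {
  console.log(blissId, sequences.map(fromUPlus).join(" "));
}
```

Note that several IDs have more than one candidate, and some candidates are multi-code-point sequences, so a mapping of this kind would be neither one-to-one nor single-code-point.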

Unicode has a mechanism (Variation Sequences) to indicate to a renderer how a particular code point should be rendered, where there is a choice — for example, using emoji rendering or monochrome rendering. In other words, some codepoints which are now commonly rendered as emoji already existed (with default monochrome rendering) before emoji were developed in the standard, yet these pre-existing codepoints can now be rendered as emoji.

Thus, variation sequences could be used to indicate that a codepoint should be rendered using an AAC symbol set.
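For reference, this is what the existing text/emoji variation mechanism looks like at the code point level, using U+2601 (cloud) from the table. Whether a renderer honours the selector depends on the platform and fonts, and a selector meaning "render with an AAC symbol set" does not exist today; that part remains hypothetical. The `toUPlus` helper is just for display.

```javascript
const VS15 = "\uFE0E"; // VARIATION SELECTOR-15: request text presentation
const VS16 = "\uFE0F"; // VARIATION SELECTOR-16: request emoji presentation

const cloud = "\u2601"; // U+2601 CLOUD

const textStyle = cloud + VS15;
const emojiStyle = cloud + VS16;

// List the code points of a string in U+ notation.
function toUPlus(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

console.log(toUPlus(textStyle)); // U+2601 U+FE0E
console.log(toUPlus(emojiStyle)); // U+2601 U+FE0F
```

The base code point, and therefore the concept it identifies, is the same in both strings; only the requested presentation differs.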

  3. Encode Bliss AV in a new Unicode block

The latest edition of Bliss AV encodes 6,183 concepts. This would fit comfortably within a new Unicode block of 8,192 codepoints.

This would result in the concepts from the Bliss AV registry being encoded — separately from the Blissymbolics glyphs — in their own Unicode block. I imagine that this would be named more generically such as “AAC symbols”, as it would be logically separating out the registry concepts from the actual written Bliss symbols.

  4. Adopt a hybrid approach

Unicode has a preference to avoid duplication. Since a subset of the concepts in Bliss AV can already be found in Unicode (or are expected soon to be encoded) in the forms of emoji or Blissymbols (for those Bliss AV concepts which can be expressed with one Bliss code point), it may be that these could be omitted from any set of AAC concepts that would be put forward for inclusion in the Universal Character Set.

Regardless of which approach were taken, Input Method Editors (IMEs) and specialist keyboards could be developed to assist with inputting the correct code points. Users and authors should not have to deal with code points directly, so if the Bliss AV concepts were fragmented between emoji, Blissymbols and their own block, this should not be a problem for users.

DuncanMacWeb commented 9 months ago

@annevk is there anyone you would like to bring in who can comment on this from a Unicode perspective?

aphillips commented 9 months ago

I added this to today's (2023-09-21) I18N WG agenda.

r12a commented 9 months ago

The unique key for a concept must remain the same for all time and not change if future developments result in a change to the graphical representation of a concept. This can occur, for example, if technological developments cause the pictographic representation of something such as "clock" or "telephone" to change from rotary, analogue devices to digital. Since the Unicode code points map to specific characters and those characters are sequenced to graphically represent a concept, if it is decided that the representation of the concept needs to be changed, the associated sequence of codes will also change. This is not acceptable for a number that is to be used as an immutable unique ID. It is possible that for certain concepts represented by a single code point the glyph can be changed without changing the code but in general, for concepts that are represented by multiple character sequences, this is not the case.

I was recently on a Bliss-specific Unicode Script Adhoc Committee call (this group vets new proposals for Unicode), with Michael Everson, who is championing the addition of Bliss. My understanding was that (unlike representative glyphs for other code points) the graphic design of Bliss symbols associated with code points would never change, because Bliss is a repertoire of shapes, which are defined carefully against a specific template grid and standardised to exact proportions by the Bliss folks (not Unicode). Michael was quite clear, as I remember it, that the shapes must never change, either in the code charts or in actual use.

The Unicode code points to be added for Bliss are determined by the Bliss community, rather than the Unicode Consortium, but I think Gavin's point is that more flexibility is needed to allow for other symbol repertoires to be used. Adding Unicode code points for all of these systems is likely to be a lengthy and complicated process. It's not clear to me whether it is possible to use the Unicode code points to represent semantics of other repertoires, rather than the symbols themselves, but there would probably be difficulty in dealing with that where Bliss doesn't represent the semantics of another repertoire.

aphillips commented 9 months ago

I18N contacted the Unicode Script Ad Hoc committee about this thread (the mail archive is member-only). They may respond directly or I will relay a response if that turns out to be more appropriate. If I understand correctly, the current status of the Bliss proposal is that Unicode is waiting for the resolution of some IP issues.

russellgalvin commented 9 months ago

[the Bliss vocabulary's] value to the specification is its list of concepts, not the particular Bliss-symbols to which they correspond.

Yes, exactly.

And what Michael Everson said about the Bliss characters never changing is absolutely correct. However, the relevant point with respect to the registry is that the characters do not map one-to-one with the concepts. There are many instances where one character does represent a concept but the majority of concepts are represented by multiple characters being combined together - in other words, a string of code points similar to a word in a textual language.

In addition, in the past, there have been instances where the combination of characters used to represent a particular concept is changed for one reason or another. Using the code point combination as the ID would mean that the ID for the concept changes which is not desirable. But perhaps Unicode Variation Sequences could deal with this issue.
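A small sketch of the stability concern, under stated assumptions: the registry shape is invented, the ID 17511 ("tea") is borrowed from an example elsewhere in this thread, and the "spellings" use Private Use code points as placeholders because Bliss code points are not yet assigned.

```javascript
// Illustrative registry: the integer ID is the stable key, and the Bliss
// "spelling" (a sequence of code points) is just one attribute of the entry.
// The code points below are Private Use placeholders, not real Bliss characters.
const registry = new Map([
  [17511, { gloss: "tea", spelling: "\u{F0001}\u{F0002}" }],
]);

const oldSpelling = registry.get(17511).spelling; // what a page might have used as a key
const pageMarkupKey = 17511; // what a page using integer IDs would have used

// Later, the community revises how the concept is composed.
registry.get(17511).spelling = "\u{F0001}\u{F0003}";

// Index the registry by its current spellings.
const bySpelling = new Map(
  [...registry.values()].map((entry) => [entry.spelling, entry.gloss])
);

console.log(registry.has(pageMarkupKey)); // true: the integer key still resolves
console.log(bySpelling.has(oldSpelling)); // false: the old spelling no longer does
```

A spelling-based scheme could keep old spellings as aliases, but that is extra registry machinery that an opaque ID does not need.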

what is necessary is the canonical list of concepts (and mapping of each to a key, ID or code point)

True - however, as stated above there cannot be a one-to-one mapping of concepts to (individual) code points for Bliss. What would be ideal would be if there existed a database of all concepts that could be mapped to. There have been attempts at this and it is not a trivial problem - WordNet comes to mind. The Bliss vocabulary represents a practical solution within the AAC domain that is acceptable to W3C copyright requirements.

The other issue is that the vocabularies of other symbol sets are not all subsets of the Bliss vocabulary. Even if they are smaller, there may be a large amount of overlap but there will usually be some symbols that are not in the Bliss vocabulary. We don't want Bliss to be a bottleneck to the usage of these other symbol sets which it would be if the requirement was that a Bliss version of the concept had to be created in order for the ID to exist. This would be the case if Unicode code points are used and a Bliss "spelling" has to be created before an ID can be assigned (since the ID would be the spelling).

In Duncan's diagram of options the "Map AAC Registry to Unicode" option is an interesting one as it occurs to me that the CJK block is a very large block of ideographs that could possibly map to the majority, if not all, of the Bliss vocabulary. However, I am not knowledgeable enough in those languages to know how successful that would be. It would certainly be a significant undertaking and there would be many language nuances that would have to be dealt with. From the point of view of resources available in the AAC world - certainly in the Bliss world - I don't think it would be a practical solution.

Another of Duncan's options - "Encode Bliss AV in new Unicode block" is something that has been considered and rejected by Michael Everson and others (such as myself) who have been involved in this. The reason this is not very satisfactory (speaking for myself) is that the Bliss vocabulary is constantly growing so it would require regular updates which requires regular (annual?) submissions to Unicode and again, this requires resources that we just don't have. The CJK block represents languages used by billions of people so resources for regular updates are not a problem. Also, it is much preferable to have the flexibility of a character-composable language as opposed to having to wait until Unicode is updated before you can use a new symbol.

aphillips commented 9 months ago

In Duncan's diagram of options the "Map AAC Registry to Unicode" option is an interesting one as it occurs to me that the CJK block is a very large block of ideographs that could possibly map to the majority, if not all, of the Bliss vocabulary.

Unicode has an entire plane dedicated to private use (plus a large block in the Basic Multilingual Plane). If you want to map entries in an AAC registry to something in Unicode in this manner, you should really use that mechanism. Mapping AAC concepts to *unrelated* ideographs (or any other assigned characters) would introduce all manner of problems.

The term "unrelated" is highlighted because I don't object to mapping like-to-like, e.g. "smiley face" already exists in Unicode as 😄 and mapping a "smiley face" glyph to it might be fine for that purpose. But I would be unhappy if you mapped a concept to, say, 响 (U+54CD, chosen randomly), just because that character was next on some list.
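As a sketch of what the private-use route could look like, the following maps a zero-based registry index into Supplementary Private Use Area-A (U+F0000..U+FFFFD). The layout and both helper names are assumptions made for illustration; nothing in this thread specifies them, and private-use code points only carry meaning by agreement between the parties using them.

```javascript
// Supplementary Private Use Area-A spans U+F0000..U+FFFFD (65,534 code points).
const PUA_A_START = 0xf0000;
const PUA_A_END = 0xffffd;

// Hypothetical layout: registry entry number `index` lives at PUA_A_START + index.
function indexToPrivateUse(index) {
  const codePoint = PUA_A_START + index;
  if (!Number.isInteger(index) || index < 0 || codePoint > PUA_A_END) {
    throw new RangeError(`index ${index} does not fit in the private use range`);
  }
  return String.fromCodePoint(codePoint);
}

// Inverse of indexToPrivateUse for a single private-use character.
function privateUseToIndex(str) {
  const codePoint = str.codePointAt(0);
  if (codePoint < PUA_A_START || codePoint > PUA_A_END) {
    throw new RangeError("not a Supplementary Private Use Area-A code point");
  }
  return codePoint - PUA_A_START;
}

// 6,183 concepts (the figure quoted earlier in the thread) fit with room to spare.
const key = indexToPrivateUse(6182); // last entry, zero-based
console.log(key.codePointAt(0).toString(16).toUpperCase()); // F1826
console.log(privateUseToIndex(key)); // 6182
```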

cookiecrook commented 9 months ago

There are many instances where one character does represent a concept but the majority of concepts are represented by multiple characters being combined together - in other words, a string of code points similar to a word in a textual language.

Perhaps we could think of pictorial concepts not like phonetic or word glyphs, but similar to emoji variant combinations… Perhaps somewhat like ligatures too, where the font can replace a set of characters with a single glyph. "f"+"i" becomes an "fi" ligature glyph… though that glyph probably doesn't render here in the GitHub comment.

As an example, the emoji representing "female, darker-skinned, doctor" ("👩🏾‍⚕️") is multiple unicode entries for those concepts individually.

So although it looks like a single glyph/character, I think the concepts in pictorial languages could be implemented similarly.

For example, this JavaScript changes the gender of our doctor by replacing one of the three characters underneath the single graphical representation.

> "👩🏾‍⚕️".replace("👩", "👨")
< "👨🏾‍⚕️"

In case the first example renders poorly, here's a screen shot of what I see. [Image: screen shot of the previous code example]

Likewise, this one changes the tonal variant, but leaves the unicode representations of "female" and "doctor":

> "👩🏾‍⚕️".replace("🏾‍", "🏼")
< "👩🏼‍⚕️"

In case the second example renders poorly, here's a screen shot of what I see. [Image: screen shot of the previous code example]

I'm not sure about all of the others' suggestions, but I think this or a similar pattern could be used to represent BlissSymbolics, ARASAAC, or other pictorial symbol sets if more come about in the future.
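Because the joiner and variation selector in these sequences are invisible, it can help to spell the same example out with escapes. This is only a restatement of the emoji example in this comment, not a proposal for how Bliss or ARASAAC would be encoded; the constant names are mine.

```javascript
// "woman health worker: medium-dark skin tone", written with escapes so the
// invisible joiner and variation selector are visible in the source.
const WOMAN = "\u{1F469}";
const MAN = "\u{1F468}";
const MEDIUM_DARK = "\u{1F3FE}"; // skin tone modifier
const MEDIUM_LIGHT = "\u{1F3FC}"; // skin tone modifier
const ZWJ = "\u200D"; // ZERO WIDTH JOINER
const MEDICAL = "\u2695\uFE0F"; // medical symbol + emoji variation selector

const doctor = WOMAN + MEDIUM_DARK + ZWJ + MEDICAL;

// Iterating by code point (not UTF-16 code unit) shows the underlying parts.
const parts = Array.from(doctor, (ch) =>
  ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(parts.join(" ")); // 1F469 1F3FE 200D 2695 FE0F

// Swap one component while leaving the rest of the sequence intact.
const maleDoctor = doctor.replace(WOMAN, MAN);
const lighterDoctor = doctor.replace(MEDIUM_DARK, MEDIUM_LIGHT);
console.log(maleDoctor, lighterDoctor);
```

One thing this makes visible: the replacement has to leave the zero width joiner in place, otherwise the sequence falls apart into separate glyphs.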

cookiecrook commented 9 months ago

"Encode Bliss AV in new Unicode block" is something that has been considered and rejected by Michael Everson and others (such as myself) who have been involved in this. The reason this is not very satisfactory (speaking for myself) is that the Bliss vocabulary is constantly growing so it would require regular updates which requires regular (annual?) submissions to Unicode and again, this requires resources that we just don't have.

That's the case with other blocks too. Emoji again, is an example that gets regular updates...

Perhaps this is unfair, but I read your comment as, "I don't want to have to maintain it" which I don't see as a good reason. If that's not what you meant, please help me understand the distinction.

Since Bliss and ARASAAC are being actively maintained and updated ("constantly growing" as you say), how is it a problem that the update pipeline should go through to the Unicode standardization process on some regular cadence? Sure it takes a little more time, but it seems worth the effort.

cookiecrook commented 9 months ago

Apologies if this had previously been covered, but does "zbl" include only Bliss to the exclusion of the overlapping ARASAAC symbol set? Assuming the goal is language-independent communication, why shouldn't the Unicode block (and BCP 47 "lang" value) be a superset (Bliss+ARASAAC+others) of pictorial, ideographic concepts? I get that Bliss and ARASAAC were developed independently and are syntactically distinct, but it still seems like the conceptual registry could be a shared superset.

cookiecrook commented 9 months ago

I agree that Ruby (see collapsed example in OP) would be a great way to approach this, and perhaps even polyfill-prototyped (short term solution) with a custom CSS font that included all the Bliss glyphs for private use block entries.

r12a commented 9 months ago

I agree that Ruby (see collapsed example in OP) would be a great way to approach this,

Did you read the earlier part of this thread, specifically https://github.com/w3c/adapt/issues/240#issuecomment-1488956770 ff ?

russellgalvin commented 9 months ago

Re James Craig:

Perhaps this is unfair, but I read your comment as, "I don't want to have to maintain it" which I don't see as a good reason. If that's not what you meant, please help me understand the distinction.

In analogy, if you have the choice of encoding English as an alphabet or as a dictionary, which would you choose? That is the choice we're making here.

This is for the Unicode encoding of Bliss. However, the use of Bliss in the Adapt Registry requires the dictionary approach. It would not make sense to change the Unicode encoding approach in order to satisfy the Registry requirements.

andjc commented 9 months ago

Re James Craig:

In analogy, if you have the choice of encoding English as an alphabet or as a dictionary, which would you choose? That is the choice we're making here.

It depends on whether you want a dictionary or a registry.

A dictionary would be once off. It gets built, future needs are irrelevant. It is a snapshot of what is currently thought to be needed.

A registry would have processes for submitting new concepts as needed and a process to review them. Registries tend to grow over time. There would be ongoing maintenance.

russellgalvin commented 9 months ago

The point is that you never have to update an alphabet whereas a dictionary has to be regularly updated with new vocabulary. There is no thought that dictionaries of the English language would ever be encoded in Unicode. At least as far as I know.

andjc commented 9 months ago

@russellgalvin actually alphabets do get updated. But that is neither here nor there.

From what has been discussed so far, and from your comments, it is not textual data, and will not be textual data. So Unicode isn't needed. In that sense bliss and emojis exist in a different paradigm to this.

What you seem to be skirting around is the need for a higher level protocol. And possibly dedicated semantic markup.

It also seems to need a registry and a maintenance authority for the ongoing work on the registry.

cookiecrook commented 8 months ago

@r12a wrote:

I agree that Ruby (see collapsed example in OP) would be a great way to approach this,

Did you read the earlier part of this thread, specifically #240 (comment) ff ?

Yes. I agree Ruby has a shortcoming with regards to that issue, but I think it is resolvable.

More on that here: https://github.com/w3c/aria/issues/1620#issuecomment-954501658, and I've filed a new issue to track that here: https://github.com/w3c/ruby-t2s-req/issues/34

cookiecrook commented 8 months ago

@russellgalvin wrote:

The point is that you never have to update an alphabet whereas a dictionary has to be regularly updated with new vocabulary. There is no thought that dictionaries of the English language would ever be encoded in Unicode.

Thank you for the alphabet/dictionary metaphor. I think we are getting closer to a shared understanding.

I hope we agree that symbolic sets are different from both alphabets and dictionaries. While different from a phonetic glyph to abstract sound representation of a concept (f+o+x = 🦊), there are conceptual character ~"registries" in some human languages, especially CJK. One example is "京", often pronounced "jing" in Chinese, and "kyo" in Japanese, while retaining its conceptual meaning of "capital" or "capital city."

Some Chinese usage:

Some Japanese usage:

I agree that an unabridged dictionary has no place in Unicode, but from what I understand of symbolic sets, they are closer to a conceptual character registry than an alphabet: somewhere in the middle of your alphabet-to-dictionary spectrum. Updates to dictionaries don't need Unicode updates because they don't include any new glyphs. That's different from what is being asked for wrt Bliss.

I attempted to make this point above with the emoji example: "👩" + "⚕️" + "🏾‍" = "👩🏾‍⚕️"... Emoji is updated frequently, but conceptual registries are not limited to emoji. I think there would be a similar expectation of frequent updates in Unicode for a symbolic range (Bliss-specific or broader) if it were updating as frequently as you've implied.

Perhaps it would help if you gave an example of how the Bliss glyph above ([image: Cup]) may be used in a scenario unrelated to the concept of a "cup"? Or perhaps it'd be better to use a different abstract concept, like "be" or "want." Or perhaps the diacritic-like upward and downward pointing carets (^) in @DuncanMacWeb's example above could be used to explain why the Unicode registry can't work. Thanks.

russellgalvin commented 8 months ago

Apologies if this had previously been covered, but does "zbl" include only Bliss to the exclusion of the overlapping ARASAAC symbol set? Assuming the goal is language-independent communication, why shouldn't the Unicode block (and BCP 47 "lang" value) be a superset (Bliss+ARASAAC+others) of pictorial, ideographic concepts? I get that Bliss and ARASAAC were developed independently and are syntactically distinct, but it still seems like the conceptual registry could be a shared superset.

By definition BCP 47 identifies a language. So yes, "zbl" includes only Bliss as it represents a language albeit with variations. These variations, such as Blissym by Douglas Crockford of JSON fame (https://www.crockford.com/blissym.html), are not widely used. Semantography, Charles Bliss' original Bliss precursor, would also obviously qualify. ARASAAC and other AAC symbol sets are not considered by many to be true languages, in the sense that they are solely pictographic representations of ideas without semantic or grammatical structure. They are also certainly not versions of Bliss and thus would not qualify to be a subtype of "zbl".

russellgalvin commented 8 months ago

What you seem to be skirting around is the need for a higher level protocol. And possibly dedicated semantic markup.

I don't think I'm skirting around anything...at least not intentionally. The issue here, as I understand it, is what is to be used as the ID for a concept. The WAI-Adapt group developed the standard for how markup is to be done for this before I became involved. (see https://www.w3.org/TR/adapt-symbols/#symbol-explanation) It is really a separate discussion.

Note that those examples are designed so that any symbol set can be used for what is displayed to the user. The IDs used in the examples are the BCI-AV IDs. These were selected for several reasons including historical - that Bliss was basically the first AAC symbol set and effectively already IS the reference set in the AAC field with many (most?) other symbol sets already mapped to it. I have personally assisted several vendors to do such mappings and am aware of at least one such effort going on right now. It is conceivable that Unicode representations of Blissymbols could be used in place of BCI-AV IDs but I believe this to be misguided. It would tie the Bliss representation of the concept to the ID. This would create a barrier to the introduction of new WAI-Adapt usable vocabulary in any symbol set other than Bliss because it would require a Blissymbol to be created for the concept before an ID could exist because the ID would actually BE the representation. This is a classic case of unnecessary coupling in software engineering that will only create problems.

cookiecrook commented 7 months ago

It is conceivable that Unicode representations of Blissymbols could be used in place of BCI-AV IDs […]

Thanks for confirming there's no technical barrier to using Unicode. We're making progress.

but I believe this to be misguided. It would tie the Bliss representation of the concept to the ID. This would create a barrier to the introduction of new WAI-Adapt usable vocabulary in any symbol set other than Bliss because it would require a Blissymbol to be created for the concept before an ID could exist because the ID would actually BE the representation.

I don't see this standard process as restricting anything. WAI-Adapt etc can use polyfills to represent new concepts that haven't yet made it into Unicode. Aren't polyfills how Bliss is rendered now anyway? The polyfill could be updated and/or removed once the glyph can be rendered natively. I don't understand why this is a problem.

This is a classic case of unnecessary coupling in software engineering that will only create problems.

In contrast, I see this as solving several problems. "…introduction of new WAI-Adapt usable vocabulary" would not be slowed down unless it required a new glyph or joiner rule. More importantly, if some new glyph/ligature or joiner rule is needed, the Unicode standardization process can ensure the rendering and adoption in each OS, browser, file format, and font.

russellgalvin commented 7 months ago

Thanks for confirming there's no technical barrier to using Unicode. We're making progress.

Yes, I agree we are making progress.

I don't understand why you would want to base a standard on polyfills which are essentially a kludge, a temporary solution. And you're saying that the WAI-Adapt group is going to provide the polyfills? If they are going to do that, then why not just have them maintain a database of IDs that they can issue permanently when a new one is needed? They don't need to be the BCI IDs. We can easily create a new set of IDs and map them to the BCI ones. It just moves the maintenance over to WAI-Adapt.

And please realize that if we take a polyfill solution it is also permanent. Why? Because there will be concepts that come up that will never be implemented in Bliss. The polyfill is then there forever because...the Unicode string for the Blissymbol will never be created and thus the Unicode-based ID will never exist. Might as well just make the polyfills the standard.

Although not central to the issue, perhaps you could explain how you see Bliss being rendered by polyfills. Bliss is primarily displayed as graphic elements right now. There exists some older software that uses an ASCII font to render Bliss but this is now obsolete. Most actual Bliss users use dedicated devices that display symbols using JPGs, PNGs, etc. Those devices are designed that way because Bliss alternatives - of which at least one will be supported on any of these devices - are purely pictographic so it is a natural choice. When Bliss is finally part of Unicode this initially may or may not change, depending on the device and decisions of the user, facilitator, or device designer. It will certainly change for other people for whom Bliss is not their first language but are interested in it and use it in a non-specialized computing environment. But the other symbol sets - of which there are many - will not likely make it into Unicode. This is relevant because it means that WAI-Adapt must always consider both approaches.

In contrast, I see this as solving several problems. "…introduction of new WAI-Adapt usable vocabulary" would not be slowed down unless it required a new glyph or joiner rule. More importantly, if some new glyph/ligature or joiner rule is needed, the Unicode standardization process can ensure the rendering and adoption in each OS, browser, file format, and font.

This sounds to me as though you are thinking that the Unicode string used as an ID for the concept would be directly rendered in non-Bliss symbol sets. This is not the case. Bliss has a much different structure than other AAC symbol sets. In fact, other symbol sets do not have structure. Each symbol is monolithic whereas Bliss is built up of characters in the same way as other languages. So the Unicode-string-ID would only function as an ID in the same way that the BCI ID functions as an ID. Bliss glyphs will never have much variation as the dimensions of the primitive shapes used are defined in the language specification. Other AAC symbol sets do not have glyphs because they do not have characters. They have full pictures of things and scenes representing ideas, etc. So it can only be a one-pictograph-to-one-ID mapping, not a character-to-character and therefore glyph-to-different-glyph mapping. So you can apply all the Unicode standardization process in the world but it won't change the essential differences in the AAC symbol sets.

annevk commented 7 months ago

A glyph can be a full picture, no? That's essentially what many emoji are. And a glyph can consist of multiple code points underneath (again, many emoji have this property too).

russellgalvin commented 7 months ago

A glyph can be a full picture, no? That's essentially what many emoji are. And a glyph can consist of multiple code points underneath (again, many emoji have this property too).

Yes, okay, point taken. The resulting image from combining characters is also a glyph.

What I was trying to get across is that no one has - as far as I know - deconstructed non-Bliss AAC symbol sets into characters that can be combined to create composite images. It would be possible to do so in very much the same way as it is done in Emoji. But it still wouldn't map correctly to Bliss. There would be some correct mappings - and I'd have to look at individual cases to see what the percentage would be - but I wouldn't expect it to be very high.

russellgalvin commented 4 months ago

After further thought and discussion, I have been persuaded that whatever approach is taken, there will always be a delay between the creation of a new symbol (and need for its use) and the assigning of an authoritative ID, so polyfill/shim type solutions are going to be used anyway by the browser, plugin, etc. So if there is no real advantage to using Bliss IDs then of course Unicode string IDs are preferable. There will be a more extensive response from the group posted here shortly.

cookiecrook commented 2 months ago

Not sure why the cross-reference didn't show up here automatically, but there is some discussion about how to differentiate ambiguous uses of Ruby, and it's possible that disambiguation could include symbolic use.

matatk commented 1 month ago

Hi everyone,

Thanks for your interest, suggestions, and discussion so far. It looks like there is general consensus that it could be possible to use Unicode code points that correspond to Bliss-characters to refer to concepts (i.e. as the adapt-symbol attribute's value) - currently the attribute expects Bliss IDs (integers). Let's look at how both of these would be authored.

To help you understand the following examples, think of Bliss-characters in the same way as you would Chinese or Japanese characters - they mean something on their own (as opposed to representing phonetics, like individual letters in e.g. English), and you can combine them to make more complex words.

Bliss IDs, on the other hand, exist for both the individual characters, and for the more complex words.

Here are a couple of examples: in each example, the same content is marked up with Bliss IDs (integers) and with the corresponding Bliss-characters (which would be input via their separate Unicode code point(s)). Because the code points are not finalised yet, and a Bliss-supporting font is not widely installed, images that show what the authoring experience may look like are used in the examples.

Example 1 (concept requiring a single Bliss-character to identify)

<p>Would you like a <span adapt-symbol="13882">drink</span>?</p>

The same HTML mark-up as above, with the Bliss ID replaced by a mock-up of the single Bliss character for "drink"

Example 2 (concept requiring multiple Bliss-characters to identify)

<p>A nice cup of <span adapt-symbol="17511">tea</span>.</p>

The same HTML mark-up as above, with the Bliss ID replaced by a mock-up of the two Bliss characters for "tea"
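To give a feel for the consuming side, here is a deliberately naive sketch that finds `adapt-symbol` values in the two examples and prepends an image from the user's symbol set. The `userSymbols` table, the file names and the `annotate` helper are hypothetical; a real user agent or extension would work on the DOM rather than on markup with a regular expression.

```javascript
// The two snippets from the examples, joined into one string.
const html =
  '<p>Would you like a <span adapt-symbol="13882">drink</span>?</p>' +
  '<p>A nice cup of <span adapt-symbol="17511">tea</span>.</p>';

// Hypothetical table mapping Bliss IDs to images in the user's symbol set.
const userSymbols = new Map([
  ["13882", "symbols/drink.png"],
  ["17511", "symbols/tea.png"],
]);

// String-level transform for illustration only: put the user's symbol in
// front of each annotated span, leaving unknown keys untouched.
function annotate(markup, symbols) {
  const pattern = /<span adapt-symbol="([^"]+)">([^<]*)<\/span>/g;
  return markup.replace(pattern, (match, key) => {
    const src = symbols.get(key);
    return src ? `<img src="${src}" alt=""> ${match}` : match;
  });
}

console.log(annotate(html, userSymbols));
```

The same lookup would work unchanged if the attribute value were a string of Bliss code points instead of an integer, since either form is just an opaque key at this stage.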

We want to make it easy for authors to find the Bliss-characters (code points) that correspond to the concept they want to convey. Even looking up info on the code points will not give them the full set of recognised concepts. This is because most recognised concepts (each identified by a single Bliss ID) are composed of multiple Bliss-characters, as per the example above. There are approximately 6,500 Bliss concepts, but only around 1,400 Bliss characters being added to Unicode.

In order to assist content authors, we have two things to add:

We're very interested in your views on this, including both implementation concerns and any thoughts you may have on the authoring side of things.

Important note: the inclusion of Bliss in Unicode is not yet finalised, so we must wait for that before making any normative specifications based on any Bliss-related code points.

Thanks to @russellgalvin for co-drafting this comment.

aphillips commented 1 month ago

(as an individual contributor, chair hat off)

> We have a W3C Registry spec that enumerates existing concepts, and their corresponding Bliss IDs - this can be updated to include the Unicode code points that also identify the concept. We also hope to have the concept descriptions localised into several different languages.

If Unicode is encoding Bliss symbols, it would probably be a better idea to have such a "registry" at Unicode so that non-Web users have access to it as well. Unicode already manages stuff like this, cf. emoji sequences.

I also notice that your examples use decimal numbers for the Bliss symbols. Unicode code points generally use hex notation. Are the decimal values well known to users? How are the code point values and Bliss numbers related?

matatk commented 1 month ago

@aphillips, to address your second question first, two different schemes for identifying symbols have been proposed:

  1. Bliss concept IDs - these are integers, and correspond to 1 or more Bliss characters. This is what we, the Adapt TF, proposed to start with.

  2. Direct entry of the Unicode characters - this was proposed in this thread, and involves inputting the Unicode code points directly as UTF-8 in the attribute value (hence no numbers are given for those examples). In order to cover some established Bliss concepts, more than one code point (Bliss character) may be required.

The examples are presented side-by-side above to indicate the authoring considerations for these two approaches.
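As a rough illustration of how something consuming the attribute might tell the two forms apart, here is a minimal Python sketch. It assumes that a value made up only of ASCII digits is a Bliss concept ID and that anything else is a run of characters; that rule, the function name and the returned labels are all hypothetical and not part of any specification. Code points are shown in the usual hex `U+` notation, and the sample characters are Private Use Area placeholders rather than real Bliss code points.

```python
def describe_symbol_value(value: str):
    """Classify an attribute value as a Bliss ID or a character sequence.

    Hypothetical rule for illustration only: all-ASCII-digit values are
    treated as decimal Bliss concept IDs, anything else as characters.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty attribute value")
    if value.isascii() and value.isdigit():
        return ("bliss-id", int(value))
    # List each character's code point in hex U+ notation.
    return ("characters", [f"U+{ord(ch):04X}" for ch in value])


# Scheme 1: a decimal Bliss concept ID, as in the "drink" example.
print(describe_symbol_value("13882"))
# Scheme 2: characters entered directly (placeholder code points).
print(describe_symbol_value("\U000F0002\U000F0003"))
```

The decimal Bliss ID and the hex code points are unrelated numbering schemes; only a registry entry connects the two.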

We believe a registry would be helpful with both approaches - and necessary for the latter. Bliss maintains the list of concepts, which is updated regularly. They partnered with W3C to make that list (registry) more easily reachable.

To answer your first question: I see your point. I don't think that Unicode is pursuing such work, though. We can check with our Bliss expert.

Thanks for your question and suggestion - does this clarify things?

russellgalvin commented 2 weeks ago

The list of zero width joiner emoji sequences maintained by Unicode (https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html) is indeed analogous to what a registry of the Blissymbol vocabulary would look like. However, note that it is maintained not as a registry but as a "recommended list"...and this is for a symbol set with billions of users. The AAC world is much smaller than that and I'm not sure we would find sufficient Unicode support. No harm in asking, though.