w3c / afrlreq

African language enablement for the Web

How is te-kerende used? #28

Open r12a opened 1 year ago

r12a commented 1 year ago

The Unicode Proposal for inclusion of te-kerende describes it as:

a character used to link compounds together.

and gives the following examples:

[Image: screenshot of the examples in the Unicode proposal, 2023-03-13]

It has been suggested that 'link compounds together' is not related to compound nouns, but is rather a special kind of distributional construction that N’ko authors sometimes mark this way. Can anyone explain this usage in a little more detail or provide me with some better wording for the lreq doc?

donaldsoncd commented 1 year ago

I was the one that brought this up.

I don't know how you should word things in the document, but these constructions are not Manding "compounds" in a linguistic sense.

Do you want a linguistic explanation of the "distributive construction" or are you looking for an explanation of the orthographic convention used to represent it?

In Latin-based Bambara, there is no special way to mark a distributive construction like there is in the N'ko based tradition. For instance, in the Latin-based tradition:

ko o ko

'each and every affair' [NOTE: You glossed it differently in your examples image]

But in the N'ko orthographic tradition it would be done like this:

ߞߏ_ߏ_ߞߏ߫ {Kó ̀_ó ̀_kó} Ko o ko 'each and every affair'

[NOTE: I also see that your vowel length isn't right on mɔɔ ɔ mɔɔ. You also don't include tonal diacritics. Not sure what your convention for transliterating or interpreting N'ko in Latin-based orthography is in the document.]

r12a commented 1 year ago

[Just to be clear, the image containing examples above is screen snapped from the Unicode proposal for adding a character to the N'Ko block. The transcriptions are nothing to do with me.]

Looks like i did attempt some transcriptions at https://r12a.github.io/scripts/nkoo/nqo.html#word though (which may indeed need to be changed - i don't remember the source of those). In those notes i assume that one is supposed to use U+07FA NKO LAJANYALAN to represent the te-kerende. Any thoughts on that?

@donaldsoncd do you have a pointer to an explanation of how these kinds of linguistic device are used? Searching doesn't seem to yield anything useful. I think i get the general idea, but is it just used for a handful of words, or can one generate one's own te-kerende linked sequences?

DD-fwd commented 1 year ago

I would love to hear @donaldsoncd's response. I am commenting on @r12a's question about "... it is just used for a handful of words, or ...?" I think this is a common expression in Manding languages to express the individuality, repetitiveness, or infinity of a related action. For example, the pattern of su-u-su 'every night' in your posting can also be applied to give soma-a-soma (soma soma) 'every morning', tele-e-tele (tele tele) 'every afternoon', wura-a-wura (wura wura) 'every evening'. The same thing is possible for mo-o-mo 'everyone', ke-e-ke (ke ke) 'every man', moso-o-moso (moso moso) 'every woman'; you get the idea.

donaldsoncd commented 1 year ago

In Bambara and Jula, the distributive construction is built by inserting an o between two nouns like in all the examples given. It is infinitely productive. You can do it with any noun and it then means 'each/every/any X' depending on the context. For instance:

Cɛ o cɛ

'Each and every man'

Baara o baara

'Any (line of) work'

In some varieties of Manding (and in N'ko orthography), the vowel that is o in the first example above actually changes to harmonize with the noun (in other cases it basically is elided, but its tone [which I have ignored here both in terms of writing and its role in the grammatical construction itself] remains and influences the tonal realization of the two nouns). That is why we have @DD-fwd's examples:

kɛ o kɛ | kɛ ɛ kɛ | kɛ-kɛ
man DIST man

'each/every man'

In N'ko orthography the convention is to always write this grammatical construction with the vowel harmony option (as well as the appropriate tonal diacritics, since they play a role as well) PLUS the te-kerende underscore line.

For more details on this construction across Manding varieties, you could consult linguistic reference grammars such as:

r12a commented 1 year ago

Very helpful, @donaldsoncd. So would you agree that this is written using U+07FA NKO LAJANYALAN?

NeilSureshPatel commented 1 year ago

As far as I can tell, the te-kerende was never encoded. The lajanyalan is different since it connects to the letters on both sides. The te-kerende should not connect to the letters.

r12a commented 1 year ago

@NeilSureshPatel i concur about no separate encoding for the te-kerende, but i didn't find any rationale, or indication of what should be used. I'll see whether i can get some enlightenment from the Unicode Editorial folks on Thursday.

In my examples i've been using lajanyalan surrounded by spaces to create the appearance.
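A minimal Python sketch of this workaround, assuming the te-kerende is typed as U+07FA NKO LAJANYALAN with an ordinary space on each side. The helper name `link_with_te_kerende` is hypothetical, and the sample parts are taken from the 'each and every affair' example earlier in this thread.

```python
LAJANYALAN = "\u07FA"  # U+07FA NKO LAJANYALAN


def link_with_te_kerende(parts, space=" "):
    """Join parts with <space><lajanyalan><space> (hypothetical helper)."""
    separator = f"{space}{LAJANYALAN}{space}"
    return separator.join(parts)


# Parts of the 'each and every affair' example quoted earlier in the thread.
parts = ["ߞߏ", "ߏ", "ߞߏ߫"]
linked = link_with_te_kerende(parts)

print(linked)
# List the code points so the spaces and lajanyalan are easy to spot.
print([f"U+{ord(ch):04X}" for ch in linked])
```

The `space` parameter is there only so that a narrower space could be tried for finer typography; whether a given font renders the result acceptably is a separate question.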

NeilSureshPatel commented 1 year ago

@r12a I was just submitting an issue on the Noto N'ko repo and I saw this other issue, https://github.com/notofonts/nko/issues/5, which may hint at why the te-kerende wasn't encoded.

Part way down Denis says the following: "A resolution would be to add contextual positioning when 07FD NKO DANTAYALAN is next to a U+07F8 NKO COMMA, U+2010 HYPHEN, U+2011 NON-BREAKING HYPHEN. Note: U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN sit on the baseline in NKo, they need to be added to Noto Sans NKo."

This seems to suggest that the plan for N'ko is to use standard hyphens that are moved to the baseline. This seems a bit odd though.

donaldsoncd commented 1 year ago

Very helpful, @donaldsoncd. So would you agree that this is written using U+07FA NKO LAJANYALAN?

I know nothing about the encoding of this. I just do Latin underscores if I have to write it.

r12a commented 1 year ago

Debbie Anderson pointed me to a discussion at the UTC in 2016. See point 11 at https://www.unicode.org/L2/L2016/16037-script-rec.pdf.

The first character that is proposed, TE‐KERENDE, can be represented using U+2010 HYPHEN or U+2011 NON‐BREAKING HYPHEN, but would need to be designed in a font on the baseline. Note that U+2010 HYPHEN is used in such a way in Arabic text. The other three characters are well‐documented and straight‐forward.

This may be the source of the comments by @moyogo. I also think it's a bit odd. I took a look at the few resources i have to hand that provide selectable online text and found the following.

Wikipedia uses hyphen and lajanyalan on the same page, where the former looks like an ordinary hyphen (mid height, and no spacing), while the latter (surrounded by spaces) is used for what look like te-kerende. eg. ߖߌ߰ ߺ ߡߊ߬ ߺ ߖߊ߲߬ߝߊ߬ߓߊ߫ ߊߟߏ، ߋ-ߖߘߍ߬ߘߊ߲ߘߊ،ߌ- So it may be important to not make hyphens drop to the baseline and grow in size. As long as lajanyalan is surrounded by spaces (which appears to be the expected use), it seems to work more intuitively, visually.

Silabosoona at http://cormand.huma-num.fr/maninkabiblio/periodiques/silabosoona5.pdf also uses lajanyalan for te-kerende, but also for general phrase separators, eg.

ߊ߭ߜߊߘߡߝ ߺ ߋ ߺ ߋ߫ߝ߸ߋ߫ߦ ߏߟߍߘ ߍߞ ߺ ߂߂ ߀߂

NeilSureshPatel commented 1 year ago

This certainly is a bit messy. The standard mid height hyphen is used with numbers. This can be seen on page 1 of Silabosoona.

߆߂߁-߇߄-߀߃-߀߀

My guess is that the regular hyphen used in text on Wikipedia is more of a workaround than a preference. It is intuitive to use it since it is used in the Latin orthography, whereas typing spaces around a lajanyalan is less convenient.

The use of the lajanyalan with spaces does come with other problems. The lajanyalan is really wide compared to a te-kerende. This is exacerbated by the fact that it has negative side bearings for its normal joining behavior. When you add spaces the extra length becomes exposed. The other problem is that the parts of the lajanyalan that overlap with adjacent letters may not have square edges. This varies by font, but there are times the bottoms need to be curved or chamfered so that it doesn't punch through the join between an adjacent letter and its baseline stroke. This can be more extreme if any negative kerning is used. For example:

image

If the edge were squared off, the corner would make the join less smooth.

image

It is subtle in this example. If one were to apply effects, like outlined text, etc., this could become more obvious and problematic.

image

Without separate encoding, I think the best way to handle this is to have an alternate N'ko hyphen that is pushed down to the baseline which is replaced contextually (when nested between or following N'ko letters) in the font via rclt. This way the presentation can be controlled (squared edges, positive side bearings, narrower width, etc). If a font fails to do this you end up with a standard hyphen, without having to change the way the text is input. From what I recall, rclt works for N'ko shaping in all shaping engines. This is how I would be inclined to handle it anyway.

r12a commented 1 year ago

Hmm. Another part of the messiness is whether or not other people will be inclined to use the hyphen with the expectation that it will magically change position and shape in the required contexts, or will they (as they seem to be doing) just go for the thing that looks to them as if it's what they want to see on the page (ie. the lajanyalan). I looked at a number of other online resources, and those that contain te-kerende and dashes that separate phrases all use lajanyalan, so it seems it may have already become the de facto way of doing this.

I wonder whether it makes sense to do the opposite of what you're suggesting @NeilSureshPatel: ie. to fix the font so that the lajanyalan is the right width and has the right shaping when it appears between spaces. This may be an easier context to detect, given that spaces are, it seems, always present, and the joining behaviour is not relevant if spaces are on either side?

NeilSureshPatel commented 1 year ago

Ahh, yes good point @r12a. I guess once a workaround gets normalized we kind of have to work with it. I can see what you are suggesting working. The lajanyalan can be narrowed, squared off and have zero or near zero sidebearings. When strung together for justification it should still make a solid line.

A related approach is to take advantage of the fact that the lajanyalan can have positional forms. The isolated form can then be tuned for use as a te-kerende, and the positional forms can have more flexibility in design depending on the font. Spaces will break the shaping and default to the isolated form, as you say. A thin space would be ideal over a word space, but this can be handled in a handful of different ways.

jfkthame commented 1 year ago

One issue with using lajanyalan surrounded by spaces is that this will tend to allow a line-break to happen either side of it, whereas my understanding is that if a line-break is needed, it should always occur after the te-kerende. In theory, if the preceding space were a non-breaking space, that wouldn't be a problem, but in practice users will inevitably type normal spaces most of the time.
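As a rough illustration of the non-breaking-space idea, here is a Python sketch that post-processes text so that the space before a spaced lajanyalan becomes U+00A0 NO-BREAK SPACE, leaving the following space as the only break opportunity. The function name `protect_break_before` is hypothetical, and the sketch assumes the te-kerende was typed as `<space><lajanyalan><space>`.

```python
import re

LAJANYALAN = "\u07FA"  # U+07FA NKO LAJANYALAN
NBSP = "\u00A0"        # U+00A0 NO-BREAK SPACE

# An ordinary space that is immediately followed by <lajanyalan><space>.
_SPACE_BEFORE_TE_KERENDE = re.compile(rf" (?={LAJANYALAN} )")


def protect_break_before(text):
    """Swap the space before a spaced lajanyalan for NBSP (hypothetical helper)."""
    return _SPACE_BEFORE_TE_KERENDE.sub(NBSP, text)


sample = f"ߞߏ {LAJANYALAN} ߏ {LAJANYALAN} ߞߏ߫"
fixed = protect_break_before(sample)
print([f"U+{ord(ch):04X}" for ch in fixed])
```

A lajanyalan typed without spaces (that is, in its ordinary joining use) is left untouched, since the pattern only matches when a space follows it.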

r12a commented 1 year ago

I brought this up with the Script AdHoc (SAH) Unicode committee and consensus was reached that it is ok to use lajanyalan for te-kerende and certain other hyphen-like uses where the glyph is expected to look like a baseline extension surrounded by spaces.

NeilSureshPatel commented 1 year ago

Thanks for the update @r12a. I was curious to know where the discussion landed on the matter. What did the SAH say about the line-breaking concern that @jfkthame brought up? I think from a font production standpoint, I would still substitute a lajanyalan nested between spaces with an alternate form just to make it narrower and remove any modeling of the overlapping parts of the stroke.

r12a commented 1 year ago

@NeilSureshPatel The line-breaking discussion was put off for another day. A proposal would need to be submitted. Personally, i'm not so worried about that – just as with dashes in English, such as the one i just typed, people can use a nbsp if needed. I think the problem of handling line breaks around punctuation that is separated from the preceding text is a lot bigger than just N'Ko (think of dandas, French question marks, Mongolian commas, etc. etc.) and may need a more generalised solution.

I think that the proposal to shape the lajanyalan appropriately makes sense. I was planning to raise an issue in the Noto repo – would you prefer to do that? (You're better qualified than me to put the right points.)

Btw, i'm about to raise then close a gap report about this in our gap analysis framework, so that we can make the progress visible.

NeilSureshPatel commented 1 year ago

That makes sense, thanks. I'll take a look at the Noto design again to see if it would need to be adjusted and how. Noto uses very simple connections so it may only need a width adjustment. I'll raise an issue in the repo with the proper recommendation.

jfkthame commented 1 year ago

@NeilSureshPatel The line-breaking discussion was put off for another day. A proposal would need to be submitted. Personally, i'm not so worried about that – just as with hyphens, such as the one i just typed, people can use a nbsp if needed.

I wasn't part of the background discussion here, so may be missing lots of context. But personally, I think the conclusion is unfortunate, from a serving-the-users point of view.

Judging from the examples in Figure 5 of the Unicode proposal document, I don't think users would perceive the te-kerende as being separated by spaces from the surrounding words, so the natural instinct will be to type it without spaces. When they notice that this produces a joined form (because that's how lajanyalan behaves), they're just as likely to try something else such as a generic HYPHEN-MINUS or LOW LINE as to figure out that they should put spaces each side of it (and depending on the font in use, the result of adding spaces may look so bad — because lajanyalan is too long — that they reject that and go for HYPHEN-MINUS or even borrow the Arabic-script KASHIDA instead).

My suspicion is that "correct" use of <nbsp> <lajanyalan> <space> to represent te-kerende, along with a "smart" font that shapes lajanyalan appropriately for this context, will be an exotic rarity.

(The "hyphen" comparison isn't very persuasive, IMO. I notice that what's actually in your comment is not a hyphen but an en-dash — perhaps thanks to an autocorrect feature? When a punctuation dash with surrounding spaces is used in English, breaking the line before the dash — so that it appears at start-of-line — is much less jarring than breaking before te-kerende would be. The Latin-script analogue to N'Ko te-kerende would be a hyphen without any surrounding spaces, which does not permit a preceding break.)

r12a commented 1 year ago

My suspicion is that "correct" use of <nbsp> <lajanyalan> <space> to represent te-kerende, along with a "smart" font that shapes lajanyalan appropriately for this context, will be an exotic rarity.

One of the things driving this discussion was that i looked at a number of online texts to figure out what users do, and they all used <space><lajanyalan><space> for the te-kerende (and for various other hyphen/dash-like places).
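A survey like this could be roughly automated along the following lines. This is only a sketch under stated assumptions: the function name `survey`, the category labels, and the choice of candidate characters (the ones mentioned in this thread) are illustrative, and a real survey would still need someone to read the matches in context.

```python
import re
from collections import Counter

# Lajanyalan (U+07FA) with a space or no-break space on both sides.
SPACED_LAJANYALAN = re.compile(r"(?<=[ \u00A0])\u07FA(?=[ \u00A0])")

# Other characters mentioned in this thread as possible stand-ins.
OTHER_CANDIDATES = {
    "U+002D HYPHEN-MINUS": "\u002D",
    "U+2010 HYPHEN": "\u2010",
    "U+2011 NON-BREAKING HYPHEN": "\u2011",
    "U+005F LOW LINE": "\u005F",
    "U+0640 ARABIC TATWEEL": "\u0640",
}


def survey(text):
    """Count te-kerende candidates in a text sample (hypothetical helper)."""
    counts = Counter()
    spaced = len(SPACED_LAJANYALAN.findall(text))
    counts["lajanyalan between spaces"] = spaced
    counts["lajanyalan elsewhere"] = text.count("\u07FA") - spaced
    for label, ch in OTHER_CANDIDATES.items():
        counts[label] = text.count(ch)
    return counts


sample = "\u07DE\u07CF \u07FA \u07CF \u07FA \u07DE\u07CF\u07EB 1-2"
for label, n in survey(sample).items():
    print(f"{label}: {n}")
```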

It may be better to move this discussion to a separate issue focused on line-breaking for te-kerende.

(in my earlier comment i have just changed 'hyphens' to 'dashes in English', which i intend to cover hyphens and other dashes.)

r12a commented 1 year ago

... even borrow the Arabic-script KASHIDA instead

I'm not sure why they would do that, or why they would try not to use spaces. The lajanyalan is the N'Ko equivalent of the Arabic tatweel (which i assume you mean by kashida). And using it without spaces would immediately produce incorrect results, because (a) it would join with the adjacent characters (as would the tatweel), and (b) it wouldn't produce the gaps either side which always appear with te-kerende. So i don't think that users are likely to omit the spaces. (That said, for fine typography, they may perhaps choose slightly smaller spaces.)

NeilSureshPatel commented 1 year ago

These things are always weird. I think if the te-kerende had been encoded from the get-go, it would have been used readily. However, without it the most convenient thing to do is <space><lajanyalan><space>, thus making it the typical method. Probably, what should have happened is that the lajanyalan should not have been encoded, since the tatweel can be used for this purpose. The te-kerende should have been encoded instead.