@mcdurdin how do we know if a name is problematical? The newest langtags weeds out pejorative names. The new names use commas and the old names don't. Is that a problem?
@mcdurdin
- Targets for athinkra_vai is store(&TARGETS) 'web desktop'
- There is no .js file in the .kps
- keyboard_info does not have anything about targets.
- I downloaded the .kmp file and there is no .js file in the .kmp
So what is the issue? Is it just that it's in the wrong order (should be desktop web)?
I looked at one other keyboard and it also had store(&TARGETS) 'web desktop'.
If that is throwing it off, we should be able to just update the .kmn without making any version changes, correct?
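To make the ordering question concrete, here is a minimal sketch in Python. It assumes the target list is a space-separated set in which order carries no meaning; that is an assumption for illustration, not something this thread confirms. The regular expression and the function name parse_targets are hypothetical helpers, not part of any Keyman tool.

```python
import re

# Matches a Keyman source line such as: store(&TARGETS) 'web desktop'
# (hypothetical helper pattern; the real compiler has its own parser)
TARGETS_RE = re.compile(r"store\(\s*&targets\s*\)\s*(['\"])(.*?)\1", re.IGNORECASE)


def parse_targets(kmn_line: str) -> frozenset:
    """Return the targets named on a store(&TARGETS) line as an unordered set."""
    match = TARGETS_RE.search(kmn_line)
    if match is None:
        return frozenset()  # the line is not a &TARGETS store
    return frozenset(match.group(2).lower().split())


# Under the stated assumption, both spellings describe the same targets.
same = parse_targets("store(&TARGETS) 'web desktop'") == parse_targets(
    "store(&targets) 'desktop web'"
)
```

If the assumption holds, reordering the store would be a cosmetic tidy-up rather than a functional change.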
how do we know if a name is problematical? The newest langtags weeds out pejorative names. The new names use commas and the old names don't. Is that a problem?
All super good questions @LornaSIL :grin: At this point, I think a quick sanity check is sufficient.
- Targets for athinkra_vai is store(&TARGETS) 'web desktop'
- There is no .js file in the .kps
- keyboard_info does not have anything about targets.
- I downloaded the .kmp file and there is no .js file in the .kmp
So what is the issue? Is it just that it's in the wrong order (should be desktop web)?
Okay, perhaps my table was unclear. The 'unexpected platforms' column shows places where the old compiler was giving us targets such as mobileWeb which we probably don't want. The new compiler is giving us better data overall, and so the 'quick sanity check' here is probably just a scan down the column from your perspective to see if anything stands out as obviously wrong. I saw nothing wrong when I checked, so this table is as much for documentation of the change as anything.
The names look fine. I looked at the targets and they all seemed correct, but I did a PR to tidy up all the targets statements to the minimal statement. No change to version numbers.
@mcdurdin Minor FYI re: (We are on 1.3.1, which is latest published version AFAICT)
The record has:
"api": "1.3.1",
"date": "2023-05-02",
"tag": "_version"
So the 1.3.1 is the version of the API and (hopefully) won't change too often. The date is from the last release. We hope to make another release tomorrow.
Ah, gotcha! We are currently on 2023-05-04 and don't plan to update to the next version until the next major release now:
https://github.com/keymanapp/keyman/blob/master/resources/standards-data/langtags/langtags.json#L15-L19 currently shows:
{
"api": "1.3.1",
"date": "2023-05-04",
"tag": "_version"
},
I am currently doing a rewrite of the Dinka keyboard and noticed this issue. I take it that you are moving from a BCP47 definition to a CLDR definition of the language subtags? If so, can we use the -x- extension in language tags?
I take it that you are moving from a BCP47 definition to a CLDR definition of the language subtags? If so can we use the -x- extension in language tags?
Not quite. First, we won't support -x- extensions until 18.0.
Second, our restriction was the lang-script-region subtag triplet because of various operating systems that didn't support more expressive tags. We are moving towards defining the best subtag for the keyboard, and gracefully degrading the subtag for those OSes that don't support arbitrary tags. It wasn't really a BCP47 vs CLDR thing.
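As an illustration of what "gracefully degrading" a tag to the lang-script-region triplet could look like, here is a rough Python sketch. It is not Keyman's implementation: the function name degrade_tag is hypothetical, the example tags are made up, and the parsing is deliberately simplified (it stops at the first subtag that is not a script or region, so extlang subtags are not handled).

```python
def degrade_tag(tag: str) -> str:
    """Reduce a BCP 47 tag to at most language-script-region.

    Simplified sketch: everything from the first variant, extension
    singleton or private-use marker onwards is dropped.
    """
    subtags = tag.split("-")
    kept = [subtags[0].lower()]  # primary language subtag
    script = region = None
    for sub in subtags[1:]:
        if script is None and region is None and len(sub) == 4 and sub.isalpha():
            script = sub.title()  # e.g. Latn
            kept.append(script)
        elif region is None and (
            (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit())
        ):
            region = sub.upper()  # e.g. SS, or a three-digit UN M.49 code
            kept.append(region)
        else:
            break  # variant, extension (-u-) or private use (-x-): stop here
    return "-".join(kept)


degraded = degrade_tag("din-Latn-SS-x-unified")  # hypothetical private-use tag
```

An OS that only understands the triplet would then be handed the degraded tag, while platforms with full tag support keep the original.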
Given this line in the message above:
din-Latn el_dinka Dinka (Latin) Dinka, Southwestern (Latin)
What would the language name for din-Latn resolve to? I will need to distinguish between Dinka (Latin) and Dinka, Southwestern (Latin).
What would the language name for din-Latn resolve to? I will need to distinguish between Dinka (Latin) and Dinka, Southwestern (Latin).
I'm curious: what is the distinction that you need to work with? From langtags.json, din-Latn resolves to 'Dinka (Latin)'. Furthermore, the minimal tag is din (only Windows needs the -Latn suffix).
{
"full": "din-Latn-SS",
"iana": [ "Dinka" ],
"iso639_3": "din",
"localname": "Thuɔŋjäŋ",
"localnames": [ "Thuɔŋjäŋ" ],
"name": "Dinka",
"names": [ "Dinka, Southwestern", "Thoŋ ë Muɔnyjäŋ", "Thuɔŋjäŋ", "Western Dinka" ],
"region": "SS",
"regionname": "South Sudan",
"script": "Latn",
"sldr": true,
"tag": "din",
"tags": [ "dik", "dik-Latn", "dik-Latn-SS", "dik-SS", "din-Latn", "din-SS" ],
"windows": "din-Latn"
}
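The lookup described here can be sketched in Python as follows. This is an illustration only: DINKA is a trimmed copy of the record above, and find_record and minimal_tag are hypothetical helper names, not functions from Keyman or from the langtags project.

```python
# Trimmed copy of the langtags.json record shown above.
DINKA = {
    "full": "din-Latn-SS",
    "name": "Dinka",
    "tag": "din",
    "tags": ["dik", "dik-Latn", "dik-Latn-SS", "dik-SS", "din-Latn", "din-SS"],
    "windows": "din-Latn",
}


def find_record(tag, records):
    """Find the record whose tag, full tag or equivalent tags include `tag`."""
    wanted = tag.lower()
    for rec in records:
        known = [rec["tag"], rec["full"], *rec.get("tags", [])]
        if wanted in (k.lower() for k in known):
            return rec
    return None


def minimal_tag(tag, records, windows=False):
    """Return the record's minimal tag, or its Windows form when requested."""
    rec = find_record(tag, records)
    if rec is None:
        return tag  # unknown tag: leave it untouched
    if windows:
        return rec.get("windows", rec["tag"])
    return rec["tag"]


normalized = minimal_tag("dik", [DINKA])  # dik is listed as equivalent to din
```

With this record, both dik and din-Latn-SS collapse to din, which is why the two codes cannot be used contrastively when names are resolved this way.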
Ahhh, it is using the CLDR definition.
In BCP-47 din is a macrolanguage.
In CLDR din is equated with dik, with din as preferred form.
What I will need to do is distinguish between the unified orthography and existing dialects, especially when it comes to the lexical models. So din would cover an orthography and grammar that is cross-dialectal, and the individual language codes, including dik, would represent the existing dialect-specific approaches. So I would need din and dik to be contrastive. But from the data you include above, din and dik are not contrastive, i.e. this is the CLDR approach, where a macrolanguage code is equated with a specific language.
I guess I'd need to log an application for a new variant subtag for BCP-47, applied to all six language subtags. But can variant subtags be used in Keyman?
@srl295, @DavidLRowe, thoughts?
But can variant subtags be used in Keyman?
In v18 this will be possible. But let's see what others suggest first as well.
@andjc I was also curious as to what you meant by "BCP47 vs CLDR". This clarifies somewhat. Encompassed languages are part of the BCP47 spec though, see https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1.2
@mcdurdin Are you saying that Keyman wouldn't allow a din keyboard contrasting with a dik keyboard?
My reading of BCP47, as pertaining to Dinka is that applications may (and CLDR locale data prefers to) use din (macro) to refer to the primary encompassed language, dik, but it also allows applications to choose to use the specific encompassed tags such as dik. So you could have data tagged dik, dip, diw etc.
But this is for a language, not an orthography. I think din vs dik could be used contrastively as to a language group vs. individuals, but I don't think it should be used contrastively for indicating an orthography distinction.
If the unified orthography is the expected default (i.e. what you get when you request bare din or even dik as languages), then what I'd recommend is a new subtag of some form for the pre-unified. Perhaps something similar to the following (which is a unified historical variant, so the opposite case in some sense).
Type: variant
Subtag: baku1926
Description: Unified Turkic Latin Alphabet (Historical)
Added: 2007-04-18
Prefix: az
Prefix: ba
Prefix: crh
Prefix: kk
Prefix: krc
Prefix: ky
Prefix: sah
Prefix: tk
Prefix: tt
Prefix: uz
Comments: Denotes alphabet used in Turkic republics/regions of the
former USSR in late 1920s, and throughout 1930s, which aspired to
represent equivalent phonemes in a unified fashion. Also known as: New
Turkic Alphabet; Birlәşdirilmiş Jeni Tyrk
Әlifbasь (Birlesdirilmis Jeni Tyrk Elifbasi);
Jaŋalif (Janalif).
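To show how the Prefix fields in a record like this relate to actual tags, here is a small Python sketch. It uses a shortened copy of the record (only three of the Prefix lines), the helper names are hypothetical, and the parser does not handle folded continuation lines such as the Comments field. My understanding of RFC 5646 is that Prefix is a recommendation rather than a hard rule, so a failed check here means "not a registered combination", not "malformed tag".

```python
# Shortened copy of the registry record quoted above (only some Prefix lines kept).
RECORD = """\
Type: variant
Subtag: baku1926
Description: Unified Turkic Latin Alphabet (Historical)
Added: 2007-04-18
Prefix: az
Prefix: kk
Prefix: uz
"""


def parse_record(text):
    """Parse 'Field: value' lines into a dict of lists (fields may repeat)."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a field name
        fields.setdefault(name.strip(), []).append(value.strip())
    return fields


def matches_registered_prefix(tag, record):
    """True if `tag` uses the record's variant after one of its prefixes."""
    subtags = tag.lower().split("-")
    variant = record["Subtag"][0].lower()
    if variant not in subtags[1:]:
        return False
    before = "-".join(subtags[: subtags.index(variant)])
    return any(
        before == p.lower() or before.startswith(p.lower() + "-")
        for p in record["Prefix"]
    )


record = parse_record(RECORD)
ok = matches_registered_prefix("az-Latn-baku1926", record)
```

A Dinka orthography variant registered the same way would presumably list each Dinka language code as its own Prefix line.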
IIUC there are five languages that are identified with the name "Dinka":
- dip Northeastern Dinka
- diw Northwestern Dinka
- dib South Central Dinka
- dks Southeastern Dinka
- dik Southwestern Dinka
In addition there is: din Dinka macrolanguage.
dik Southwestern Dinka is considered the representative language for din, and so din is used instead of (is preferred over) dik.
From Keyman's point of view, a keyboard for Southwestern Dinka should use din (rather than dik) as the BCP 47 code and (ideally) should include all the characters needed to type any of the other four languages included in the Dinka macrolanguage.
Steven mentioned section 4.1.2 of RFC 5646 which defines BCP 47. That does allow din-dik, din-dip, etc. as valid BCP 47 codes that are equivalent to dik, dip, etc. But I'm not sure that gets you any further. (And I don't know that Keyman would swallow them!)
I don't know if any of that is useful for your specific case.
Note: reopening this issue so it is visible due to current conversation. We can close again once we are happy with the outcome, or move the conversation to a new issue.
I must admit after reading all this I still don't know the answers! This aspect of BCP47 breaks my brain every time I run across it.
@mcdurdin Are you saying that Keyman wouldn't allow a din keyboard contrasting with a dik keyboard?
Per langtags.json, as shown above, Keyman would normalize dik -> din.
There is no official orthography per se; unfortunately, there is no real National Language Policy.
In actual use in South Sudan and across the diaspora, you will find the pre-1990s orthography, the 1990s orthography, and more recently the Unified orthography and grammar.
There are no real corpora available. There are word frequency lists based on the Rek and Pandang Bibles. But the Bibles, if I remember correctly, are copyrighted, so the legality of the word lists is questionable. Both of these are based on the current (1990s) orthographies for each dialect. So far the Bor Bible hasn't been data-mined.
There is Wikipedia, but most of the articles would be based on the Unified orthography.
In terms of keyboards ... the orthographic variations don't really matter. Although the character repertoire needed for the Unified orthography and grammar is larger ... more exemplar characters. But that is neither here nor there in terms of language tagging and exposing the keyboard to users.
The real question is how to identify lexical models. Language tags will not work.
@andjc I think what you are saying is that the unified orthography is the 'default' orthography going forward. In that case, I would think that keyboards, and lexical models, for unified orthography could use the following (trying to make a concrete proposal):
- dip Northeastern Dinka
- diw Northwestern Dinka
- dib South Central Dinka
- dks Southeastern Dinka
- din for Southwestern Dinka (encompassed dik)
The exemplars do matter in principle for the various orthographies, but I hear you that pragmatically it's not going to make as much of a difference.
Then, for lexical models targeting prior orthographies, or other variations, I would use some kind of variant tag (none of the below are registered currently, of course):
- din-di1990 perhaps for a 1990s orthography Southwestern Dinka
- dks-di1990 for 1990s orthography Southeastern Dinka
- diw-rejaf for the 1928 (pre-1990s?) orthography (per omniglot https://www.omniglot.com/writing/dinka.php)
Via the -u- extension it could perhaps be din-u-va-di1990 or diw-u-va-rejaf.
Edit: What I'm trying to say is that, generally, I'd support some other kind of tag as appropriate for the variations mentioned here. Yes, one can find examples of 3- and even 2-letter language codes that are arguably dialects or orthography distinctions of each other, but my understanding is that that isn't necessarily a justification for creation of a new language code.
@andjc If you'd like, you could consider filing a CLDR ticket with this use case to see if there would be CLDR-TC support or formal guidance on this use case (get some other BCP47 eyes on it), or support for an IANA variant registration.
Steven, I'd tend to go the other way: leave the 1990s orthography as default, and the unified as variant. At this point of the game it is hard to tell if the unified will become the de facto standard or not.
It also means minimal change, since everything currently language tagged would remain the same, rather than everything suddenly becoming mistagged.
Yep, in terms of exemplar characters, the same keyboard will support all, at least in the case of this keyboard.
The v17.0 compiler has two minor differences in how it builds .keyboard_info files.
Platform support differences
The old compiler could not detect if a given keyboard was web or mobile -- so it erred on the side of listing both. The new compiler uses the &targets store consistently (for any keyboards that have a .kmx output). This means that a number of keyboards will no longer be listed as supporting mobileWeb, or android/ios/mobileWeb, or in a few cases, desktopWeb.
We need to verify that these changes are not going to cause trouble by making keyboards unavailable where they should be available.
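One way to run that verification is to diff the platform lists produced by the old and new compilers and look only at keyboards that lost a platform. The Python sketch below is an assumption-laden illustration: the keyboard ids and platform lists in SAMPLE are invented, and the function names are hypothetical. Only the platform names mobileWeb, android, ios and desktopWeb come from this issue.

```python
def dropped_platforms(old, new):
    """Platforms present in the old listing but missing from the new one."""
    return sorted(set(old) - set(new))


def review(listings):
    """Return only the keyboards that lost at least one platform.

    listings maps keyboard id -> (old_platforms, new_platforms).
    """
    report = {}
    for keyboard_id, (old, new) in listings.items():
        dropped = dropped_platforms(old, new)
        if dropped:
            report[keyboard_id] = dropped
    return report


# Invented sample data for illustration; not taken from real .keyboard_info files.
SAMPLE = {
    "example_a": (["desktopWeb", "mobileWeb"], ["desktopWeb"]),
    "example_b": (["android", "ios", "mobileWeb", "desktopWeb"], ["desktopWeb"]),
    "example_c": (["desktopWeb"], ["desktopWeb"]),
}

report = review(SAMPLE)
```

Each entry in the resulting report is a candidate for the manual check: either the platform was wrongly listed before, or the keyboard's &targets store needs updating.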
Language name differences
The old compiler used language-subtag-registry to determine language, script and region names. The new compiler makes use of langtags.json for language names, and Intl.DisplayNames for script and region names.
We should review the list of language names to ensure there are no problematic changes. Note that this data only affects the .keyboard_info files, not packages.