keymanapp / lexical-models

Lexical language models for predictive text
MIT License

[cjp-latn] update model for automatic case selection #111

Open DavidLRowe opened 3 years ago

DavidLRowe commented 3 years ago

[gonzalez_quint_coto.cjp-latn.cabecar] Keyman version 14 adds automatic case selection to predictive text models. This applies only to languages with an upper/lower case distinction (those written in Latin or Cyrillic script, for example). Keyman Developer 14 is required, and the lexical model source file must also set a new property for automatic casing to work:

    languageUsesCasing: true

It's set in the .ts file, in the same place as the format, wordBreaker and sources properties. For example, the existing file might look like:

    const source: LexicalModelSource = {
      format: 'trie-1.0',
      wordBreaker: 'default',
      sources: ['wordlist.tsv'],
    };
    export default source;

And, with the addition of the new property, like:

    const source: LexicalModelSource = {
      format: 'trie-1.0',
      wordBreaker: 'default',
      sources: ['wordlist.tsv'],
      languageUsesCasing: true,
    };
    export default source;

This turns on case differentiation using the default configuration. Most likely the default behavior will be all you need, in which case no customization is required. If you do need to control how capitalization works, please consult the discussion in https://github.com/keymanapp/keyman/issues/3720 ("Example for Turkish").
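To illustrate why a language like Turkish may need custom casing, here is a minimal sketch of a casing helper that handles the dotted and dotless i explicitly. The names `CasingForm` and `applyTurkishCasing` are hypothetical and chosen only for this example. How such a function is actually hooked into a lexical model is described in the linked discussion, not here.

```typescript
// Hypothetical illustration only: these names are NOT taken from the Keyman API.
type CasingForm = 'lower' | 'initial' | 'upper';

// Turkish pairs i <-> İ and ı <-> I, which the default JavaScript
// toUpperCase()/toLowerCase() mappings do not respect.
function turkishUpper(text: string): string {
  return text.replace(/i/g, 'İ').replace(/ı/g, 'I').toUpperCase();
}

function turkishLower(text: string): string {
  return text.replace(/İ/g, 'i').replace(/I/g, 'ı').toLowerCase();
}

function applyTurkishCasing(form: CasingForm, text: string): string {
  switch (form) {
    case 'lower':
      return turkishLower(text);
    case 'upper':
      return turkishUpper(text);
    case 'initial': {
      // Uppercase only the first character and leave the rest unchanged.
      const [first, ...rest] = Array.from(text);
      if (first === undefined) {
        return text;
      }
      return turkishUpper(first) + rest.join('');
    }
  }
}
```

With the default mappings, `'istanbul'.toUpperCase()` produces a plain `I` as the first letter, whereas the helper above keeps the dot and produces `İ`.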

In addition, you'll need to change the version number and (probably) the copyright date, which will require you to update some other files:

(1) HISTORY.md will need a new entry with the new version number and the date of the change, something like:

    1.1 (2021-01-31)
    ----------------
    * Enables use of Keyman 14's case-detection & capitalization modeling features

Normally entries in this file are ordered with the latest date at the top of the list.

(2) README.md will need the version number changed. Probably the copyright date (or date range) will need to change as well, for example from "(c) 2020 Acme, Inc." to "(c) 2020-2021 Acme, Inc."

(3) LICENSE.md will need the same copyright change as used in README.md.

(4) The version number needs to be changed in the .kps file. In Keyman Developer, use "Packaging" to get to the .kps file, then on the "Details" tab update the version number and (if needed) the copyright statement.

(5) If you have a copyright statement in a "readme.htm" or a "welcome.htm" file, this will need to be updated with the same copyright change used in README.md. (Since these files are covered by the copyright statement in LICENSE.md, you are free to omit the copyright statement from the individual files, which can make for less work when updating the model.)
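If the same copyright statement has to be updated in several files, a small helper can keep the edits consistent. The sketch below is not part of Keyman Developer. `extendCopyrightYear` is a hypothetical name, and the function assumes the statement contains a single four-digit year or a year range, as in the examples in (2).

```typescript
// Hypothetical helper: extends the first year or year range found in a
// copyright statement so that it ends at currentYear.
function extendCopyrightYear(statement: string, currentYear: number): string {
  // Matches "2020" or "2020-2021" (hyphen or en dash between the years).
  const yearOrRange = /\b(\d{4})(?:\s*[-–]\s*(\d{4}))?\b/;
  return statement.replace(
    yearOrRange,
    (match: string, first: string, last: string | undefined): string => {
      const end = last !== undefined ? Number(last) : Number(first);
      // Leave the statement alone if it already covers the current year.
      if (currentYear <= end) {
        return match;
      }
      return `${first}-${currentYear}`;
    }
  );
}

// Example from point (2):
console.log(extendCopyrightYear('(c) 2020 Acme, Inc.', 2021));
```

Only the first year or range in the string is touched, so any other four-digit number later in the statement is left unchanged.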

DavidLRowe commented 3 years ago

@mcdurdin In reference to point (5) above, are these two files even needed at all? Would it be better to tell people to just delete these files (rather than fix up the copyright date), especially if they haven't made any changes?

mcdurdin commented 3 years ago

@DavidLRowe, I would love to revisit the whole "update the date" copyright discussion. Many large organisations (e.g. Google) no longer use the date in copyright statements. Given copyright is life of author + 50-70 years (depending on country), it seems almost meaningless -- and even more so for open source materials.

Furthermore, copyright is automatic anyway -- with or without a statement. The statement is just a way of recording ownership, not a legal mechanism for establishing copyright.

All the keyboards have copyright even if they have no copyright statement.

So, for the purposes of the keyboards and models repositories, I wonder if we can reduce the busy work by putting the date in only one file, maybe README.md "© 2015-2021 Author" (as the most likely file to be updated for other reasons ... although HISTORY.md has a stronger claim here?), but in any other file where the developer wants to have a copyright statement, we can have just "© Author".

The only time this date would ever be relevant is in a court case, in which a declaration like this is hardly proof anyway. File dates and committer metadata as shown in the GitHub repository are surely more consequential.

mcdurdin commented 3 years ago

> In reference to point (5) above, are these two files even needed at all?

Currently we do use readme.htm (and welcome.htm? @darcywong00 @jahorton is that right?) for lexical models if they are installed manually rather than through the Keyman Cloud. I think welcome.htm is also accessible after installation, in the lexical model settings.

DavidLRowe commented 3 years ago

I moved the copyright discussion to a new issue in the keyboards repo.