openstenoproject / plover

Open source stenotype engine
http://opensteno.org/plover
GNU General Public License v2.0

Suggestions for New Orthography Rules #1529

Open tac-tics opened 2 years ago

tac-tics commented 2 years ago

I was doing an analysis on the main.json file. I noticed that there are many words which are included in the dictionary, but which don't strictly speaking need to be. Specifically, I noticed that there is a large class of words which, if you were to remove the entry and then type the same sequence of strokes, the remaining entries in the dictionary, together with the English orthography rules, would produce the same result.

For example, there are two entries:

    "TKPWROE": "grow",
    "TKPWROE/-G": "growing",

But if you were to remove the latter, the stroke sequence: TKPWROE/-G would still result in the word "growing".

I was working to remove these words from main.json. But I noticed that while this removed around 3000 entries, it left some 10,000 entries untouched. When I went to investigate, I found a number of opportunities to encode additional orthography rules into Plover.
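The redundancy check described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the analysis: `join_suffix` is a hypothetical stand-in for Plover's orthography handling (it only models dropping a silent "e"), and `suffixes` is an assumed mapping from suffix strokes to plain suffix text.

```python
def join_suffix(stem: str, suffix: str) -> str:
    """Hypothetical, heavily simplified stand-in for real orthography rules."""
    # Drop a trailing silent "e" before a vowel-initial suffix (like + ing -> liking).
    if stem.endswith("e") and suffix[0] in "aeiou":
        return stem[:-1] + suffix
    return stem + suffix


def redundant_entries(dictionary: dict, suffixes: dict) -> dict:
    """Return entries that the stem entry plus a suffix stroke would already produce."""
    redundant = {}
    for outline, word in dictionary.items():
        # Split off the last stroke: "TKPWROE/-G" -> ("TKPWROE", "/", "-G").
        stem_outline, sep, last_stroke = outline.rpartition("/")
        if not sep:
            continue  # single-stroke entries cannot be derived this way
        if stem_outline in dictionary and last_stroke in suffixes:
            derived = join_suffix(dictionary[stem_outline], suffixes[last_stroke])
            if derived == word:
                redundant[outline] = word
    return redundant


entries = {"TKPWROE": "grow", "TKPWROE/-G": "growing"}
print(redundant_entries(entries, {"-G": "ing"}))  # {'TKPWROE/-G': 'growing'}
```

A real check would also have to account for how Plover picks the longest matching outline, so an entry flagged here is only a candidate for removal.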

I will present a few of these rules with examples:

TOR
    (STPHAT) senate + (TOR) tor = senatetor != senator
    (STRUBGT) instruct + (TOR) tor = instructtor != instructor
    (SKWREPB/RAEUT) generate + (TOR) tor = generatetor != generator
    (SKAOUT) execute + (TOR) tor = executetor != executor
    (SKUPLT) sculpt + (TOR) tor = sculpttor != sculptor

KAL
    (STPAOER) sphere + (KAL) cal = spherecal != spherical
    (STA/TEUS/TEUBG) statistic + (KAL) cal = statisticcal != statistical
    (SPHET/REUBG) symmetric + (KAL) cal = symmetriccal != symmetrical
    (SAOEUBG/HREUBG) cyclic + (KAL) cal = cycliccal != cyclical
    (SURPBLG) surge + (KAL) cal = surgecal != surgical
    (TKPWRAF) graph + (KAL) cal = graphcal != graphical

TER
    (SOFT) soft + (TER) ter = softter != softer
    (KPHAOUT) commute + (TER) ter = commuteter != commuter
    (THERPLT) thermometer + (TER) ter = thermometerter != thermometer

-L
    (STPHUG) snug + (-L) le = snugle != snuggle
    (STUB) stub + (-L) le = stuble != stubble
    (SKWREUG) jig + (-L) le = jigle != jiggle
    (SKWRUPBG) junk + (-L) le = junkle != jungle
    (SKWRUG) jug + (-L) le = jugle != juggle
    (SKWAUB) squab + (-L) le = squable != squabble
    (SKWAB) squab + (-L) le = squable != squabble
    (SKRAP) scrap + (-L) le = scraple != scrapple
    (SPHUG) smug + (-L) le = smugle != smuggle
    (SPEUT) spit + (-L) le = spitle != spittle
    (SHUT) shut + (-L) le = shutle != shuttle

-LG
    (STPHUG) snug + (-LG) ling = snugling != snuggling
    (SKWROG) jog + (-LG) ling = jogling != joggling
    (SKWREUG) jig + (-LG) ling = jigling != jiggling
    (SKWRUG) jug + (-LG) ling = jugling != juggling
    (SKWAUB) squab + (-LG) ling = squabling != squabbling
    (SPWAPBG) shebang + (-LG) ling = shebangling != entangling
    (SPHUG) smug + (-LG) ling = smugling != smuggling
    (SHUT) shut + (-LG) ling = shutling != shuttling

TAL
    (STKEPB) accident + (TAL) tal = accidenttal != accidental
    (SKWRUPLT) judgment + (TAL) tal = judgmenttal != judgmental
    (SR*ES) vest + (TAL) tal = vesttal != vestal
    (TKEPBT) dent + (TAL) tal = denttal != dental

PEU
    (STUFRP) stump + (PEU) py = stumppy != stumpy
    (SKWRUFRP) jump + (PEU) py = jumppy != jumpy
    (SKEUFRP) skimp + (PEU) py = skimppy != skimpy
    (SWAFRP) swamp + (PEU) py = swamppy != swampy
    (KRAOEP) creep + (PEU) py = creeppy != creepy
    (WAOEP) weep + (PEU) py = weeppy != weepy
    (PHOEP) mope + (PEU) py = mopepy != mopey

Each suffix in this list appears to allow for the removal of a little less than a hundred words from the dictionary.
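To make the proposal concrete, here is a rough sketch of how a few of these rules might be written. It assumes rules are (pattern, replacement) regex pairs matched against the string `stem ^ suffix`; the patterns themselves are hypothetical, cover only the TOR/TER/TAL, KAL, and PEU groups, and have not been checked against the full dictionary.

```python
import re

# Hypothetical rules for illustration only; each is matched against "stem ^ suffix".
PROPOSED_RULES = [
    # instruct + tor -> instructor, soft + ter -> softer, accident + tal -> accidental
    (r"^(.+[^aeiou]t) \^ t(or|er|al)$", r"\1\2"),
    # senate + tor -> senator, commute + ter -> commuter
    (r"^(.+t)e \^ t(or|er)$", r"\1\2"),
    # sphere + cal -> spherical, surge + cal -> surgical
    (r"^(.+[^aeiou])e \^ cal$", r"\1ical"),
    # statistic + cal -> statistical
    (r"^(.+ic) \^ cal$", r"\1al"),
    # graph + cal -> graphical
    (r"^(.+ph) \^ cal$", r"\1ical"),
    # stump + py -> stumpy, creep + py -> creepy (but snap + py stays snappy)
    (r"^(.+(?:[^aeiou]|[aeiou]{2})p) \^ py$", r"\1y"),
    # mope + py -> mopey
    (r"^(.+p)e \^ py$", r"\1ey"),
]


def join(stem: str, suffix: str) -> str:
    """Apply the first matching rule, else fall back to plain concatenation."""
    text = f"{stem} ^ {suffix}"
    for pattern, replacement in PROPOSED_RULES:
        match = re.match(pattern, text)
        if match:
            return match.expand(replacement)
    return stem + suffix


print(join("senate", "tor"))     # senator
print(join("statistic", "cal"))  # statistical
print(join("snap", "py"))        # snappy (no rule matches)
```

The patterns are deliberately narrow so that stems which really do double the consonant (fat + ter, snap + py) fall through to concatenation. Whether they are narrow enough for the whole dictionary is an open question.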

user202729 commented 2 years ago

Interesting, but probably not of much practical interest. (Anyone want to try fitting Plover into the Georgi?)

Computers have enough memory for main.json.

Side note: having the entries explicitly in main.json facilitates reverse word lookup. (Nevertheless, it's not very consistent, so users still need to look out a bit there.)
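As a rough illustration of why explicit entries help here, a reverse lookup can be sketched as an index from each translation back to its outlines. This is not Plover's actual lookup code; the function name and the sorting heuristic (fewest strokes first) are assumptions, and the single-stroke outline in the usage example is made up.

```python
from collections import defaultdict


def reverse_index(dictionary: dict) -> dict:
    """Map each translation to the outlines that produce it, shortest first."""
    index = defaultdict(list)
    for outline, translation in dictionary.items():
        index[translation].append(outline)
    for outlines in index.values():
        # Fewer strokes first, then fewer characters (an assumed heuristic).
        outlines.sort(key=lambda o: (o.count("/"), len(o)))
    return dict(index)


entries = {
    "TKPWROE": "grow",
    "TKPWROE/-G": "growing",
    "TKPWROEG": "growing",  # made-up single-stroke outline for illustration
}
print(reverse_index(entries)["growing"])  # ['TKPWROEG', 'TKPWROE/-G']
```

A word reachable only through a stem plus suffix rule would not appear in such an index unless the lookup also ran the orthography rules in reverse.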

AlexandraAlter commented 2 years ago

Ooh, I'd definitely be interested in helping make main.json a bit more maintainable. I've been scouring through it for a while trying to find other inconsistencies.

JoshuaGrams commented 2 years ago

I feel like I'd just use the appropriate suffixes instead of doubling the consonants in most of these cases? Use -or (O*R), -er (E*R), and -al (A*L) instead of -tor, -ter, and -tal, -ical (K*L) instead of -cal, -y (KWREU) instead of -py

tac-tics commented 2 years ago

> Computers have enough memory for main.json.

In my view, memory isn't the only reason to keep a trim standard dictionary. My original impetus was that I'm learning steno, and seeing 6+ candidates for a word is too much. It's hard to pick out which is the "canonical spelling" for the words I'm learning.

There are many entries which appear to be there to compensate for typos. While I'm sure these entries provide value to the original author, they detract from the beginner's experience.

I say this, of course, with the deepest respect for the people who have contributed (either directly or indirectly) to the creation of this base dictionary.

> I feel like I'd just use the appropriate suffixes instead of doubling the consonants in most of these cases? Use -or (O*R), -er (E*R), and -al (A*L) instead of -tor, -ter, and -tal, -ical (K*L) instead of -cal, -y (KWREU) instead of -py

I agree that, in some sense, these aren't "English" orthography rules. But they might still be useful if "contract + tor" is an accepted input for "contractor", etc.

tac-tics commented 2 years ago

> I've been scouring through it for a while trying to find other inconsistencies.

I have a set of notes I created the other day which outlines a bunch of things I'd love to see improved about the base dictionary. These are my opinions, so just bear that in mind and disagree as much as you like 😄. Here are the take-aways:

If I were to summarize my aims (again, these are my personal aims, and others are free to feel differently or in exact opposition):

AlexandraAlter commented 2 years ago

The main problem with removing anything is that it might be a pretty significant and disruptive change for people who're used to all of the default phrases, shortcuts, etc. I've been trying to avoid removing anything even in my personal work, but, I've also been splitting the dictionary out into punctuation, fingers, numbers, names, phrases, until the 'base' dictionary is mostly verbs, adverbs, and common nouns.

There are other projects that have done similar. https://github.com/didoesdigital/steno-dictionaries is an excellent collection of just that kind, though, even that has a lot of oddities, since it's incredibly hard to untangle all of the mis-stroke entries from the canonical strokes they're there to patch.

It's a difficult one! The main.json is always going to be a starting point, and not a comprehensive one-size-fits-all solution. My curiosity is whether there's anything that can be done to keep the full dictionary intact and powerful, while making it more internally consistent. It's possible that a simplified learning dictionary would be a good thing to make in a separate repo.

user202729 commented 2 years ago

Yes, there have been several projects for categorizing main.json. Kaoffie's project is the latest one, I think? (Not sure how much that one progressed.)