bug: khmer model custom wordbreaker issues

keymanapp / lexical-models

Lexical language models for predictive text

MIT License

13 stars 38 forks source link

bug: khmer model custom wordbreaker issues #230

Open MakaraSok opened 1 year ago

MakaraSok commented 1 year ago

Describe the bug

The crash happened after this activity was done. See the crash in action:

https://github.com/keymanapp/keyman/assets/28331388/9bd73f47-966c-4177-991c-e733ff0a0c71

Reproduce the bug

No response

Expected behavior

No response

Related issues

No response

Keyman apps

[ ] Keyman for Android
[X] Keyman for iPhone and iPad
[ ] Keyman for Linux
[ ] Keyman for macOS
[ ] Keyman for Windows
[ ] Keyman Developer
[ ] KeymanWeb
[ ] Other - give details at bottom of form

Keyman version

17.0.104-alpha

Operating system

iOS 16.4

Device

iPhone Pro Max Simulator

Target application

No response

Browser

No response

Keyboard name

sil_jarai

Keyboard version

1.0

Language name

Jarai

Additional context

https://keyman.com/keyboards/sil_jarai?bcp47=jra-khmr

jahorton commented 11 months ago

This is essentially the same issue seen at keymanapp/keyman#6900, but conflated with issues that arise when handling Khmer script.

Relevant codeblock from the corresponding lexical model:

https://github.com/keymanapp/lexical-models/blob/0491eb0a136c1ddfeaac00c376f862d72cea7a13/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts#L12-L21

  wordBreaker: function(str: string) {
    return str.split(/\s/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

For starters, note that this "wordbreaker" was always intended to be something of a stand-in until we develop a better way to handle cases with scripts that don't normally do wordbreaking. (The majority language for the script is Khmer, which doesn't... even if Jarai itself does.)

Furthermore, this wordbreaker is not aware of any implicit meaning behind any punctuation marks in the script - it only breaks on spaces and nothing else. Thus, the guillemets (the double angle-brackets acting as quotation marks) are considered the same as letters and thus part of the same word.

Refer to the video associated with keymanapp/keyman#6900:

https://user-images.githubusercontent.com/28331388/176589116-07c5b8b2-39bb-42a1-8455-ff3f7a9d935f.mov

The guillemets are replaced because, as far as the system knows, they are part of the word, not separate. This, in turn, naturally has a strong knock-on effect of making predictions a lot more difficult. No Khmer word actually starts with a left-guillemet («), after all.

With my current attempts at reproducing it, the engine actually does recover on the first post-guillemet keystroke most of the time. Selecting such a suggestion also erases the guillemet due to the details noted above re: the model's wordbreaker. It also recovers instantly when starting a new word. Thus, it's not "crashing" - just "failing to find any suggestions."

Finally, note that the predictive-text engine will only allow so much corrections before it stops looking. Having to outright delete the « in order to make good suggestions for the text after it is quite costly, and that doesn't reset within the word at present. So, even when "working", corrections will seem markedly more limited in this context.

jahorton commented 11 months ago

Looking back through related issue and PR history, this thread seems particularly relevant: https://github.com/keymanapp/keyman/pull/6574/files#r861500917

If we did allow character-class overrides, that'd provide a way to avoid writing a complex custom wordbreaker. But, for now, perhaps I should just tweak this hacky would-be wordbreaker to hack off the « from the actual word.

jahorton commented 11 months ago

Here's my first-pass prototype at resolving this.

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);

    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }

      // Opening quotes should be considered a separate token from the word they're next to.
      const punctuation = '«';
      let splitPoint = token.indexOf(punctuation);
      if(splitPoint > -1) {
        const left = token.substring(0, splitPoint);  // (0, -1) => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

        if(left) {
          tokens.splice(i++, 0, left);
        }
        tokens.splice(i++, 1, punctuation);
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--; 
      }
    }

    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

If there are other punctuation marks worth splitting off, I can extend it further, though there will be a bit of extra complexity needed: marks with earlier indices within a token should be processed before later indices for that same token. A bit of an edge case, to be sure, but it could matter at some point.

This suggestion has been tested locally with punctuation = ' and the string The quick brown 'fox' jumped over the lazy dog. 'qu'ot'at'i'o'n'. (The mangled 'qu'ot'at'i'o'n' was there to stress-test things.)

'fox' => ', fox, '
'qu'ot'at'i'o'n' => ', qu, ', ot, ', at, ', i, ', o, ', n, '

full output:

[
'The', 'quick', 'brown',  "'",
'fox', "'",     'jumped', 'over',
'the', 'lazy',  'dog.',   '',
"'",   'qu',    "'",      'ot',
"'",   'at',    "'",      'i',
"'",   'o',     "'",      'n',
"'"
]

Note that dog. remains because this code isn't checking for . - just '.

jahorton commented 10 months ago

Enhancing this to allow splitting off multiple punctuation marks, rather than just one...

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);

    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }

      // Certain punctuation marks should be considered a separate token from the word they're next to.
      const punctuationMarks = ['«', '»' /* add extras here */];
      const punctSplitIndices = [];
      // Find if and where each mark exists within the token
      for(let i = 0; i < punctuationMarks.length; i++) {
        const split = token.indexOf(punctuationMarks[i]);
        if(split >= 0) {
          punctSplitIndices.push(splilt);
        }
      }
      // Sort and pick the earliest mark's location.  If none exists, use -1.
      punctSplitIndices.sort();
      const splitPoint = punctSplitIndices[0] || -1;

      if(splitPoint > -1) {
        const left = token.substring(0, splitPoint);  // (0, -1) => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

        if(left) {
          tokens.splice(i++, 0, left);
        }
        tokens.splice(i++, 1, punctuation);
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--; 
      }
    }

    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });

As a reminder, this is a custom wordbreaker used within lexical-model projects. Anywhere you've used this one:

https://github.com/keymanapp/lexical-models/blob/0491eb0a136c1ddfeaac00c376f862d72cea7a13/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts#L12-L21

This new one is an enhancement of that, allowing you to also split off whatever specific punctuation marks you define within the array saying to /* add extras here */.

Meng-Heng commented 10 months ago

Testing the topic issue:

«ផ any letter behind any symbols seems to show 2 separated lists of predicted words. We found that there is a tiny chance (not always) for the word prediction to not show up but it is intermittent. [Fixed with the code]
«ផ» any letter type after » will not have predicted words (when erasing a letter after », the predicted words seem to be associated with ផ -> the letter inside the brackets). [A separate issue]
###$%វស continuous (2 or more) symbols will also produce the issue of this topic. [A separate issue]

The issues above were produced on:

The Keyman keyboard in Web browser through Keyman Developer v 16.0.144
Keyman app for iOS v 16.0.143

Result

When applying this code to the .ts file, it does not solve the issue. Furthermore, It would not show any predicted words after typing a symbol (erasing the symbol or giving one space will show the suggested words). Why did the code not work?
However, this code work on the topic' issue. (Solve the issue number 1)

Please let me know what else can I do to clarify anything.

F.Y.I There were no changes to the two codes when implemented in the .ts file.

jahorton commented 10 months ago

After doing a little debugging on the newer one... which was admittedly not directly tested first...

wordBreaker: function (str) {
  const tokens = str.split(/\s/);

  for(let i=0; i < tokens.length; i++) {
    const token = tokens[i];
    if(token.length == 1) {
      continue;
    }

    // Certain punctuation marks should be considered a separate token from the word they're next to.
    const punctuationMarks = ['«', '»', /* add extras here */];
    const punctSplitIndices = [];

    // Find if and where each mark exists within the token
    for(let i = 0; i < punctuationMarks.length; i++) {
      const split = token.indexOf(punctuationMarks[i]);
      if(split >= 0) {
        punctSplitIndices.push(split);
      }
    }

    // Sort and pick the earliest mark's location.  If none exists, use -1.
    punctSplitIndices.sort();
    const splitPoint = punctSplitIndices[0] === undefined ? -1 : punctSplitIndices[0];

    if(splitPoint > -1) {
      const left = token.substring(0, splitPoint);  // (0, -1) => ''
      const punct = token.substring(splitPoint, splitPoint+1);
      const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

      if(left) {
        tokens.splice(i++, 0, left);
      }
      tokens.splice(i++, 1, punct);
      if(right) {
        tokens.splice(i, 0, right);
      }
      // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
      // there was a `right` portion, as it may have extra marks that also need to be spun off.
      i--;
    }
  }

I've verified that this works correctly in isolation.

DavidLRowe commented 10 months ago

@jahorton What action is needed? Does the lexical model need revision to incorporate this revised word break code?

jahorton commented 10 months ago

It will work more in-line with user expectations. I'm trying to run it by our Khmer speakers first; Makara already mentioned some other punctuation and punctuation-like marks worth considering that I wouldn't otherwise be aware of, and there may be a few more that didn't come up in that brainstorm.

Meng-Heng commented 10 months ago

@jahorton The latest code is working to isolate the symbols or punctuation marks but the return tokens.map is missing (not predicting anything) so I added them and now it works!

Here is the full code:

      wordBreaker: function (str) {
  const tokens = str.split(/\s/);

  for(let i=0; i < tokens.length; i++) {
    const token = tokens[i];
    if(token.length == 1) {
      continue;
    }

    // Certain punctuation marks should be considered a separate token from the word they're next to.
    const punctuationMarks = ['«', '»', '$', '#' /* add extras here */];
    const punctSplitIndices = [];

    // Find if and where each mark exists within the token
    for(let i = 0; i < punctuationMarks.length; i++) {
      const split = token.indexOf(punctuationMarks[i]);
      if(split >= 0) {
        punctSplitIndices.push(split);
      }
    }

    // Sort and pick the earliest mark's location.  If none exists, use -1.
    punctSplitIndices.sort();
    const splitPoint = punctSplitIndices[0] === undefined ? -1 : punctSplitIndices[0];

    if(splitPoint > -1) {
      const left = token.substring(0, splitPoint);  // (0, -1) => ''
      const punct = token.substring(splitPoint, splitPoint+1);
      const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

      if(left) {
        tokens.splice(i++, 0, left);
      }
      tokens.splice(i++, 1, punct);
      if(right) {
        tokens.splice(i, 0, right);
      }
      // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
      // there was a `right` portion, as it may have extra marks that also need to be spun off.
      i--;
    }
   }
   return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
}

MakaraSok commented 10 months ago

It will work more in-line with user expectations. I'm trying to run it by our Khmer speakers first; Makara already mentioned some other punctuation and punctuation-like marks worth considering that I wouldn't otherwise be aware of, and there may be a few more that didn't come up in that brainstorm.

It looks like there are a whole lot of punctuations needed to be included in the array, i.e. #, @, (, ), [, ], {, }, `, ', ", <, > etc. The list seems endless, is there away to ensure that all punctuations are not to be affected when a word suggested is picked. For the time being, when you do, the word will replace what's being typed and the punctuation before it.

Just in case, see the behavior in the next comment for clarify.

Meng-Heng commented 10 months ago

https://youtu.be/TcQYFVeFTbo

DavidLRowe commented 9 months ago

Would it be possible to define the set of characters that are valid in a word (rather than the punctuation characters which might appear adjacent to a word), then split the token into strings of non-characters and characters?

Nnyny commented 9 months ago

@Meng-Heng, I found an issue with the above code. It create spacing after each words which isn't practical in writing Khmer. Added |\u200b into str.split fix the issue.

ref: [sil.km.ggc] remove space after words #241

Here's the code:

  wordBreaker: function (str) {
  const tokens = str.split(/\s|\u200b/);

  for(let i=0; i < tokens.length; i++) {
    const token = tokens[i];
    if(token.length == 1) {
      continue;
    }

    // Certain punctuation marks should be considered a separate token from the word they're next to.
    const punctuationMarks = ['«', '»', '$', '#' /* add extras here */];
    const punctSplitIndices = [];

    // Find if and where each mark exists within the token
    for(let i = 0; i < punctuationMarks.length; i++) {
      const split = token.indexOf(punctuationMarks[i]);
      if(split >= 0) {
        punctSplitIndices.push(split);
      }
    }

    // Sort and pick the earliest mark's location.  If none exists, use -1.
    punctSplitIndices.sort();
    const splitPoint = punctSplitIndices[0] === undefined ? -1 : punctSplitIndices[0];

    if(splitPoint > -1) {
      const left = token.substring(0, splitPoint);  // (0, -1) => ''
      const punct = token.substring(splitPoint, splitPoint+1);
      const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

      if(left) {
        tokens.splice(i++, 0, left);
      }
      tokens.splice(i++, 1, punct);
      if(right) {
        tokens.splice(i, 0, right);
      }
      // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
      // there was a `right` portion, as it may have extra marks that also need to be spun off.
      i--;
    }
   }
   return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
}