opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Handling contractions, regexes and related issues #268

Open ampli opened 8 years ago

ampli commented 8 years ago

I know this is too long... At least it is an attempt to document the current situation and the proposed solutions.

The old LG library handled words with contractions by stripping the contracted part. I didn't check, but it seemed it could even strip double contractions (which were not supported by the dict then, and still are not).

In an early revision of the tokenize code I thought it would be a better idea to use the suffix-split code (which until then had been used only for Russian stem-suffix splitting) for that purpose. Since this code supports only one suffix, it could not strip double contractions, which did not matter given the said lack of dict support.

Using the suffix-split function for contraction separation created a problem with applying regexes to words with contractions. Here is more background on the relevant aspects of the current tokenizer (these aspects are the same as in version 4 of LG).

When a word is not in the dictionary, the tokenizer checks whether it matches a regex (from the file 4.0.regex). This is a kind of guess that classifies unknown words. (If such an unknown word doesn't match any regex, a speller is optionally used to split it into two or more words, or to replace it with similarly spelled words.) Note that capitalized words that match the CAPITALIZED-WORDS or PL-CAPITALIZED-WORDS regexes don't really benefit from this regex classifying facility, because only the first regex match is used (and they get no opportunity for spell correction either, precisely because they match a regex).
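The first-match-only behavior can be sketched as follows. This is an illustration, not the library's C code; the regex list here is a tiny hypothetical subset of 4.0.regex (the real CAPITALIZED-WORDS pattern is more involved):

```python
import re

# Hypothetical subset of 4.0.regex, in file order. Only the FIRST
# matching entry classifies the word, so CAPITALIZED-WORDS shadows
# any later pattern a capitalized word might also have matched.
REGEX_CLASSES = [
    ("CAPITALIZED-WORDS", re.compile(r"^[A-Z][a-z]+$")),   # simplified stand-in
    ("DECADE-DATE", re.compile(r"^([1-4][0-9][0-9]|[1-9][0-9])0(s|'s|’s)$")),
]

def classify_unknown_word(word):
    """Return the name of the first matching regex class, or None."""
    for name, pattern in REGEX_CLASSES:
        if pattern.match(word):
            return name
    return None
```

Because the scan stops at the first match, a capitalized word never reaches any later, possibly more specific, pattern.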

A word that can be split into stem and suffix is considered to be "in the dictionary", and hence no regex match is attempted on it. This is problematic by itself, because a particular stem and suffix may have no dictionary links defined between them, so a regex match or even a spell correction could be useful. In languages like Hebrew (multi-prefix / multi-suffix) and maybe Turkish (multi-suffix) this is even more problematic.

Example: A Hebrew word ABCD that gets split to A= B= CD (where = denotes a "prefix subword"), when CD is a known word, may end up null-linked because it should really have been split to A= BCD, where BCD is an unknown word (which could be resolved fine by the UNKNOWN-WORD device of the dict if given the opportunity). The problem with allowing splits containing unknown words by default, when there is already at least one split with a known word, is that if a split with a known word is correct, the parses with the unknown words are redundant. This could be fixed by postprocessing, which could be used for other things too, like the last part of handling phantom words. BTW, in general it is not possible to validate a word's split by running a "mini-LG" on the word alone, because its morphemes may have links to other words (in Russian it is possible, since the LL link is local to the word).
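The alternative-splits problem can be modeled abstractly. The prefix set and dictionary below are hypothetical stand-ins (the real splitter consults the affix and dict files), but they reproduce the ABCD scenario described above:

```python
# Toy model of multi-prefix splitting: enumerate ways to peel known
# prefixes (marked "X=") off a word, keeping the remainder as the base.
PREFIXES = {"A", "B"}
DICT = {"CD"}           # "BCD" is deliberately NOT in the dict

def splits(word, peeled=()):
    """Yield (prefix-subwords..., base) alternatives."""
    yield peeled + (word,)                      # stop peeling here
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            yield from splits(word[len(p):], peeled + (p + "=",))

alts = list(splits("ABCD"))
# Under the current logic, only splits whose base is a known word
# survive, so ("A=", "B=", "CD") wins and ("A=", "BCD") never gets
# a chance at the UNKNOWN-WORD device.
known = [a for a in alts if a[-1] in DICT]
```

Here `alts` contains both `("A=", "B=", "CD")` and `("A=", "BCD")`, but `known` keeps only the first, illustrating how the correct-but-unknown alternative is discarded.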

Returning to words with contractions: as said above, they now use the suffix-split function. But even when they can be split (unlike the stem-suffix case, where a successful split suppresses the regex lookup), they still need a regex lookup, due to the DECADE-DATE regex. (Only this one regex.)

DECADE-DATE: /^([1-4][0-9][0-9]|[1-9][0-9])0(s|'s|’s)$/
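For reference, this is how the pattern behaves on a few inputs (a quick check using Python's re, which interprets this particular pattern the same way):

```python
import re

# The DECADE-DATE pattern quoted above, verbatim.
DECADE_DATE = re.compile(r"^([1-4][0-9][0-9]|[1-9][0-9])0(s|'s|’s)$")

# Decades (100s .. 4990s) with a plain, ASCII-apostrophe, or
# typographic-apostrophe suffix all match:
for w in ("1960s", "1960's", "890s", "100s"):
    assert DECADE_DATE.match(w)

# Non-decades fail: too few digits, or no trailing zero.
for w in ("90s", "1965s", "5000s"):
    assert DECADE_DATE.match(w) is None
```

Note that the word must end in the regex itself with s / 's / ’s, which is exactly why a contraction-stripped word can no longer match it.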

So the current code checks whether a word includes a contraction (is_contraction_word()) in order to decide whether to make a regex check. (I have just fixed a bug in it and will send a pull request soon.)

Solutions:

  1. Move contraction splitting back to the punctuation strip. This would restore the double-contraction ability.
  2. Continue with the current is_contraction_word() check, since double contractions should be supported by the new (yet unpublished) multi prefix-suffix split function.
  3. Convert the DECADE-DATE regex to dictionary definitions. This is possible by changing it to
DECADE-NUMBER: /^([1-4][0-9][0-9]|[1-9][0-9])0/

and defining its linkage to 's in the dictionary.
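Solution 3 can be illustrated as follows: the tokenizer strips the 's (or s / ’s) as a contraction suffix, and the remaining number then matches DECADE-NUMBER. This is a sketch (split_decade_date is a hypothetical helper; the dictionary linkage itself is not shown):

```python
import re

# The proposed DECADE-NUMBER pattern from above.
DECADE_NUMBER = re.compile(r"^([1-4][0-9][0-9]|[1-9][0-9])0")

def split_decade_date(word):
    """Split e.g. "1960's" into ("1960", "'s") when the numeric part
    is a decade number; return None otherwise. Hypothetical helper."""
    for suffix in ("'s", "’s", "s"):
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if DECADE_NUMBER.fullmatch(base):
                return base, suffix
    return None
```

With this arrangement the regex only has to classify the bare number, so the special-case regex lookup for contraction words becomes unnecessary.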

The current is_contraction_word() usage is problematic in other respects too:

  1. It is English-centric. Other languages may use apostrophes for entirely different purposes.
  2. The quote marks are currently hard-coded. Moving them to the 4.0.affix file would also make it possible to disable is_contraction_word() (by not defining them).
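Point 2 could look roughly like this once the characters come from the affix file. This is a sketch only: the real function is C, and the affix-class name CONTRACTION-MARKS used here is invented:

```python
# Hypothetical affix-file-driven contraction check. CONTRACTION-MARKS
# is an invented affix-class name; when the class is absent from the
# affix file, the check is disabled, as suggested above.
def make_is_contraction_word(affix_classes):
    marks = affix_classes.get("CONTRACTION-MARKS")   # e.g. ["'", "’"]
    if not marks:
        return lambda word: False                    # feature disabled
    return lambda word: any(m in word for m in marks)

is_contraction_word = make_is_contraction_word({"CONTRACTION-MARKS": ["'", "’"]})
```

A language whose 4.0.affix defines no such class would then skip the contraction logic entirely, removing the English-centric assumption from the C code.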

All of that needs more work, but for now I will only send the said bug fix for is_contraction_word() and move on to finishing the multi-prefix-suffix strip support (which is itself intermediate code, pending the introduction of a more general tokenizer).

linas commented 8 years ago

Fixing DECADE-DATE is not that hard. It might need some variant of the YS connector, specially designed for decade numbers, or maybe a YN connector. Is this a blocker?

ampli commented 8 years ago

It is not a blocker, since it is possible to check that a word is a contraction and allow a "parallel regex" match on it even if it can be split, as the code does now.