opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

spell-guessing mis-handles capitalized words #404

Open linas opened 7 years ago

linas commented 7 years ago

Even with spell-guessing enabled, the sentence "Therefo, I declare I love Computer Science" is not fixed. However, the lower-case version works fine. I suggest that spell-guessing should down-case words before trying to correct them. (Tested with aspell; I did not try Hunspell.)

ampli commented 7 years ago

It doesn't even reach the spell-guess step.

By design (following the original implementation), the tokenizer's lookup order is as follows, and it stops at the first step that succeeds (from memory):

  1. Dictionary lookup of the whole word, as is and also downcased.
  2. Morpheme split (as is and also downcased).
  3. Regex lookup (capital words also match).
  4. Spell guess.

This means that a word that matches a regex never gets a spell correction.
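
In pseudo-C, that first-match-wins order looks roughly like this (the function names are illustrative stand-ins, not the actual tokenizer API):

    #include <stdbool.h>

    /* Illustrative only: these prototypes are stand-ins for the tokenizer
     * internals, not the real link-grammar API. */
    bool dict_lookup(const char *word);      /* step 1: whole-word lookup */
    bool morpheme_split(const char *word);   /* step 2: try to split      */
    bool regex_match(const char *word);      /* step 3: regex lookup      */
    void spell_guess(const char *word);      /* step 4: spell guess       */
    const char *downcased(const char *word);

    static void lookup_token(const char *word)
    {
        if (dict_lookup(word) || dict_lookup(downcased(word))) return;
        if (morpheme_split(word) || morpheme_split(downcased(word))) return;
        if (regex_match(word)) return;       /* capital words match too */
        spell_guess(word);                   /* never reached if any earlier
                                                step succeeded */
    }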

Similar problems happen with these sentences:

therfor, I declare ...
    therfor[!C-NOUN-WORDS].n , I.p declare.v ....y

elefants are big
    elefants[!S-WORDS].n are.v big.a

From tokenizer.c:

     * ??? Should we add spell guesses as alternatives in case:
     * 1. The word if not in the main dict but matches a regex.
     * 2. The word an unknown capitalized word.

If the answer to the above is "yes", I can implement that.

Also note that words that match a regex lose the opportunity to be handled by the unknown-word device! For handling such things we may need an additional kind of post-processing, plus maybe programmatically assigned costs.

linas commented 7 years ago

Hmm.

If a word is capitalized and unknown, we should spell-guess the lower-case version. If a word ends with an s and is unknown, we should spell-guess it.

In other cases, we should not do so?

The point is, I don't want to run roman numerals through the spell-guesser, since I already think I know what they are.

Maybe in some ideal world, we should allow the dictionary to contain something like

foo: BAr+ or <spell-guess>;

and even assign costs:

foo: BAr+ or [<spell-guess>]3.14159;

so that the BAr+ connector gets priority, else the spell-checked version... Something like that. This would allow common typos to be handled automatically: e.g.

there: ... or [[<spell-guess>]];

which would add their, they're as possibilities, but at a large cost.

ampli commented 7 years ago

If a word is capitalized and unknown, we should spell-guess the lower-case version. If a word ends with an s and is unknown, we should spell-guess it.

In other cases, we should not do so?

I think we should always spell-guess words that match a regex, but discard the linkages with the guesses if the guessed words end up unlinked. If the original word (the one that matched a regex) is linked, then the guesses should get a high cost, else a low cost. It seems this needs a new kind of post-processing or something; I'm not sure how it can be implemented.

foo: BAr+ or [<spell-guess>]3.14159;

If you write an exact specification, I can implement that.

there: ... or [[<spell-guess>]];

But this is not exactly a spell guess - I don't know of a spell guesser that gives guesses for correctly spelled words.

linas commented 7 years ago

Post-processing needs to be avoided at all costs. Post-processing is a fail.

I cannot write an exact spec just right now. My train of thought, though, is that everything we know about a word should be encodable in the dictionary -- I don't want to invent some new extra-dictionary format. So the question becomes: how can I tell the tokenizer that capitalized words should be spell-guessed? The only obvious answer, for now, is to invent a new connector, say SPELL+, and then have the tokenizer use some code such as:

if (word_has_spell_connector(word)) {
    /* ... generate spell alternatives ... */
}

However, having costs is a problem: the Gword_struct does not have a cost-field in it, and that would need to be added. Then, during parsing, if this particular alternative is used, the cost of the alternative would have to be added to all of the connectors on it.

There's an unrelated, nice reason for having costs on alternatives: some spelling alternatives could be given lower costs than others. That way, if we also had a list of word frequencies, we could make the cost be -log(frequency), so that among the alternatives for "teh", "the" would have a low cost, "tea" a higher cost, and "Tet" a very high cost.
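
A tiny self-contained illustration of that cost assignment (the frequencies are invented for the example):

    #include <math.h>
    #include <stdio.h>

    /* Sketch: derive alternative costs from word frequencies, so the more
     * frequent a correction is, the lower its cost.  Frequencies made up. */
    int main(void)
    {
        const char *alts[] = { "the", "tea", "Tet" };     /* guesses for "teh" */
        const double freq[] = { 0.05, 0.001, 0.00001 };   /* invented numbers  */

        for (int i = 0; i < 3; i++)
            printf("%-3s  cost = %5.2f\n", alts[i], -log(freq[i]));
        /* the  cost =  3.00
           tea  cost =  6.91
           Tet  cost = 11.51 */
        return 0;
    }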

linas commented 7 years ago

For alternatives to correctly-spelled words ... not sure. We need to add some sort of alternatives file.

I don't see how to jam alternatives into the current dictionary format. So maybe we do need some kind of new file format, something that would contain entries like:

there: they're their;
CAPITALIZED-WORDS: <spell-guess>;
all: all_of;

The last one is interesting, as it suggests that "all_of" is a valid alternative to "all", for example: "I ate all (of) the cookies", which would solve the zero-word/phantom-word problem from issue #224.

linas commented 7 years ago

One super-hacky way of indicating alternatives is to use word-subscripts. So, for example:

yisser.your: A+ & B-;

means that "yisser" is an acceptable Irish word, and it really means "your" . If we used a hash-mark instead of a period, it could mean "alternative"

there#their: [FOO+];

means that if you see "there" during tokenization, then add "their" as an alternative, so that, after parsing, "It is hot there" would be printed instead of "It is hot their."

This is kind-of fun, because it can be used as a crude language-translation mechanism: "yisser#your" tells you how to translate the strange Irish word "yisser" into standard English.

ampli commented 7 years ago

Post-processing needs to be avoided at all costs. Post-processing is a fail.

But I think a kind of post-processing of sentences is a natural thing people do when they read text, especially problematic text (e.g. with errors).

I cannot write an exact spec just right now. My train of thought, though, is that everything we know about a word should be encodable in the dictionary -- I don't want to invent some new extra-dictionary format.

If you would like to encode more info, e.g. for Unification Link Grammar or context-sensitive parsing (which we have never discussed), it seems some different format is needed somewhere. To avoid introducing too many changes, I once suggested a two-step lookup, in which you first look up the token in a special dictionary and consult the link-grammar dictionary only to find the disjuncts.

Then, during parsing, if this particular alternative is used, the cost of the alternative would have to be added to all of the connectors on it.

In "during parsing", do you mean during performing do_count()/mk_parse_set()?

ampli commented 7 years ago

For alternatives to correctly-spelled words ... not sure. We need to add some sort of alternatives file. ...

there#their: [FOO+]; ...

I don't know if you considered my (tested) old suggestion (from the LG group). It uses the current dictionary format without any change.

For example, suppose we add these:

then.#than: than;
than.#then-i: then.i;
than.#then-r: then.r;

their.#there-r: there.r;
there.#their: their;

Then we get:

linkparser> this test is more costly then the previous one
...
LEFT-WALL this.d test.n is.v more costly.a then.#than the previous.a one 
linkparser> It is hot their
...
LEFT-WALL it is.v hot.a their.#there-r 

The sentence-checker can then mark the original sentence word "then" as having the correction "than". This can work for the "all_of" example too, if a bug in handling idioms (which I mentioned in said group post) is fixed. Note that this also works fine if a word has several possible corrections (as demonstrated by my _P_ demo for prepositions).

The above method can be used even now. The "post processing" (which in this case means displaying which word can be replaced by which words) can be done by the application that uses the library. (Maybe the library can eventually include a higher-level part - maybe even implemented as a separate library, for doing such things.)
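
As an illustration, that application-side check could be as small as scanning the linkage words for the ".#" marker (a sketch only; subscript tails such as the "-r" in "their.#there-r" would need extra trimming):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: report a suggested correction encoded in the subscript,
     * e.g. "then.#than" -> original "then", suggestion "than". */
    static void report_correction(const char *linkage_word)
    {
        const char *mark = strstr(linkage_word, ".#");
        if (mark == NULL) return;          /* ordinary word, no correction */

        printf("\"%.*s\" -> suggest \"%s\"\n",
               (int)(mark - linkage_word), linkage_word, mark + 2);
    }

    int main(void)
    {
        report_correction("then.#than");   /* prints: "then" -> suggest "than" */
        report_correction("costly.a");     /* prints nothing */
        return 0;
    }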

there#their: [FOO+]; means if you see "there" during tokenization, then add as an alternative "their" so that, after parsing "It is hot there" would be printed instead of "It is hot their."

This needs an additional mechanism beyond what you describe - when we add the word "their" as an alternative, we need to somehow indicate that it is not an original sentence word but an alternative. We can do that in several ways. In my suggestion above, it is encoded in the subscript. It could also be encoded in a [] guess-notation, like their[<there]. Another possibility is to add a field to the disjunct or Gword structures - we just need something that survives the parsing step.

Please indicate your preferred way to do that, and I will try to write a demo.

linas commented 7 years ago

Ah hah! Silly me, you are right!

This works today, without any changes at all to the C code:

than.#then: [then.r]0.3333;

which gives

linkparser> shall we go, than?
Found 2 linkages (2 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.33 LEN=8)

             +----I----+---MVp---+      
    +---Qd---+-SIp-+   |  +--Xd--+--Xc-+
    |        |     |   |  |      |     |
LEFT-WALL shall.v we go.v , than.#then ? 

Cool! I did not realize that!

The all_of example will not work this way, because it is supposed to be tokenized into two words. It does not mean that the idiom all_of appears in the dictionary. Let me think about this some more.

linas commented 7 years ago

Re: post-processing: Let me explain it this way: if there is some optional "post-processing" utility that maybe tags words with some additional info, or does some other light-weight processing, that is fine. What I do NOT want is some complex heavy-weight mechanism that interacts with parsing, or does complex filtering, like the current post-processor. Here's why:

The abstract theory of link-grammar is very similar to, almost identical to, the concept of a "pregroup grammar" (see Wikipedia), a "categorial grammar", a non-symmetric monoidal category (of tensor products), etc. The "of tensor products" part means that one can attach a cost system that is distributive in the same way that tensoring is -- you can pull out the cost as an overall factor. This concept is important, because it makes the costs look like a Markov model -- or like a hidden Markov model, where the links between words are what is "hidden", and the costs are the Markov chain weights. To summarize: this way of viewing the parsing process makes it fit very naturally into a generic theory of Markov networks, as well as into various other generic theories of grammar. So I very much want to preserve this naturalness (in the sense of the Wikipedia article on "naturalness" as a "natural transformation" between categories).
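
In symbols (my paraphrase of the above, not notation taken from the code): if each link or word-choice carries a cost c_i = -log w_i for some weight w_i, then the additive cost of a linkage L factors out as a product of Markov-style weights:

    \operatorname{cost}(L) \;=\; \sum_i c_i \;=\; -\sum_i \log w_i
    \qquad\Longleftrightarrow\qquad
    e^{-\operatorname{cost}(L)} \;=\; \prod_i w_i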

The current link-parser post-processor destroys naturalness. I don't want any other post-processors that destroy naturalness. Of course, it is possible to have a post-processor that is natural, so ... that would be OK.

linas commented 7 years ago

So: pull request #425 implements some basic typo support. Things that don't work (can't work without wordgraph support):

replacing there by they're, replacing all by all_of

ampli commented 7 years ago

There is a slight problem: it makes "bad sentences" parsable, with no way to switch this feature off (as is possible with spell guessing).

A way to load sub-dictionaries on demand could solve that, but it seems overkill to implement it only for this.

linas commented 7 years ago

Well, instead of sub-dictionaries, the problem would be solved by "dialect support" (#402): turn off the "bad speling" dialect, and then these rules no longer apply.

ampli commented 7 years ago

Maybe in some ideal world, we should allow the dictionary to contain something like foo: BAr+ or <spell-guess>;

I would like to fix the "idiom problem", and since that fix touches code in common with the implementation of the above, I would also like to implement your proposal for spell-guess dictionary markers. For efficiency reasons, I propose checking these markers only for "regex words".

linas commented 7 years ago

Huh. So the proposal is that there are some "special" words, such as <spell-guess>, which cause a C function to be called, and that C function takes ... foo, the word, as an argument, and returns one or more words? What does this have to do with the Russian "idiom problem", which seems to be due to a broken Russian dictionary? (i.e. its missing support for numbers?)

ampli commented 7 years ago

The Russian "idiom problem" is not so specific. A similar general problem exists for words which can only participate in idioms and have no current dictionary meaning otherwise (such as "fossil words"), or words which participate in idioms but their lookup -in the case they are not in the idiom - is to be done by regex (e.g "catch 22").

The current tokenizer algorithm (like the original one) doesn't look up a word by more than one method - it stops after the first lookup that succeeds. For example, supposing the dict doesn't already include the word '22', then "he found 22 dollars" couldn't be parsed in the presence of "catch_22" in the dict.

A possible fix for the problem mentioned above is to define a new function abridged_boolean_lookup() that doesn't look at idioms, and use it as the lookup function of the dict instead of boolean_lookup(). In this manner, the '22' in the example above would get a regex lookup, because it will not be found in the dict by abridged_boolean_lookup(), which is invoked by both boolean_dictionary_lookup() and find_word_in_dict(). As a result, at the end of the tokenizing step determine_word_expressions() will have to do two lookups for '22' - direct dict (for the idiom-related disjuncts) and regex. This process may not be ideal, but it may be a start.
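
A toy sketch of that abridged lookup (using a stand-in entry list rather than the real Dict_node structures): it succeeds only if the word has at least one non-idiom entry, so '22' is treated as not-in-dict and falls through to the regex lookup even when "catch_22" is present:

    #include <stdbool.h>
    #include <string.h>

    /* Toy stand-in for the real dictionary entry list. */
    typedef struct entry { const char *string; struct entry *next; } Entry;

    /* Sketch of abridged_boolean_lookup(): ignore idiom entries, whose
     * dictionary strings contain '_' (e.g. "catch_22"). */
    static bool abridged_boolean_lookup(const Entry *entries)
    {
        for (const Entry *e = entries; e != NULL; e = e->next)
        {
            if (strchr(e->string, '_') == NULL)
                return true;              /* a genuine, non-idiom entry */
        }
        return false;                     /* idiom-only (or no) entries */
    }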

In the case of a regex tag that has <spell-guess>, at the end of the tokenizing step determine_word_expressions() will have to do a regex lookup and also direct dict lookups (if there were spell suggestions).

I first thought there would be a need to change determine_word_expressions(), because it currently cannot do more than one lookup per word. This is the common code I thought needed rearranging. However, there is actually no need for that, because each lookup is to be added as an alternative.

I think all of this points to a much deeper problem. For example, in "this is a ttests" the word "ttests" will not be subject to spell suggestions because it matches the S-WORDS regex (so the result is "LEFT-WALL this.p is.v [a] ttests[!S-WORDS].n" instead of "LEFT-WALL this is a test[~]"). So ideally all the possible lookups should be done, not only the first one that succeeds. It is the job of higher layers (after the basic parsing) to find which parses are more appropriate. Additionally, a kind of "post processing" (which has nothing to do with the current "post processing") could filter out "failed tries" - instead of filtering them out a priori, as the first-fit token-lookup algorithm does now (I never understood how filtering a priori can be considered a better idea).

ampli commented 7 years ago

Similar examples are when a word is subject to a spell correction but none of the suggested words gets linked, or when a word matches a regex but the result is not linked (there are more examples like that). In such cases it is beneficial to use the unknown-word device, but the current algorithm doesn't use it if an earlier lookup succeeds.

linas commented 7 years ago

Ah! OK, yes, very good. Right, the a-priori, use-the-first-one-found approach is not sufficient. Yes, for any given input, we should try all three -- spell-guessing, regex, and standard lookup.

The cost system is that thing which is supposed to help eliminate "failed tries" -- the failed tries are the ones with the highest cost. The cost system has both a-priori costs and parse costs. The a-priori costs are that regex and spell should have higher costs than hits in the (abridged) main dictionary. The parse costs are that some word combinations are just unlikely. Both are supposed to work together to find the most-likely interpretation.
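
Schematically (my notation, not anything that exists in the code), the two kinds of cost simply add, and the preferred reading is the linkage with the lowest total:

    \operatorname{cost}(\text{linkage}) \;=\;
    \sum_{w \in \text{words}} c_{\text{a-priori}}(w)
    \;+\; \sum_{d \in \text{disjuncts}} c_{\text{parse}}(d),
    \qquad
    \text{best} \;=\; \arg\min_{\text{linkages}} \operatorname{cost}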

The "dialect" proposal is supposed to help fine-tune this, by allowing the relative weighting of the regex, spell, and other approaches to be adjusted in a simpler and more direct fashion.

ampli commented 7 years ago

As part of handling word location (issue #420), I wanted the code to account for the "i" capitalization of the TR and AZ UTF-8 locales (lowercasing a capital dotted "İ" changes the token length) without introducing unneeded overhead. The current code needs a slight restructuring to provide the needed info, and this also solves the problem in the subject of this issue (spell-guessing mis-handles capitalized words).

Currently, when the code doesn't succeed in splitting a token, it lowercases it and tries again. There is no way in the current code (without extreme hacks) to mark the word with the WS_FIRSTUPPER flag, something that is needed in order to later account for a possible token-length change without extra overhead.

So I removed said lowercasing. Instead, in another place I "issue" (to the wordgraph) the lowercase version of the token unconditionally (instead of only when it is in the dict), marking it with WS_FIRSTUPPER as needed by the word-position code.

Because the lowercase version of the token is now in the wordgraph, this fixes the said spelling problem.

However:

The point is, I don't want to run roman numerals through the spell-guesser, since I already think I know what they are.

This I have not implemented yet. Preventing any spell-guess on the lowercase token when there is a regex-guess on its original uppercase form is easy. But as you pointed out regarding Roman numerals, it may be a good idea to make this controllable.

Possible solutions:

  1. As you suggested: marking in the dict the regex tokens for which we would like to perform a spell-guess too. This allows adding a cost, but I don't know how to handle such a cost (no internal API for that).
  2. As a flag in the regex file. This has the benefit of better modularity. Cost can be added too. But this option adds extra syntax (which may be an obstacle).
  3. Always perform spell guessing for ~regex-guessed~ unknown lc'ed words with known uc versions, but programmatically give them a higher cost (again, no internal API for assigning cost to X-nodes). (EDITED)

Please tell me what to implement.

(In any case, it could be nice if the spell-guess cost could relate to the edit-distance of the fix, which we can compute.)

Examples for the current unrestricted guess-corrections:

This adds a likely guess (the 3rd one).

linkparser> II'm here
LEFT-WALL I[~].p 'm[~] here 

linkparser> II said it.
LEFT-WALL it[~] said.v-d it . 
LEFT-WALL II[!ROMAN-NUMERAL-WORDS].n said.v-d it . 
LEFT-WALL I[~].p said.v-d it .

This adds a really bad guess (maybe I.p would be more likely, but the current dict cannot handle it).

linkparser> II. Take the lid off.
LEFT-WALL II[!ROMAN-NUMERAL-WORDS].rn ..y take.v the  lid.n off . 
LEFT-WALL I[~].id ..y take.v the  lid.n off . 

In this case there is no spell-guess attempt at all:

linkparser> He said II said it
LEFT-WALL he said.v-d II[!ROMAN-NUMERAL-WORDS].n said.v-d it 

This is because currently a lowercase token is not generated for capitalized tokens that are not in capitalized positions.

Finally, a problem I found that we have not thought about: Nonsense guesses for all-capitalized tokens. For example:

linkparser> BA students 
LEFT-WALL BA[~].l students.n 
LEFT-WALL AA[~].l students.n 
LEFT-WALL SA[~].l students.n 
LEFT-WALL be[~].v students.n 

Note that BE is in the dict, so why is it corrected? Because it gets lowercased to "bE" (only the first letter is downcased). So, in order to prevent at least such nonsense corrections, I propose not to lowercase tokens with more than one capital letter.
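
A sketch of that check (ASCII-only for brevity; the real tokenizer deals with UTF-8 and locale-aware case mapping): issue the lowercase alternative only when the token has exactly one capital letter:

    #include <ctype.h>
    #include <stdbool.h>

    /* Sketch: lowercase alternatives only for tokens with exactly one
     * capital letter, so all-caps tokens such as "BA" do not become
     * spell-guess fodder.  ASCII-only; illustrative. */
    static bool should_issue_lowercase(const char *token)
    {
        int caps = 0;
        for (const char *p = token; *p != '\0'; p++)
            if (isupper((unsigned char)*p)) caps++;
        return caps == 1;
    }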

ampli commented 7 years ago

Spell-guesses have many other problems (that I have encountered), some of which we have never discussed. For example, the guessed words may end up with no linkage. I think such linkages can be filtered out as a kind of "bad" linkage.

ampli commented 7 years ago

Ah! OK, yes, very good. Right, the a-priori, use-the-first-one-found approach is not sufficient. Yes, for any given input, we should try all three -- spell-guessing, regex, and standard lookup.

For regex we already had such test code (the "parallel regex" feature), which I removed because it added a lot of false positives (maybe mainly false positives), especially for Russian. I can bring this code back.

Maybe we can add a parse option for "more aggressive guesses"?

The "dialect" proposal is supposed to help fine-tune this, by allowing the relative weighting of the regex, spell, and other approaches to be adjusted in a simpler and more direct fashion.

How? In particular, how can it work for spell-guesses?

ampli commented 7 years ago

For regex we already had such test code (the "parallel regex" feature), which I removed because it added a lot of false positives (maybe mainly false positives), especially for Russian. I can bring this code back.

It turns out the "parallel regex" feature is not the desired feature here, as it referred to making a regex guess on words that can also be split, in preparation for cases in which the split words have no linkage. But usually they do (in a good dictionary), so this just adds noise.

In any case, I already have a ready version that I can send as a PR.

I said:

Possible solutions:

  1. As you suggested: marking in the dict the regex tokens for which we would like to perform a spell-guess too. This allows adding a cost, but I don't know how to handle such a cost (no internal API for that).
  2. As a flag in the regex file. This has the benefit of better modularity. Cost can be added too. But this option adds extra syntax (which may be an obstacle).
  3. Always perform spell guessing for ~regex-guessed~ unknown lc'ed words with known uc versions, but programmatically give them a higher cost (again, no internal API for assigning cost to X-nodes). (EDITED)

I forgot to mention: 4) Never perform spell guessing for unknown lc'ed words with known uc versions (including versions known due to a regex guess).

This is what I implemented for now. Anything else we discussed can be implemented as an extension.

ampli commented 7 years ago

The concept of spell-guessing unknown capitalized words is still problematic, and the cost system doesn't usually help with that. Here is an example from corpus-voa.batch:

Current GitHub (3a11762be33cc2e511f8e319592788c6f5b038a9):

linkparser> Shel Silverstein once said that he wanted to go everywhere, look at and listen to everything.
No complete linkages found.
LEFT-WALL Shel[!] Silverstein[!] once.e said.v-d that.j-c he wanted.v-d to.r go.v everywhere , look.v [at] and.j-v listen.v to.r everything . 

After the said change:

LEFT-WALL shell[~].v Silverstein[!] once.e said.v-d that.j-r he wanted.v-d to.r go.v everywhere ,  look.v at and.j-v listen.v to.r everything . 
LEFT-WALL shed[~].v-d Silverstein[!] once.e said.v-d that.j-r he wanted.v-d to.r go.v everywhere ,  look.v at and.j-v listen.v to.r everything . 

Because it now has a full linkage, "Shel[!]" doesn't appear any more in the results. So the result of the current code is much better in that case.

EDIT: Fixed the display of the original sentence. Fixed truncated results.

ampli commented 7 years ago

Linas, I still need your comments on this issue.

linas commented 7 years ago

I just unearthed this in my email box. I'll try tomorrow; keep reminding me.

ampli commented 7 years ago

Please see it on GitHub, because I edited it just after my posts, to fix errors.

linas commented 7 years ago

I read through this, and see no easy solution. I want to put this on the back burner for another few months or more.

Adding costs to alternatives will become very important, soon ... I expect that language learning will reveal that some splits are far more likely than others, and this should be indicated with costs.

I have a capitalization problem for the language learning; I don't yet know how to solve it.

ampli commented 7 years ago

I want to put this on the back burner for another few months or more.

I would just like to comment that current library usage gives a "hidden" priority to linkages with no null-words, by not showing the ones with null-words at all if a linkage without them exists, thus removing the opportunity to give null-words a relatively lower cost when desired. Sometimes (as here) a linkage with null-words makes more sense.

I have a capitalization problem for the language learning; I don't yet know how to solve it.

Is there a statistical+logical way to infer that certain letters are likely "the same", e.g. A and a? For example, say we have a rule that "simpler things are more likely", and a metric for "simplicity" that scores higher if we assume certain letters have a function similar to that of others.

(In Hebrew there is a similar problem, regarding certain letters at the end of words.)