opencog / link-grammar

The CMU Link Grammar natural language parser

Get rid of max-cost #1453

Open linas opened 1 year ago

linas commented 1 year ago

There is a long, old discussion about max-cost in #783, which suggests that getting rid of the concept of max-cost is a good idea. Newer evidence, in #1449, confirms this.

My memory of max-cost is that it was originally related to panic-parsing, and that setting it to a higher value would somehow make panic parsing "find something, anything". But this seems untrue. In the English dict, there is only one thing with three brackets, [[[@Xc-]]], that would be chopped by max-cost. (There is nothing with [FOO+]3.0, because the panic-parse stuff predates floating-point costs.)

Rather than continuing the conversation in the already-long #783, let's just do it here. Or rather, let's just ... get rid of it, and be done with it.

As a side-effect, maybe get rid of panic-parse, also. Maybe.

ampli commented 1 year ago

one thing with three brackets: [[[@Xc-]]]

There are also things that get a cost >= 4 through 4.0.dialect; the first one originally had 4 brackets:

  1. <no-det-null>: [()]headline;
  2. then.#than: [[than.e]0.65]bad-spelling; etc.

ampli commented 1 year ago

get rid of it

  1. Do you mean that we should get rid of parse_options_set_disjunct_cost()?
  2. What is so bad about this concept?
  3. Without a disjunct cost limit, how do you implement dialects?

As a side-effect, maybe get rid of panic-parse, also. Maybe.

The current panic-parse parameters are indeed not useful for finding a parse in a shorter time. But its implementation infrastructure could be reused to find a full parse in cases where one has not been found (disregarding the parse time, by using new facilities that take more time). But those details arguably don't belong in this discussion...

linas commented 1 year ago

we should get rid of parse_options_set_disjunct_cost()?

Yes. For backwards compat, it would be a no-op.
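For concreteness, a minimal sketch of what that backwards-compat no-op could look like (the function name is the existing public API; the do-nothing body and the exact signature are my assumptions):

/* Hypothetical stub: keep the API, ignore the value. */
void parse_options_set_disjunct_cost(Parse_Options opts, double max_cost)
{
    (void)opts;      /* intentionally unused */
    (void)max_cost;  /* the cutoff would no longer have any effect */
}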

What is so bad about this concept?

The discussion and confusion in #783, and about what Dennis had done, indicate that it's a poorly formed concept, at least for the English dict. The English parses seem to work better if it is set to infinity.

For the Atomese dicts, it is perhaps ... not such a bad concept, as there, some outrageously large costs show up, and could be discarded.

Without a disjunct cost limit, how do you implement dialects?

!? Hmm. Interesting question. I'm not sure it's a good idea to turn dialects on and off based on some cost threshold.


OK, how about a proposal B:

  • Keep the cost_cutoff idea
  • Get rid of the cost_max computations in the code
  • Increase English max to 5 or 10 or unbounded.

linas commented 1 year ago

Arghh. Why is this so hard? I just set #define max-disjunct-cost 8.7; in the English dict, and now corpus-basic has 102 errors instead of 88 ... clearly, there is something going on that I do not understand. It's like ... the more I use it, the less I understand the subtle interactions. This is frustrating.

linas commented 1 year ago

OK, here's what happened. After removing costs, this sentence from corpus-basic parses:

*He is the character of person who would do that

Why? Because of the disjunct

person.n 4.000 Js- R+ Bs+

How? Because person.n is in words/words.n.1-const and

words/words.n.1-const: <common-const-noun>

and

<common-const-noun>: <no-det-null> & <noun-rel-s> & <noun-main-s>;
<no-det-null>: [()]headline;
<noun-main-s>: Js-;
<noun-rel-s>: {R+ & Bs+};

and apparently the headline dialect cost must be 4.0.
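Spelling out the arithmetic as I read the trace above (the 4.0 being the cost that the 4.0.dialect file attaches to the headline label on [()]):

cost(person.n disjunct) = cost([()]headline) + cost(Js-) + cost({R+ & Bs+})
                        = 4.0 + 0.0 + 0.0
                        = 4.0  <  8.7 cutoff

So raising max-disjunct-cost to 8.7 lets the formerly-suppressed no-determiner reading through.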

I conclude that the correct way to disable dialects is to set the cost to 500 ... Doing this gives

Error: en/4.0.dialect:54: After "headline": Invalid cost "500".

Then utilities.c has strtofC() ... why the heck is this so complicated? What's wrong with the <stdlib.h> version of strtof() ??? What the heck? Is this some crazy Windows or Apple nuttiness?

Setting the headline dialect to 99 allows corpus-basic to parse with only one additional error. That error is due to [[[@Xc-]]] on commas.

I will remove [[[@Xc-]]] in just a few minutes. I will change dialects to have a cost of 99 when disabled, soon after that. And as for strtofC(), I'm controlling a very strong urge to nuke it, but you probably have some valid reason to keep it ...

linas commented 1 year ago

So strtofC() appears in #1231 ... it's due to locales. Friggin locales. My impression is that, in unix, they were mis-designed. I have never had a situation where locales made something easier, instead of more complicated.

ampli commented 1 year ago

The discussion and confusion in https://github.com/opencog/link-grammar/issues/783 and about what Dennis had done indicate that it's a poorly formed concept, at least for the English dict.

Have you looked again at the document I referred you to there, which talks about the reason for such a cost computation? (I'm not saying its author is right.)

The English parses seem to work better if it is set to infinity.

Does it include "headline" parsing? In the old dict it was commented out, because it often allows bad parses. But with "infinite cost" it will be enabled permanently. On the other hand, if you remove it, how could one enable headline parsing (which you may want to try when there is no parse otherwise)?

!? Hmm. Interesting question. I'm not sure it's a good idea to turn dialects on and off based on some cost threshold.

Right now they can be turned on/off individually, just by mentioning their name. Why is it important how this is internally implemented? Would using a binary expression switch be better?

OK, how about a proposal B:

  • Keep the cost_cutoff idea

Of course; that seems fine to me...

  • Get rid of the cost_max computations in the code

I guess this may be fine too, but you may need to adjust some dictionary costs accordingly. It would even slightly speed up build_clause(), because it could then immediately skip clauses that exceed cost_cutoff (currently it cannot, because the total cost may decrease as the computation proceeds, so it discards such disjuncts only at the end).
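Schematically (hypothetical names, not the actual build_clause() code; this assumes per-term costs cannot be negative once the cost_max computation is gone):

#include <stdbool.h>
#include <stddef.h>

/* Sketch of the early skip: if the running cost can only grow, a clause
 * can be abandoned mid-loop instead of discarded only after it is built. */
static bool clause_within_cutoff(const double *term_cost, size_t nterms,
                                 double cost_cutoff)
{
    double total = 0.0;
    for (size_t i = 0; i < nterms; i++)
    {
        total += term_cost[i];
        if (total > cost_cutoff) return false; /* early skip */
    }
    return true;
}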

  • Increase English max to 5 or 10 or unbounded.

What should we do with headline and bad-spelling, which would then get enabled if we do nothing?

ampli commented 1 year ago

The increase in the cost cutoff is intended to let more sentences get parsed. However, when sentences marked with * then get parsed, that is counted as an error!

And indeed:

$ link-parser < data/en/corpus-basic.batch > /tmp/b0
...
88 errors.

$ link-parser -cost-max=8.7 < data/en/corpus-basic.batch > /tmp/b1
...
102 errors.

$ systest/lgerror /tmp/b0 /tmp/b1
<  *A tax on income increase may be necessary
<  *Alfred Baird, man I know, has been described as good gardener
<  *But my presents to win his heart have failed
<  *Clinton is expected to return to Thursday Washington office
<  *He is more black to go than to stay
<  *He is more likely than to stay
<  *He is the character of person who would do that
<  *Joan Smith is tourist
<  *Many Croat who had fled their homes are now returning to them
<  *Many people attended that they spilled over into several neighboring fields
<  *Many person were angered by the hearings
<  *That is the man, in Joe's opinion, we should hire
<  *The data on project will be used for the file at program
<  *Until initially, these fossils were believed to belong to different species

ampli commented 1 year ago

Error: en/4.0.dialect:54: After "headline": Invalid cost "500".

Then utilities.c has strtofC() ... why the heck is this so complicated? What's wrong with the <stdlib.h> version of strtof() ??? What the heck? Is this some crazy Windows or Apple nuttiness?

This is very simple: strtof() is locale-dependent; that is mandated by POSIX. It means you need to temporarily change the locale to "C" on each call and then restore the previous locale. That is not thread-friendly (it changes the locale for all threads). I fixed it by introducing my own function, which also has the slight benefit of being slightly faster.
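For concreteness, a minimal C sketch of the pitfall and of the per-thread POSIX-2008 workaround (this is not the strtofC() code; the de_DE.UTF-8 locale is an assumption and must be installed for the first part to trigger):

#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *end;

    /* Locale-dependent: in a comma-decimal locale, parsing "4.0" stops
     * at the dot, leaving ".0" unconsumed -- the "Invalid cost" failure mode. */
    if (setlocale(LC_NUMERIC, "de_DE.UTF-8") != NULL)
    {
        float f = strtof("4.0", &end);
        printf("de_DE: value=%g, unparsed tail=\"%s\"\n", f, end);
    }

    /* Thread-friendly alternative: uselocale() swaps the locale for this
     * thread only, unlike setlocale(), which affects the whole process. */
    locale_t c_loc = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_loc != (locale_t)0)
    {
        locale_t old = uselocale(c_loc);
        float f = strtof("4.0", &end);
        printf("C locale: value=%g\n", f);
        uselocale(old);
        freelocale(c_loc);
    }
    return 0;
}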

The 99.9999 is just an arbitrary limit, and I can increase it to 999.99 (if you do that, change max_int_digits to 3).

BTW, internally, costs of ~10000 are used for dialect handling. From dialect.h:

link-grammar/dict-common/dialect.h: #define DIALECT_COST_MAX         9999.0F    /* Less than that is a real cost */
link-grammar/dict-common/dialect.h: #define DIALECT_COST_DISABLE    10000.0F
link-grammar/dict-common/dialect.h: #define DIALECT_SUB             10001.0F    /* Sub-dialect (a vector name) */
link-grammar/dict-common/dialect.h: #define DIALECT_SECTION         10002.0F    /* Section header (a vector name) */
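Read schematically (my gloss, not the actual dialect-table code): any value at or above DIALECT_COST_MAX stored in that float field is a marker, not a summable cost:

#include <stdbool.h>

#define DIALECT_COST_MAX 9999.0F  /* from dialect.h */

/* Hypothetical helper: real costs live below DIALECT_COST_MAX; the
 * DISABLE/SUB/SECTION sentinels live above it. */
static bool dialect_entry_is_real_cost(float v)
{
    return v < DIALECT_COST_MAX;
}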
linas commented 1 year ago

Have you looked again at the document I referred you to there

This one? https://www.semanticscholar.org/paper/Predicate-Expression-Cost-Functions-to-Guide-Search-Bottaci/52f5feca45ea50dc2eaba649a885ff66af575234 It does not seem to be relevant to our situation. I'm not sure if I should provide a long reply to this, but you seem to enjoy long replies, so what the heck.

That paper is about evolutionary search for test data. It seems to describe what we now call "program fuzzing". Although that paper uses the terms "disjunct" and "conjunct", it does not seem to be relevant to what we do with LG.

Let's review the concept of natural language parsing.

The above was the situation in 1995. Since then, some new ideas:

At the same time, there are other forces at work:

The current questions are:

My personal answer to the second question is "sure, maybe yes, probably, I guess so, let's try it, eff around and find out". My personal mental picture of the situation is that it looks kind-of-like Morse theory, except in extremely high dimensions. Each human is coming at grammar from a slightly different direction. Each bifurcation is driven by phonetic discrimination, by the ability or inability to understand and articulate.

Great. What does this have to do with cost-max, and the paper you reference?

Well, the disjuncts and conjuncts that LG uses are a fragment of what is called "linear logic": it's a fragment, because some of the logical operators of linear logic are missing. Also, LG is not left-right symmetric; it is instead an associative algebra, and so resembles that of a pregroup grammar https://en.wikipedia.org/wiki/Pregroup_grammar -- and the resemblance is not accidental; LG appears to be isomorphic to combinatory categorial grammar. See https://github.com/opencog/atomspace/blob/master/opencog/sheaf/docs/ccg.pdf for details.
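For readers who have not seen pregroups (the standard textbook definition, nothing LG-specific): a pregroup assigns every type x a left adjoint x^l and a right adjoint x^r satisfying the contractions

x^l x ≤ 1    and    x x^r ≤ 1

and a sentence is grammatical when the product of its word-types reduces to the sentence type s. Roughly, LG's - and + connectors behave like such left and right adjoints, which is the resemblance being pointed at.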

What should the cost of.... OK, next comment. The above is too long. I don't know why I wrote it. You already understand this stuff.

linas commented 1 year ago

Some cost questions:

The second question does not make sense, because the two terms result in two different parses. Any suggestion from the Bottaci paper does not apply here.

For the first question, we have five possibilities. The cost of [A-]0.3 & [B+]0.4 is the same as [A- & B+]f(0.3, 0.4), where

The answer to this question serves two purposes:

If the preferred theory is information theory, and costs are minus log-probabilities, then f(x,y) = x+y seems like the only reasonable answer.
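To make that explicit (a standard identity, nothing LG-specific): write the connector costs as minus log-probabilities, x = -log P(A-) and y = -log P(B+). If the two choices are independent, their probabilities multiply, so

f(x,y) = -log( P(A-) P(B+) ) = -log P(A-) - log P(B+) = x + y

and any other f would break the product rule for independent events.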

I cannot think of any good reasons why other formulas could be used, or should be explored. There is nothing in the Bottaci paper that seems appropriate for our current situation: Bottaci is talking about logical operators and De Morgan's laws; the LG operators are NOT logical operators, and they do NOT obey De Morgan's laws. Bottaci is talking about statistical sampling; we are not doing statistical sampling...

I don't know where else to go with this. Yes, tiny details like this can sometimes be very important; but here, I just don't see it. Looks like a cul-de-sac to me.

ampli commented 1 year ago
  • Costs vaguely resemble minus the logarithm of a probability, and so can be understood as providing a likelihood of a parse being correct.
  • In this case, cost should be fractional, floating point. This provides a foot-hold for ideas from information theory to enter the domain of parsing.

BTW, I think that the LG system should also have another type of cost: an integer one that is applied "locally" to disjuncts during tokenization, and is not summed up (using log or any other function) over the sentence. I have tried to make this point several times in the last few years, but you have never told me your opinion on it.

because some of the logical operators of linear logic are missing.

But a common feature is that all the connectors of a disjunct must match for the disjunct to be considered matched.

If the preferred theory is information theory and that costs are log-probabilities, then f(x,y) = x+y seems like the only reasonable answer.

I agree that this is preferable for combining the costs of disjuncts. As far as I know, the LG code also computes the disjunct cost by adding together the costs of its connectors.

Here the question is what the cost cutoff means: does it apply to disjuncts, or to connectors? If to connectors, then the max function results. But that has a property that looks improper: disjuncts with a "high" total cost still participate.
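A tiny sketch contrasting the two semantics (hypothetical helper names, not the LG API):

#include <stdbool.h>
#include <stddef.h>

/* Per-connector cutoff ("max" semantics): the disjunct survives if every
 * individual connector cost is under the limit, even when the sum is large. */
static bool pass_max(const double *cost, size_t n, double cutoff)
{
    for (size_t i = 0; i < n; i++)
        if (cost[i] > cutoff) return false;
    return true;
}

/* Per-disjunct cutoff ("sum" semantics): matches costs-as-log-probabilities. */
static bool pass_sum(const double *cost, size_t n, double cutoff)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += cost[i];
    return total <= cutoff;
}

With four connectors of cost 1.0 each and a cutoff of 2.7, pass_max() keeps the disjunct while pass_sum() rejects it (total 4.0) -- the former is the improper-looking behavior described above.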

linas commented 1 year ago

BTW, I think that the LG system should also have another type of cost: an integer one that is applied "locally" to disjuncts during tokenization, and is not summed up (using log or any other function) over the sentence. I have tried to make this point several times in the last few years, but you have never told me your opinion on it.

My apologies; maybe I did not notice that you were making a proposal or asking a question. Maybe I did not understand what you were saying. Maybe I was not able to develop a strong opinion, one way or the other. Maybe I was in a hurry and just forgot to respond. Open a new discussion/issue to clearly articulate the idea.

I still might not respond, and I might not respond clearly. My general attitude will be "does this make sense from an information-theory perspective?" Also, as conversations with Anton Kolonin @akolonin over the last year have made clear, we do not have any good, high-quality theories or systems for text segmentation. This includes segmentation of Chinese, splitting Indo-European words into morphemes and suffixes, detection of sentence boundaries, etc. And without some clear, good theory of how this can work, it will be hard to decide whether this or that modification to LG tokenization is a good idea.

because some of the logical operators of linear logic are missing

What I meant is that the upside-down ampersand is missing, the question mark is missing, etc. See https://en.wikipedia.org/wiki/Linear_logic and https://plato.stanford.edu/entries/logic-linear/ -- linear logic describes tensor algebras; natural language is tensor-like, but is not the full thing. (LG connectors can be understood as tensor indexes, and LG disjuncts can be understood as tensors. However, not all of the required axioms for tensors apply.) I find this interesting, but it's another cul-de-sac.

Here the question is what is the meaning of the cost cutoff

I got tired of agonizing over this, and merged #1456 -- that thing did not help English in any way, and it was not needed for Atomese, so I just got rid of it.

linas commented 1 year ago

As of this time, after merging the last few pull reqs #1456 and #1455, the code base implements "proposal B" of comment https://github.com/opencog/link-grammar/issues/1453#issuecomment-1444452316