opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Get rid of `null_count>0` parsing #1449

Open linas opened 1 year ago

linas commented 1 year ago

The proposal, in short: get rid of all support for parsing with null words. It seems to be complicated, requiring extra checks and tests in all sorts of delicate locations. It adds complexity to the code. Perhaps it should be gotten rid of?

The replacement is to extend/alter the ZZZ-link mechanism to handle the case of unlinkable words. It seems there are several ways to do this:

1) If there are no parses, then rebuild the expressions for the words, and automatically add ({XXX-} & {XXX+}) to every expression, and then re-parse.

2) Always add ({XXX-} & {XXX+})[2.5] to every expression, even from the very beginning. This would avoid the need for a second parse, in all situations: there would always be some linkage, no matter what.

Both of these have problems with cost.

Both of these might (?) be solvable with a minor redesign of how cost-max (aka cost_cutoff) actually works. See this comment about Dennis's code:

https://github.com/opencog/link-grammar/blob/4672e07635c0c1772e59f806862cd31c17df4d07/link-grammar/prepare/build-disjuncts.c#L204-L218

I think maybe Dennis had it right? That is, if we set:

c1->cost += e->cost;
c1->maxcost = MAX(c1->maxcost,e->cost);

then, yes, it will often be the case that max-cost will be less than cost. But that is OK. It will allow expressions such as A-[1.6] & XXX+[2.5] to have a grand total cost of 4.1=1.6+2.5 and thus get ranked appropriately, while maxcost will be just 2.5, which is still below the cost_cutoff of 2.7, so the disjunct still gets included in the parse. This would allow option 2) to be used, in place of null-counts.
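To make the arithmetic concrete, here is a tiny standalone sketch (toy code, not the actual build-disjuncts.c data structures) of the two accumulation rules applied to the A-[1.6] & XXX+[2.5] example:

#include <stdio.h>

#define MAX(a,b) (((a) > (b)) ? (a) : (b))

int main(void)
{
    /* Costs of the two conjuncts of  A-[1.6] & XXX+[2.5]  */
    double costs[2] = { 1.6, 2.5 };
    double cost = 0.0;         /* total cost, same under either rule */
    double maxcost_sum = 0.0;  /* current code:  maxcost += e->cost */
    double maxcost_max = 0.0;  /* Dennis's rule: maxcost = MAX(maxcost, e->cost) */

    for (int i = 0; i < 2; i++)
    {
        cost += costs[i];
        maxcost_sum += costs[i];
        maxcost_max = MAX(maxcost_max, costs[i]);
    }

    /* Prints: cost=4.1  maxcost(sum)=4.1  maxcost(max)=2.5
     * With cost_cutoff = 2.7, only the MAX rule keeps the disjunct,
     * while the ranking still sees the full 4.1. */
    printf("cost=%.1f  maxcost(sum)=%.1f  maxcost(max)=%.1f\n",
           cost, maxcost_sum, maxcost_max);
    return 0;
}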

Does this make sense? @ampli do you like this? I think this would even be an easy experiment to make.

FWIW, I like this idea, because it is effectively what I am already doing in the Atomese dict: either I am very certain of a disjunct (so it has a low cost), or I don't really know, so I assemble one from word pairs (at a higher cost), or I really don't know (e.g. an unknown word), in which case a random high-cost link is allowed.

Long-term, there is another issue: the interplay with cost-max and combinatorial explosions. It would be nice (nicer) to have "flood-level" counting: slowly raise cost-max until the count grows greater than zero. But I don't know how to implement this efficiently/automatically. (See, however, issue #1451.)

linas commented 1 year ago

I just tried the experiment.

--- a/link-grammar/prepare/build-disjuncts.c
+++ b/link-grammar/prepare/build-disjuncts.c
@@ -205,13 +205,13 @@ static Clause * build_clause(Exp *e, clause_context *ct, Clause **c_last)
        for (Clause *c1 = c; c1 != NULL; c1 = c1->next)
        {
                c1->cost += e->cost;
-               /* c1->maxcost = MAX(c1->maxcost,e->cost);  */
+               c1->maxcost = MAX(c1->maxcost,e->cost); 
                /* Above is how Dennis had it. Someone changed it to below.
                 * However, this can sometimes lead to a maxcost that is less
                 * than the cost ! -- which seems wrong to me ... seems Dennis
                 * had it right!?
                 */
-               c1->maxcost += e->cost;
+               /* c1->maxcost += e->cost; */
                /* Note: The above computation is used as a saving shortcut in
                 * the inner loop of AND_type. If it is changed here, it needs to be
                 * changed there too. */

gives no change for corpus-basic.batch, although one linkage changes.

For corpus-fixes.batch, the number of errors drops from 372 to 366, and the newly allowed parses are better than the null-word parses! For example:

I don't know but what.
Then what?
This I did not know!
Oh God, then what?
Then what?

This says Dennis was right, and we've been carrying a cost-max bug for a decade!

The corpus-failures.batch drops from 1512 errors to 1505 errors.

The pandp-union.txt drops from 1829 errors to 1791 errors.

I have a new Russian text that you don't have, some WWI vainglorious suffering; it drops from 156 errors to 152. Should this, and maybe pandp, be checked into git?

linas commented 1 year ago

If #1450 is merged, then we could always add ({XXX-} & {XXX+})[2.9] to every expression, and then, if there are no parses, just change cost_cutoff from 2.7 to 2.9.

The more I think about this, the more I like it. I'm infatuated with it, now.
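A rough sketch, using the public API, of what that two-step flow could look like from the client side. This assumes a dictionary to which the ({XXX-} & {XXX+})[2.9] additions have already been made; the 2.7 and 2.9 cutoffs are the ones discussed above:

#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Dictionary dict = dictionary_create_lang("en");
    Parse_Options opts = parse_options_create();
    Sentence sent = sentence_create("this is a test", dict);

    /* First pass: the usual cost cutoff; the XXX disjuncts are invisible. */
    parse_options_set_disjunct_cost(opts, 2.7);
    int n = sentence_parse(sent, opts);

    if (n <= 0)
    {
        /* No full parse: raise the cutoff so that the high-cost XXX
         * disjuncts become usable, and parse one more time. */
        parse_options_set_disjunct_cost(opts, 2.9);
        n = sentence_parse(sent, opts);
    }
    printf("linkages found: %d\n", n);

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}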

ampli commented 1 year ago

See the long discussion in the still-open issue #783.

I don't remember (and didn't compare just now) if my proposed change is the same as your change here. In any case, looking at the total error count may not tell the full story if some previously-parsed sentences became unparsable. To see the full error differences, you can use systest/lgerror run1.out run2.out (on outputs of link-parser using the same batch file).

ampli commented 1 year ago

Does this make sense? @ampli do you like this? I think this would even be an easy experiment to make.

In issue #783 I referred to an article that uses cost logic identical to the original (current) code. We also discussed (maybe elsewhere) your proposed method of parsing with nulls, and the conclusion was that we might have a problem with a combinatorial explosion.

Long-term, there is another issue: the interplay with cost-max and combinatorial explosions. It would be nice (nicer) to have "flood-level" counting: slowly raise cost-max until the count grows greater than zero. But I don't know how to implement this efficiently/automatically. (See, however, issue #1451.)

I will answer that there.

ampli commented 1 year ago

Please indicate if I missed referring to anything else here.

ampli commented 1 year ago

2. Always add ({XXX-} & {XXX+})[2.5] to every expression, even from the very beginning. This would avoid the need for a second parse, in all situations: there would always be some linkage, no matter what.

One problem is that we will get the parses of any null count, from 1 to (sent->length-1). Due to the exploding number of linkages, sampling will be needed. Even if there is a linkage with one null, you could never sample enough links to be sure, unless you encounter it by chance. And even if you encountered one, it may be very hard to find additional ones with different null words (or even to find out that they exist). In principle do_count() could account for the number of used XXX links, but then we have again the same algorithm as parsing with nulls.

I strongly suspect that your proposal is equivalent to the current algo of parsing with null links if you just remove from it the null-link counting. You will then get all the linkages with any number of null words, with the exact same complexity. If you look at the original paper that talks about parsing with nulls, it details how null links are equivalent to implicit links added to each word that ensure a linkage.

linas commented 1 year ago

OK, so #783 is very long and it will take me a while to digest it.

In the meanwhile: you seem to be saying that I'm just reinventing null links. And perhaps that is true. However, grep -r null_count * |wc says that there are 221 lines that have null_count in them, which implies a lot of extra complexity in the code for null-linked words. Getting rid of this would make future changes simpler. (In https://github.com/opencog/link-grammar/pull/1446#issuecomment-1441397457 you mentioned "Add match-list cache support for null_count>0 (if turned out not to be trivial,...)", which means: doing anything with null_count>0 is non-trivial. So let's just get rid of it!)

Due to the exploding number of linkages, sampling will be needed.

Unless we find a way of enumerating linkages in order, according to #1451 ... if that could be done, then it would not matter how many linkages there are, because we could always get them in sorted order, without sampling. The modification there, to list_links, seems like it could work, if I had the total costs on the left and right available, so that I'd know which tree path to step to next.

Even if there is a linkage with one null, you could never sample enough links to be sure, unless you encounter it by chance. And even if you encountered one, it may be very hard to find additional ones with different null words (or even to find out that they exist).

There is no problem, if linkages can always be accessed in sorted order. If there is one null link, then cost determines which word is the null-linked word.

linas commented 1 year ago

Get rid of islands, too: grep -r island * |wc says 40

ampli commented 1 year ago

In the meanwhile: you seem to be saying that I'm just reinventing null links. And perhaps that is true.

Yes.

However, grep -r null_count * |wc says that there are 221 lines that have null_count in them, which implies a lot of extra complexity in the code for null-linked words. Getting rid of this would make future changes simpler

Not at all. The null_count denotes the minimal null-linked word count in the sentence that is needed in order to parse it. It is still needed, as it is also the basis for the workings of sane_linkage_morphism() and of optional words (which are needed for morphology handling). Otherwise you would need to redo everything using the old "empty-word" idea. All you would get is calling null_count by another name, say "XXX-linked", so you don't simplify anything.

The "complexity" of the null_count handling in do_count() is only for the purpose of finding the *minimal null_count needed for parsing. If you remove this code you will parsing results for any number of nulls (1-sent_length], which will take much more time and memory than for parsing with a fixed null_count (for sentences of several tens of words, I estimate this by several tens-fold of CPU and memory). And on link extraction, if you would then want to find out the minimal number of null words that allow parsing, you would need to reimplement the same algo that you removed from do_count() but now it would be more complex code, and also much more time-consuming (due to the bigger data structures).

So I cannot see why you may want it.

BTW, there is also null_cnt in do_count() that adds to your count. But many of its occurrences there (and the null_count assurances in mk_parse_set()) are redundant and appear only due to the way the code is written. For example, the parsing of the LHS and RHS in both do_count() and mk_parse_set() is done by similar code, while it could be done by calling a function that encapsulates the common code (the disjunct definition would then better include Connector *c[2] instead of, or in addition to, Connector *left, *right). The rest of the do_count() calls could also go through a function that would remove visible code and may make the COUNT/CACHE_COUNT macros unneeded. As a side effect, this would cosmetically reduce the number of null_count/null_cnt occurrences. However, for now, I would like to devote my time to finishing the old WIPs that have been waiting for so long (some need to be redone because they cannot be merged anymore, e.g. they use the old Exp_struct).
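Something like this hypothetical sketch (invented names, not the actual structs) is what the direction-indexed layout could look like:

/* Hypothetical sketch only: index the connector lists by direction,
 * so the near-duplicate LHS/RHS code in do_count() and mk_parse_set()
 * could be folded into one helper that takes the direction as an argument. */
typedef enum { DIR_LEFT = 0, DIR_RIGHT = 1 } Link_direction;

struct connector_sketch;   /* stands in for the real Connector */

struct disjunct_sketch
{
    struct connector_sketch *c[2];  /* c[DIR_LEFT] / c[DIR_RIGHT] instead of
                                     * (or in addition to) *left, *right */
    /* ... the other Disjunct fields, unchanged ... */
};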

ampli commented 1 year ago

Get rid of islands

As you may recall, we need islands for implementing your phantom-word handling idea (for which it seems to function magically).

ampli commented 1 year ago

which means: doing anything with null_count>0 is non-trivial

It seems trivial - just add an array subscript. But I have not implemented it yet, hence my comment.

So let's just get rid of it!

But it is absolutely needed, since your proposal will not find the parse with the fewest nulls, and parsing with nulls would be many times slower.

There is no problem, if linkages can always be accessed in sorted order. If there is one null link, then cost determines which word is the null-linked word.

The problem is how to extract linkages in sorted order according to any combination of one or more of [disjunct-cost, unused-words, link-length], given only a local order in the parse-set/parse-choice graph.

linas commented 1 year ago

which will take much more time and memory than for parsing with a fixed null_count (for sentences of several tens of words, I estimate this by several tens-fold of CPU and memory)

The back-of-the-envelope calculation for this is that, if one adds a single optional XXX connector to every expression, this has the effect of doubling the number of disjuncts (for that word). If a sentence has $N$ words, this would increase the number of disjuncts by a factor of $2^N$ -- so even if pruning worked very very well, many of these would still sneak by and use more RAM and CPU.
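(For a 20-word sentence, for example, that is already a factor of $2^{20} \approx 10^6$.)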

OK, I can accept that argument. (And I'll close this issue)

However, my thinking still runs in these directions

About this last bullet -- let's think about that some more -- it leads to the possibility that a parse with zero null words might actually be more costly than a parse with one null word plus the cost of the XXX link.

Is it worth the effort to try this?

ampli commented 1 year ago

it doesn't make sense to try to hand-write dictionaries any more.

Right.

What can one do with these dictionaries, that cannot be better done in some other way?

I can think of:

  1. Getting deterministic and explainable results.
  2. Getting fine-grained POS tagging. But most probably a GPT system can do it if trained for that. (BTW, the current ChatGPT hallucinated when I asked it to do link-grammar parsing.)

So, the Atomese dictionaries are written so that a null-linked sentence can never occur: instead, a random ANY link will be used to connect to such unlinkable words. (The ANY is equivalent to the XXX link, above)

But if you provide such ANY links, even a sentence that has a full parse will also have plenty of parses that involve these ANY links. Is it useful?

  • I would have to write brand new code to modify existing disjuncts, after parsing, and insert the ANY links in appropriate places. Yuck. I'm sick and tired of writing complicated code.

Suppose that after a parse that gets null words, we add ANY links to all the chosen disjuncts of all the linkages. Will we get something useful?

Is it worth the effort to try this?

I really don't know. But I think that it should be tried first w/o modifying the master repository branch.

linas commented 1 year ago

But if you provide such ANY links, even a sentence that has a full parse will also have plenty of parses that involve these ANY links. Is it useful?

The ANY links have a high cost. Full parses would have much lower costs, and so ranking would push the parses with ANY links to the bottom. For the current English dict, a cost of 3.0 or 3.5 or 4.0 seems like a good choice -- slightly larger than almost all existing costs in the dict.

However, there is still a possibility of a "cost inversion", where a parse with a single ANY link is cheaper than the cheapest parse without an ANY link. This would be unlikely, but perhaps some weird constructions would cause this. It could be used as an indicator of what a "better" collection of disjuncts could be.


Note this curious interplay between null-words & islands.

linkparser> It is the thing the answer
Found 24 linkages (24 had no P.P. violations) at null count 1
    Linkage 1, cost vector = (UNUSED=1 DIS= 0.10 LEN=13)

               +-----------Osm----------+
    +--->WV--->+   +--------Ds**v-------+
    +->Wd--+-Ss+   |     +------AN------+
    |      |   |   |     |              |
LEFT-WALL it is.v the thing.n [the] answer.n
linkparser> !isl
Use of null-linked islands turned on.
linkparser> It is the thing the answer
No complete linkages found.
Found 28 linkages (28 had no P.P. violations) at null count 1
    Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=12)

    +--------------->Wa---------------+
    |          +---Osm---+            |
    |      +-Ss+   +Ds**c+     +--Dm--+
    |      |   |   |     |     |      |
LEFT-WALL it is.v the thing.n the answer.n

To fully connect the first parse, there are two possibilities:

thing.n ----ANY---- the

or

the  ---ANY--- answer

both of which give a total cost of 4.1 (assuming ANY costs 4.0). For the island-parse, a full linkage is possible with either

LEFT-WALL ---ANY--- it

or

thing.n ---ANY--- the

either of which gives a total cost of 5.0.

There are probably sentences for which the cost of the island-parse is less than the cost of the null-word parse.

Suppose that after a parse that gets null words, we add ANY links to all the chosen disjuncts of all the linkages. Will we get something useful?

The only places where ANY connectors can be added are as shown in the examples above.


most probably a GPT system can do it if trained for that.

So, to be clear about what I am trying to do: the DL/NN systems proceed by gradient descent. I'm trying to do something "kind of similar", but by counting. I also get high-dimensional vector spaces, and I also see phenomena similar to the early word2vec results -- "King is to Queen what Paris is to London" type stuff, etc. Interpretability is a mixed bag, because ... my high-dimensional vector spaces are populated with lots of floats. Those floats are mutual information values, and are interpretable as such ... but ... it's still lots of numbers, in the end.

The other big difference is that I'm just one guy doing something that no one understands vs. an army of thousands who are fine-tuning both the theory and the GPU code. I've got hundreds of good ideas worth exploring, but it takes me maybe 2-6 months to explore one good idea, so forward progress is on a glacial timescale.

I think that what I'm doing can match anything that can be done with DL/NN. It might even be cheaper to compute. But I am working with a large, complex, fragile system in which I create 3 bugs for every two changes. There is a big advantage to working with simple systems, and my current setup is anything but simple.

linas commented 1 year ago

BTW, for the above example:


    +------------Xx-----------+
    +--->WV--->+---Osm---+    +--->Wa---+
    +->Wd--+-Ss+   +Ds**c+    |  +--Dm--+
    |      |   |   |     |    |  |      |
LEFT-WALL it is.v the thing.n , the answer.n

so inserting a "phantom comma" would have give a full parse. And that full parse most-closely resembles the island parse; the island parse is "almost correct", in this example.

ampli commented 1 year ago

Suppose that after a parse that gets null words, we add ANY links to all the chosen disjuncts of all the linkages. Will we get something useful?

The only places where ANY connectors can be added are as shown in the examples above.

My idea was to do a linkage again and then we are supposed to get them only in the places you indicated.

linas commented 1 year ago

My idea was to do a linkage again and then we are supposed to get them only in the places you indicated.

Oh, that would work!

linas commented 1 year ago

My idea was to do a linkage again and then we are supposed to get them only in the places you indicated.

A related idea would be to prune everything first, including power pruning, and then add ANY links (and then maybe prune one more time). This would keep the overall size low. It would also allow the detection of "cost inversions", where a parse with a single high-cost ANY link is cheaper than a full linkage without it.

ampli commented 1 year ago

so inserting a

Is there still a reason to proceed with your idea with the z connectors? (I started to investigate how to implement it.)

ampli commented 1 year ago

A related idea would be to prune first, including power pruning,

I guess you mean an additional pruning step, since at the start we always prune. But when should this additional pruning be done? If after the linkage, then I cannot see how the chosen disjuncts could have links that can be pruned.

linas commented 1 year ago

I mean, first do expression lookup as normal, and all pruning, as normal, and then add ANY connectors to all the disjuncts that remain after power-pruning. (then maybe prune one more time?) Then parse as normal.

linas commented 1 year ago

Except I don't know how to add ANY connectors to disjuncts. They would have to be added as optional connectors, added to the beginning and end of disjuncts, and maybe even in the middle (?!), so this would cause the number of disjuncts to explode by a factor of 2^N or 3^N or more.

This would be easier, if multi-connectors were zero-or-more, instead of one-or-more.
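Just to make the bookkeeping concrete, here is a toy sketch (invented types, not the real Disjunct/Connector structures) of what "optional ANY at each end" would have to expand into: four materialized variants per disjunct for the two ends alone, before even considering middle positions:

#include <stdlib.h>

typedef struct toy_connector
{
    const char *name;            /* e.g. "ANY-", "ANY+" */
    struct toy_connector *next;
} ToyConnector;

typedef struct
{
    ToyConnector *left;          /* left-pointing connectors */
    ToyConnector *right;         /* right-pointing connectors */
} ToyDisjunct;

static ToyConnector *prepend(const char *name, ToyConnector *rest)
{
    ToyConnector *c = malloc(sizeof *c);
    c->name = name;
    c->next = rest;
    return c;
}

/* Expand one disjunct into its four with/without-ANY variants.
 * (Where exactly ANY should sit in the connector ordering is itself
 * an open question -- this just bolts it onto the near end.) */
static void expand_with_any(const ToyDisjunct *d, ToyDisjunct out[4])
{
    for (int i = 0; i < 4; i++)
    {
        out[i] = *d;
        if (i & 1) out[i].left  = prepend("ANY-", out[i].left);
        if (i & 2) out[i].right = prepend("ANY+", out[i].right);
    }
}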

ampli commented 1 year ago

I mean, first do expression lookup as normal, and all pruning, as normal, and then add ANY connectors to all the disjuncts that remain after power-pruning. (then maybe prune one more time?) Then parse as normal.

My suggestion of adding ANY only after a parse that gets null words, is in order to save processing time while getting the same final result.

then maybe prune one more time?

I cannot see how it can prune anything then, since the added disjuncts with ANY can connect to the other ones, and those w/o ANY have already been checked in the initial pruning.

linas commented 1 year ago

My suggestion of adding ANY only after a parse that gets null words, is in order to save processing time while getting the same final result.

Yes, I understand. That would work.

But after I read this, it gave me another idea: add ANY links after pruning, before parsing. This is much less efficient, but would capture the case of cost-inversion. The ability to play with low-cost ANY links could be interesting.

ampli commented 1 year ago

Will it be useful to add such a "test" or Parse_option, which includes the link cost?

linas commented 1 year ago

Will it be useful to add such a "test" or Parse_option, which includes the link cost?

Yes. I now see three cases:

1) What you first proposed: parse with null links, then, for each resulting linkage, add ANY connectors, reparse, and show results.
2) Parse as normal, and if there are no complete linkages, add ANY links and try again.
3) Add ANY links before the first parse.

Case 1) is worthy of a Parse_option, because it really does do something useful: it proposes a linkage, in the same format as a normal linkage, that contains null-linked words. It would be cute if we could print ??? for the link name, to indicate it was a guess. I was calling it XXX and ANY; other good names are UNK, MAYBE, GUESS or GSS.

Case 2) is also worth a Parse_option, although users might have trouble telling it apart from case 1, because it gives more-or-less the same results. I'm interested in comparing the performance of cases 1 and 2.

Case 3) is ... meh. So a "test". It's obviously slower and uses more memory than 1 or 2, and would only give different results if there was a cost inversion. It would allow us to find out how common such inversions are ...