opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Sentence: bird flu was observed in which countries? #1050

Open ampli opened 4 years ago

ampli commented 4 years ago

In an unrelated search I encountered page 358 of "Intelligent Information and Database Systems: 8th Asian Conference ..., Part 2".

This conference was in 2016, but judging by their benchmark times they seem to have used the original CMU version (a common thing). The problem is the same, though:

    +--------------------------------Xp--------------------------------+
    +--------------->WV--------------->+                               |
    +------>Wd------+                  |                               |
    |        +--AN--+---Ss--+----Pvf---+---MVp--+-------Jp------+      |
    |        |      |       |          |        |               |      |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r [which] countries.n ?

(They ended up using another parser.)

On the other hand, "in which countries?" does parse:

Found 1 linkage (1 had no P.P. violations)
    Unique linkage, cost vector = (UNUSED=0 DIS= 2.00 LEN=4)

    +-------------Xp-------------+
    |       +------Jp-----+      |
    +-->Wj--+-JQ-+---Dmc--+      |
    |       |    |        |      |
LEFT-WALL in.r which countries.n ?

Here in.r uses its disjunct Wj- & JQ+ & J+ to attach to which countries, so as a test I tried adding the disjunct MVp- & JQ+ & J+.

It didn't work and the question is why. Fixing this as needed may also be interesting.

linas commented 4 years ago
Get a more detailed help on a variable as in "!help var".
linkparser> !bad
Display of bad linkages turned on.
linkparser> bird flu was observed in which countries ?
Found 2 linkages (0 had no P.P. violations)
    Linkage 1 (bad), cost vector = (UNUSED=0 DIS= 0.20 LEN=12)
"Misuse of preposition13"

    +-------------------------------Xp-------------------------------+
    +--------------->WV--------------->+                             |
    +------>Wd------+                  |        +------Jp-----+      |
    |        +--AN--+---Ss--+----Pvf---+---MVp--+-JQ-+---Dmc--+      |
    |        |      |       |          |        |    |        |      |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r which countries.n ?

so there it is: "Misuse of preposition13"

Fixing this requires ... being clever. Usually by finding similar sentences that work, and stealing ideas from those. Simply disabling "Misuse of preposition13" will just increase the number of failures in corpus-basic.
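To make the failure above concrete, here is a rough Python sketch of how a "contains one" post-processing rule such as "Misuse of preposition13" behaves. It is a simplified model, not the library's code: the real post-processor applies such rules per domain, while this sketch treats the whole linkage as a single domain. The name matching is an approximation of the post-processing match as I understand it, and the function names `pp_match` and `violates_contains_one` are made up for illustration.

```python
def pp_match(pattern, link):
    """Approximate post-processing name match (an assumption, not the C code).

    The uppercase parts must be identical. In the subscript part, '#' in the
    pattern matches anything, and a subscript missing from the link is '*'.
    """
    if link[:1].islower():              # skip a head/dependent marker (h or d)
        link = link[1:]
    i = 0
    while (i < len(pattern) and pattern[i].isupper()) or \
          (i < len(link) and link[i].isupper()):
        if i >= len(pattern) or i >= len(link) or pattern[i] != link[i]:
            return False
        i += 1
    for j in range(i, len(pattern)):
        have = link[j] if j < len(link) else "*"
        if pattern[j] not in ("#", have):
            return False
    return True


def violates_contains_one(links, trigger, required):
    """True if a link matches `trigger` but no link matches any of `required`."""
    if not any(pp_match(trigger, name) for name in links):
        return False
    return not any(pp_match(r, name) for r in required for name in links)


# Link names read off the two linkages shown in this thread.
bird_flu = ["Xp", "WV", "Wd", "AN", "Ss", "Pvf", "MVp", "JQ", "Jp", "Dmc"]
in_which = ["Xp", "Wj", "JQ", "Jp", "Dmc"]
rule13 = ("JQ", ["Mj", "Wj", "MX#j"])   # "Misuse of preposition13"

print(violates_contains_one(bird_flu, *rule13))   # True: JQ without Mj, Wj or MX#j
print(violates_contains_one(in_which, *rule13))   # False: Wj is present
```

In this model the first linkage is rejected only because none of the listed links accompanies JQ, so any fix has to either change which links appear next to JQ or extend the list of accepted links.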

linas commented 4 years ago

Also -- if the proceedings have an e-mail, please do send them an email and remind them that more modern versions exist ...

linas commented 4 years ago
bird flu was observed where?
bird flu was observed how?
bird flu was observed when?
when was bird flu observed?
In which countries was bird flu observed?

The first three fail completely; the last two work fine. The first three are "inverted questions". Note how the last two use SI (inverted subject), which suggests that the first three need a new kind of link, maybe "QP" for "inverted question". Something like this:

----->+                             
      |        +---
Pvf---+---QP---+-JQ
      |        |    
observed.v-d in.r wh

But then you have to invent something to prevent QP from being used to parse "I saw in which room". Hmmmm. See https://www.abisource.com/projects/link-grammar/dict/section-JQ.html

Oh, OK, so then Pvf- & QP+ would work, it seems. That's because Pv is used for "was verbed" constructions, which are valid for inverted questions, but would not allow "I saw in which". To make it even tighter, use Pvf- & (WV- or CV-) & QP+ so that the participle must be identified as the head-verb.

Maybe instead of inventing a new link QP, there is some existing link we can reuse. Not sure; I would have to review the documentation. It's likely that a new link is needed, since questions are ... very different from normal sentences, and also LG is weaker with questions.

linas commented 4 years ago

(above comment edited)

ampli commented 4 years ago

LG is weaker with questions

This is a pity, since people try to use it for decoding queries.

ampli commented 4 years ago

Also -- if the proceedings have an e-mail, please do send them an email and remind them that more modern versions exist ...

BTW, about 2 weeks ago I sent a letter on a similar thing to Prof. Ahn, who very recently (Oct 2019) published this paper on a system in which LG is used: A Function as a Service Based Fog Robotic System for Cognitive Robots. (No answer yet.)

linas commented 4 years ago

It's a pity

Do you want to try fixing it, or should I?

ampli commented 4 years ago

Do you want to try fixing it, or should I?

I tried just adding MVp to the Misuse of preposition13 rule, and at first glance it looks fine:

--- a/data/en/4.0.dict
+++ b/data/en/4.0.dict
in.r:
   <alter-preps>
   or ({JQ+} & (J+ or Mgp+ or IN+) & (<prep-main-a> or FM-))
   or K-
   or (EN- & (Pp- or J-))
   or <locative>
   or [MVp- & B-]
   or (MG- & JG+)
-  or <null-prep-qu>;
+  or <null-prep-qu>
+  or (MVp- & JQ+ & J+);

--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
- JQ    ,  Mj    Wj    MX#j                  , "Misuse of preposition13" ,
+ JQ    ,  Mj    Wj    MX#j  MVp             , "Misuse of preposition13" ,

It didn't change the number of errors in corpus-basic. In corpus-fixes it reduced the number of errors from 379 to 373, because these sentences now parse:

Sophy wondered up to what number she should count
Sophy wondered up to what number to count
Sophy wondered up to what number to count to
Sophy wondered up to whose favorite number she should count
Sophy wondered up to whose favorite number to count
Sophy wondered up to whose favorite number to count to

Since they don't include in.r, this is only due to the addition of MVp in the said PP rule.

Summary of errors by corpus:

corpus        now   patched   diff   linkage-limit
basic          82        82      0            1000
fixes         379       373     -6            1000
fix-long        9         9      0           10000
failures     1556      1555     -1            1000
pandp-union  2016      2007     -9            1000
pandp-union  1998      1990     -8           30000

With the long-sentence batches I just tried -limit=30000. Processing the pandp-union corpus then takes a long time, and maybe a lower value would be enough (I have more to say about that...). The difference in the number of "fixed" sentences in pandp-union between the two limits seems to be due to a different number of "combinatorial explosions" caused by the changed rules (but I'm not sure; we can find the differing sentence and investigate it).

So based on these checks maybe this change is fine. However, I guess you will want to investigate:

  1. The reason for the unexpected fixes in corpus-fixes.
  2. Some additional correct sentences that didn't parse before.
  3. Some additional wrong sentences that didn't parse before (as needed) - to validate that the proposed patch doesn't cause them to parse.
ampli commented 4 years ago

Minor editing of my previous message (table diff value + missing open parenthesis).

ampli commented 4 years ago

bird flu was observed where? bird flu was observed how? bird flu was observed when?

I tried to fix them by brute force, by adding what seems to be a missing QI+ when Pvf- is present, as hinted by:

 linkparser> 
    Linkage 2, cost vector = (UNUSED=1 DIS= 0.20 LEN=9)

    +------------------------Xp-----------------------+
    +--------------->WV--------------->+              |
    +------>Wd------+                  |              |
    |        +--AN--+---Ss--+----Pvf---+              |
    |        |      |       |          |              |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d [where] ?

Press RETURN for the next linkage.
linkparser> 
    Linkage 3, cost vector = (UNUSED=1 DIS= 1.10 LEN=10)

    +----------------------Xp---------------------+
    +-------------->WV-------------->+            |
    +------>Wd------+                |            |
    |        +--AN--+-------Ss-------+---QI---+   |
    |        |      |                |        |   |
LEFT-WALL bird.n flu.n-u [was] observed.v-d where ?

Instead of just adding QI+, I added the macro in which it resides. I have no idea if this is better.

 predicted.v-d realized.v-d discovered.v-d determined.v-d announced.v-d
 mentioned.v-d admitted.v-d recalled.v-d revealed.v-d divulged.v-d
 stated.v-d observed.v-d indicated.v-d stammered.v-d bawled.v-d
 analysed.v-d analyzed.v-d
 assessed.v-d established.v-d evaluated.v-d examined.v-d questioned.v-d
 tested.v-d hypothesized.v-d hypothesised.v-d well-established.v-d
 envisaged.v-d documented.v-d:

   ((<verb-sp,pp> & (<vc-predict>)) or
   (<verb-and-sp-i-> & ([<vc-predict>]0.2 or ())) or
   ((<vc-predict>) & <verb-and-sp-i+>) or
   <verb-and-sp-t>)
-  or (<verb-s-pv> & {THi+})
+  or (<verb-s-pv> & ({THi+} or <vc-predict>))
   or <verb-adj>
   or <verb-phrase-opener>;

The result is that these sentences get parsed, with no additional errors in the 5 tested corpus batches. E.g.:

    +-----------------------Xp----------------------+
    +--------------->WV--------------->+            |
    +------>Wd------+                  |            |
    |        +--AN--+---Ss--+----Pvf---+---QI---+   |
    |        |      |       |          |        |   |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d where ?

Supposing this is correct (I don't know), then still:

  1. There is a need to justify adding the whole <vc-predict> and not just part of it.
  2. There is a need to think of examples that may break this addition.
  3. Maybe it is needed for other verbs too, so this change has to be done in one (or more) of the verb macros.
linas commented 4 years ago

I've been hacking on this; look at my branch "qi". I have not tested for regressions.

linas commented 4 years ago

Regarding MVp, the page https://www.abisource.com/projects/link-grammar/dict/section-JQ.html gives the example: "*I saw in which room"

linas commented 4 years ago

So pull req #1051 fixes this, but I did not measure regressions. I'm also contemplating changing Misuse of preposition14 so that "You slept with who?" will parse.

linas commented 4 years ago

And .. in the finest of traditions, the changes to the dict mean that all run-times are now slower by 10% or 20% or something like that ... all of your performance tuning gets blown away by some fairly minor dict changes that one might think would not matter.

Perhaps it's wrong to think of them as "minor": {QI+} is now &-ed with lots of common verbs: did, said, and many, many others. The total number of expressions is significantly larger, and the total number of disjuncts is larger... It would be interesting to look at these totals, and their distributions, for typical dictionaries, over time.

It would be much simpler, and also be interesting to see how dictionaries from different eras compare on performance, on the current parser.

linas commented 4 years ago

I remeasured performance, correctly, this time; the performance hit is minor

ampli commented 4 years ago

I fetched your "qi" branch and made some tests.

Regarding MVp, the page https://www.abisource.com/projects/link-grammar/dict/section-JQ.html gives the example: "*I saw in which room"

The problem with my (and your) fix to bird flu was observed in which countries? is that now the said example "*I saw in which room" does parse.

It seems to me that the root of the problem is that in the fix we treat "was observed" as a "passive participle", i.e. a verb, and then there is no way to distinguish the different cases (as "saw" is a verb too).

So I propose instead that the role of "was observed" in that sentence is "predicate adjective", and in this role it should use Pa & JQ & J+.

I.e. something is predicated, and on that basis we ask a question: where, when, in which countries etc. This way, in the Misuse of preposition13 rule we can require Pa instead of MVp, and this Pa should also be added to Misuse of preposition14.

This proposal doesn't handle the "up to" sentences, so they remain unfixed. I think their fix is different, so we can discuss it later (unless it seems related to you).

To check this proposal I made these changes:

--- a/data/en/4.0.dict
+++ b/data/en/4.0.dict
predicted.v-d realized.v-d discovered.v-d determined.v-d announced.v-d
...   
   ((<verb-sp,pp> & (<vc-predict>)) or
   (<verb-and-sp-i-> & ([<vc-predict>]0.2 or ())) or
   ((<vc-predict>) & <verb-and-sp-i+>) or
   <verb-and-sp-t>)
   or (<verb-s-pv> & {THi+})
+  or (Pa- & (MVp+ or <vc-predict>))
   or <verb-adj>
   or <verb-phrase-opener>;

 in.r:
   <alter-preps>
   or ({JQ+} & (J+ or Mgp+ or IN+) & (<prep-main-a> or FM-))
   or K-
   or (EN- & (Pp- or J-))
   or <locative>
   or [MVp- & B-]
   or (MG- & JG+)
-  or <null-prep-qu>;
+  or <null-prep-qu>
+  or (MVp- & JQ+ & J+);

--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
- JQ    ,  Mj    Wj    MX#j                  , "Misuse of preposition13" ,
- Jw    ,  Mj    Wj    MX#j                  , "Misuse of preposition14" ,
+ JQ    ,  Mj    Wj    MX#j Pa               , "Misuse of preposition13" ,
+ Jw    ,  Mj    Wj    MX#j Pa               , "Misuse of preposition14" ,

Results:

...
    +-------------------------------Xp-------------------------------+
    +---------->WV--------->+                                        |
    +------>Wd------+       |                   +------Jp-----+      |
    |        +--AN--+---Ss--+----Pa----+---MVp--+-JQ-+---Dmc--+      |
    |        |      |       |          |        |    |        |      |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d in.r which countries.n ?
...
    +-----------------------Xp----------------------+
    +---------->WV--------->+                       |
    +------>Wd------+       |                       |
    |        +--AN--+---Ss--+----Pa----+---QI---+   |
    |        |      |       |          |        |   |
LEFT-WALL bird.n flu.n-u was.v-d observed.v-d where ?

And, as needed, "*I saw in which room" doesn't parse:

    +---->WV--->+
    +->Wd--+Sp*i+-MVp-+------Ju------+
    |      |    |     |              |
LEFT-WALL I.p saw.w in.r [which] room.n-u
...
!bad
...
"Misuse of preposition13"

    +---->WV--->+     +-----Js----+
    +->Wd--+Sp*i+-MVp-+-JQ-+-Ds**c+
    |      |    |     |    |      |
LEFT-WALL I.p saw.w in.r which room.s

Corpus error count:

corpus        now   patched   diff   linkage-limit
basic          82        82      0            1000
fixes         379       379      0            1000
fix-long        9         9      0           10000
failures     1556      1554     -2            1000
pandp-union  2016      2011     -5            1000

ampli commented 4 years ago

slower by 10% or 20% or something like that

Can it be that you tested it on intermediate changes? For me the slowdown of your "qi" branch is only a few percent at most. In any case I have a WIP on improving expression handling and also pruning (both expression and power), so this may allow increasing the dict complexity without much more overhead.

We can also look at that from another angle: Improving the library speed will allow a much more complex dict without being too sluggish.

ampli commented 4 years ago

I remeasured performance, correctly, this time; the performance hit is minor

Only now do I see that you have already addressed that...

ampli commented 4 years ago

After you applied PR #1051, we get:

linkparser> Sophy wondered up to what number to count to
Found 28 linkages (28 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 6.00 LEN=14)

    +-------->WV------->+-----MVp-----+-----J-----+---------B---------+
    +-->Wd---+---Ss*s---+---MVa--+    +-JQ-+-Ds**c+---R--+--I--+--MVp-+
    |        |          |        |    |    |      |      |     |      |
LEFT-WALL Sophy.f wondered.v-d up.e to.r what number.n to.r count.v to.r

Among other things, this seems to me wrong:

Ss*s---+---MVa--+ 
       |        | 
 wondered.v-d up.e

Isn't up a modifier of to and not of wondered? Compare the symmetric sentence in the context of reverse counting: "Sophy wondered down to what number to count to". Clearly down here is not a verb modifier. Can the problem be solved by attaching up etc. to to using Mj? BTW, I also don't think 'up to' here is an idiom, because instead of up I can think of some other words (down, approximately, nearly, exactly).

Compare to that:

linkparser> Sophy wondered right to which one she should stand on the stage
Found 218 linkages (8 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.53 LEN=34)

                        +-------------------------MVp-------------------------+
                        |          +--------------------Mp--------------------+
                        |          |       +-------------CV------------>+     |
                        |          |       +------Cs-----+              |     |
    +------->CPx--------+          |       +----Js---+   |              |     +---Js---+
    +-->Wa---+          +---SIsj---+---Mj--+-JQ-+-Ds-+   +--Ss--+---I---+     |  +Ds**c+
    |        |          |          |       |    |    |   |      |       |     |  |     |
LEFT-WALL Sophy.f wondered.q-d right.n-u to.r which one she should.v stand.v on the stage.n
linas commented 4 years ago

I saw in which room

This is actually ambiguous. On the surface, it seems like an absurd sentence, but it's a plausible reply to the question: "Did you see in which room they held bingo night?" Anyway, your proposal can be simplified to:

--- a/data/en/4.0.knowledge
+++ b/data/en/4.0.knowledge
@@ -217,8 +217,8 @@ CONTAINS_ONE_RULES:
  Mj    ,  Jw    JQ                          , "Incorrect relative10" ,
  MX#j  ,  Jw    JQ                          , "Incorrect relative11" ,
  Wj    ,  Jw    JQ                          , "Misuse of preposition12" ,
- JQ    ,  Mj    Wj    MX#j  MVp             , "Misuse of preposition13" ,
- Jw    ,  Mj    Wj    MX#j                  , "Misuse of preposition14" ,
+ JQ    ,  Mj    Wj    MX#j  Pv              , "Misuse of preposition13" ,
+ Jw    ,  Mj    Wj    MX#j  Pv              , "Misuse of preposition14" ,
  B#j   ,  Jr                                , "Incorrect relative15"    ,
  Jr    ,  B#j                               , "Incorrect relative16"    ,
 ; The two below prevent "How big?" and "How quickly?"

Also, yes, the Sophy sentences are broken

linas commented 4 years ago

I think up to could be an idiom, here, because:

*Sophy wondered exactly to what number to count to
Sophy wondered exactly what number to count to
How high did it go?
Up to what mark did it reach?
Exactly what mark did it reach?
up to where did it go?
Up to how many gallons were lost?
down to which floor did it drop?
down to what depravities did he sink?

The above are easily fixed by up_to down_to: EW+;

A proper fix for the others requires link-crossing. This is best illustrated by pondering the sentence: "Sophy wondered [up to] whose favorite number she should count to" and then realizing that [up to] needs to modify "number" not "whose". Unfortunately, this is not possible without link-crossing.

There is a work-around for link-crossing, but it is hacky: I did it once, here: see Jj and Jk at bottom of page at https://www.abisource.com/projects/link-grammar/dict/section-J.html

Doing such a hack in the dozen-plus cases where it is needed is painful and ugly. I would rather be able to say "link X can cross link Y or Z once". I don't think two crossings are ever needed. I don't think that anything should be allowed to cross anything in general. The README has accumulated a bunch of these...

ampli commented 4 years ago

I would rather be able to say "link X can cross link Y or Z once".

Will it be fine to do the hack automatically on dict read according to such definitions?

linas commented 4 years ago

Will it be fine to do the hack automatically on dict read according to such definitions?

Hmm. That's an interesting idea. Yeah, maybe I like it. So we need several things:

1) Some way to write down "link X can cross link Y" in the dictionary.

2) by analogy to Jj and Jk, your hack to auto generate Xj and Xk and then auto-add Xj- & Y & Xk+

Yeah, I like that. The tricky part to 2 is to put the subscripts in a slot which is unused. Maybe we could put them in the "first" slot, like h and d for head/dependent, except they're cross-from-left and cross-from-right, so maybe l and r, and the ascii diagrammer could use parens to print them! Like so:

               +------------------+
               |              +--)|(--------+
               |              |   |         |
He had been allowed to eat a cake by Sophy that she had

so the parens make a little "tunnel" where the link crosses, and the logic for that would be just like drawing the arrow-heads for h and d arrows. So yeah, that seems slick...

To be clear: you would auto-add lX- & Y & rX+ ...it might even be possible to do this with an m4 macro hack. Ugh.

The example in https://www.abisource.com/projects/link-grammar/dict/section-J.html is actually complex, because, there, the J link crosses two others: it crosses both the I and the VJlpi links...

So for this example, Yikes... it's yucky. In the current dict, it's I- & Jj- & VJlpi- & VJrpi+ & Jk+ and so it's not obvious that J is crossing I and VJlpi but is not crossing VJrpi+ ... what a mess.

The render would be


          +-------I---------+
          |     +--VJlpi----+                        
          |     |    +-Js--)|(----------Js----------+
          |     +-MVp+      +--VJrpi-+--MVp-+---Js--+
          |     |    |      |        |      |       |
   ...  to.r look.v at  and.j-v  listen.v to.r everything

and the notation would be I- & lJs- & VJlpi- & VJrpi+ & rJs+
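As a hedged illustration of the auto-generation step discussed above, the following Python sketch builds the wrapped expression from a crossing link name and the connectors it should tunnel under. The helper `bypass_expression` and its parameters are hypothetical; nothing like it exists in the library, and the marker letters are parameters because the choice of letters is still open in this discussion.

```python
def bypass_expression(crossing, crossed, prefix=(), left_mark="l", right_mark="r"):
    """Wrap `crossed` connectors with the two halves of a crossing link.

    crossing: link name without direction, e.g. "Js" (hypothetical input).
    crossed:  connectors the link tunnels under, e.g. ["VJlpi-", "VJrpi+"].
    prefix:   connectors that stay outside and to the left, e.g. ["I-"].
    """
    parts = [
        *prefix,
        f"{left_mark}{crossing}-",    # left half of the underpass
        *crossed,
        f"{right_mark}{crossing}+",   # right half of the underpass
    ]
    return " & ".join(parts)


print(bypass_expression("X", ["Y"]))
print(bypass_expression("Js", ["VJlpi-", "VJrpi+"], prefix=["I-"]))
```

The first call reproduces the generic lX- & Y & rX+ shape, and the second reproduces the and.j-v notation given above. Whether the generated halves should carry a marker letter at all is exactly the open question in the rest of the thread.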

Instead of l and r maybe s and r because l and 1 and I all look alike too much. s is for Latin sinister.

Or p and q as a visually mirror-symmetric pair. Or w and v . Or e and a

Hmm. except for p and q, it appears that the Latin alphabet was explicitly designed to avoid mirror-symmetric letters. Interesting. This is also the case for cyrillic and greek. ... interesting ...

ampli commented 4 years ago

When I look at !!and-j-v, I see several other VJlx- & VJrx+ constructs, and even a very similar one to the one that has the Jj Jk device: ({Xd-} & hVJlpi- & {N+} & {TO+} & hVJrpi+). Why don't they also need this device, especially the last one that also includes hVJlpi- & hVJrpi+?

Another question: I don't like the complication of using the UC front position. Is there something bad in using a bool mark in the Connector struct?

linas commented 4 years ago

The Jj - Jk device is "recent" (well, OK maybe over a year old now) and is used in only one place (OK, now maybe two), and was created as an experiment to see how well it works (how convenient or confusing it is, how much trouble it causes vs. how much trouble it saves...) It was never deployed on a wide-scale basis. The post above (https://github.com/opencog/link-grammar/issues/1050#issuecomment-557770152) is the newest/best way I can think of for making it fully generic and "obvious".

the UC front position

I don't understand the question. The goal of fronting UC is to have a notation in 4.0.dict to indicate that "opposites connect". Maybe this could be moved so that it comes after the +/- connector-dir. Or allows additional symbols besides +/- ...

ampli commented 4 years ago

The goal of fronting UC is to have a notation in 4.0.dict to indicate that "opposites connect".

What is special about their matching rules? For me it seems as if regular Js- and Js+ connectors are fine, and only the code that draws the diagram needs to know they denote a cross-link (and hence my suggested bool mark).

linas commented 4 years ago

and only the code that draws the diagram needs to know

And how will it know this? it's not just J that might cross, it could be .. A or S or a dozen others.

ampli commented 4 years ago

And how will it know this? it's not just J that might cross, it could be .. A or S or a dozen others.

My idea is that these connectors (in your example A, S and others) that serve as "bypass" connectors will be marked in their connector struct. Questions: Why is there any need to explicitly make these marks in the connector string? Is there any code, besides the diagram drawing code, that needs to be aware that there is anything special here?

linas commented 4 years ago

in their connector struct.

I'm not concerned with how they are handled in the C code. Finding a good representation for 4.0.dict is my primary concern.

Is there any code that needs to be aware

Presumably, "most" applications of LG are interested in the dependency diagram, in the abstract, as a graph, and now, as a non-planar graph. So there will need to be a step that says "aha, here's something that in LG looks like two links, but it's really just one." The app itself could figure that out, or we could provide that extra step ourselves, in the LG api. In addition, the app might want to know which links cross.

The only real problem with this is that there are very few, approaching zero apps of LG, at least, that are public, that anyone talks about. Every now and then I get hints of proprietary apps, but they never seem heavily vested. So all this is very hypothetical.

Yes, other parts of opencog use LG, but ... not very well, not very robustly, not very deeply.

ampli commented 4 years ago

I'm not concerned with how they are handled in the C code. Finding a good representation for 4.0.dict is my primary concern.

But the whole idea is that the special "bypass" connectors don't appear at all in 4.0.dict, as they are only inserted later by the LG library code. So how can they be represented there? What is to be represented there is something like: <XLINK>: Js+ & VJrpi-; % Connection from Js may cross the rest of connectors. (For now it seems to me no need to specify the less deeper connectors too that it also would cross, unless this may lead to incorrect parses. I also don't know if an exact connector match should be done for VJrpi- or "easy-match".)

If you mean their representation when printing the actual expression that is used (or its disjuncts), then it really doesn't matter, from a programming standpoint, in which representation they are displayed, and indeed the most convenient representation should be used.

and the notation would be I- & lJs- & VJlpi- & VJrpi+ & rJs+ Instead of l and r maybe s and r because l and 1 and I all look alike too much. s is for Latin sinister. Or p and q as a visually mirror-symmetric pair. Or w and v . Or e and a

I still didn't understand if the LG library code should make any special interpretation of these leading LC letters (supposing it already knows these are "bypass" connectors - after all the LG library code knows what it added). For now it seems to me these special letters don't play any role in the connector matching algorithm (unlike h/d and the rest of the letters in the connector string), and even not in the drawing algorithm (since it is already known which connectors are the "bypass" ones).

I guess I'm not clear enough in my questions and proposals, or that I didn't understand something (or both). I will try to make a real implementation and see if it works fine, but still answers to the above would help.

What also would help me are additional diagrams of the desired results. E.g. for some of the "Sophy" sentences (only links from words that have cross-links are needed).

ampli commented 4 years ago
               +------------------+
               |              +--)|(--------+
               |              |   |         |
He had been allowed to eat a cake by Sophy that she had

When I try to draw this using the current parse, I get:

                              +-------------------Bs------------------+
                              |                                       |
      +----------MVp---------)|(---+                                  |
      |                       |    |                                  |
----->+------IV---->+----Os---+---)|(---R--------+---------CV-------->+----
-Pv---+---TO---+-I*t+   +Ds**c+-Mp-+-Js-+        +--Cr-+--Ss-+---PP---+--Ox
      |        |    |   |     |    |    |        |     |     |        |    
 allowed.v-d to.r eat.v a  cake.s by Sophy.f that.j-r she had.v-d made.v-d 

I.e. 2 link crossings. What did I miss?
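The count can be checked mechanically. Below is a small Python sketch, under the assumption that each link is given as a label plus the indices of the two words it connects; two links cross when exactly one endpoint of one lies strictly between the endpoints of the other. The word indices were transcribed by hand from the diagram above (allowed=0, to=1, eat=2, a=3, cake=4, by=5, Sophy=6, that=7, she=8, had=9, made=10), and the function names are illustrative only.

```python
from itertools import combinations


def crosses(a, b):
    """True if the word spans a=(left, right) and b=(left, right) cross."""
    (l1, r1), (l2, r2) = sorted(a), sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1


def crossing_pairs(links):
    """Return label pairs of all crossing links; links are (label, left, right)."""
    return [
        (x[0], y[0])
        for x, y in combinations(links, 2)
        if crosses(x[1:], y[1:])
    ]


# Links of the hand-drawn diagram, limited to the words shown in it.
links = [
    ("IV", 0, 2), ("TO", 0, 1), ("I*t", 1, 2), ("Os", 2, 4), ("Ds**c", 3, 4),
    ("Mp", 4, 5), ("Js", 5, 6), ("R", 4, 7), ("Cr", 7, 8), ("Ss", 8, 9),
    ("PP", 9, 10), ("CV", 7, 10), ("Bs", 4, 10), ("MVp", 0, 5),
]

for pair in crossing_pairs(links):
    print(pair)
```

With these indices the only crossing pairs are R with MVp and Bs with MVp. Links that merely share a word, such as Mp and MVp at "by", are not counted as crossing.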

And this current linkage seems to need 3 link crossings:

Linkage 2, cost vector = (UNUSED=0 DIS= 0.50 LEN=39)

                    +------------------------------MVa---------------------
                    |         +-------------------Bs------------------+    
      +------IV---->+----Os---+---------R--------+---------CV-------->+    
-Pv---+---TO---+-I*t+   +Ds**c+-Mp-+-Js-+        +--Cr-+--Ss-+---PP---+--Ox
      |        |    |   |     |    |    |        |     |     |        |    
 allowed.v-d to.r eat.v a  cake.s by Sophy.f that.j-r she had.v-d made.v-d 

Maybe this MVa, and another one from allowed to specially at linkage 13, are incorrect?

linas commented 4 years ago

But the whole idea is that the special "bypass" connectors don't appear at all in 4.0.dict

Oww. I forgot that is what we're talking about. I have to run out now, but will re-read and rethink.

linas commented 4 years ago

For now it seems to me no need to specify the less deeper connectors too that it also would cross, unless this may lead to incorrect parses.

Given that even our simplest examples seem to need to cross multiple links, this seems like a reasonable assumption. But really, we'll have to test and see.

I also don't know if an exact connector match should be done for VJrpi- or "easy-match".

It should be a regular match.

I still didn't understand if the LG library code should make any special interpretation of these leading LC letters. For now it seems to me these special letters don't play any role in the connector matching algorithm.

Sorry for generating confusion about this before. To answer this question directly: in the prototype, I used two different subtypes, Jj and Jk so that Jj formed the left half of the underpass link, and Jk formed the right half of the link. This seemed like the right way to do it; and I'm not sure what would have happened if I'd just used a single subtype, Jx for example. I'll try a brief experiment now ...

... done. I collapsed Jj and Jk into just Jx, and nothing seemed to change, but I was sloppy, so might be wrong...

linas commented 4 years ago

I.e. 2 links crossings. What did I miss?

Nothing; that's correct.

this current linkage seems to need 3 link crossings:

It seems to, but if we can block it from happening, that's fine, because it's wrong/implausible.

ampli commented 4 years ago

I updated the above diagram with what seems the correct disjunct of allowed:

                             +-------------------Bs------------------+
     +-------------MVp------)|(---+                                  |
     |             +----Os---+---)|(---R--------+---------CV-------->+----
Pv---+---MVi--+--I-+   +Ds**c+-Mp-+-Js-+        +--Cr-+--Ss-+---PP---+--Ox
     |        |    |   |     |    |    |        |     |     |        |    
allowed.v-d to.r eat.v a  cake.s by Sophy.f that.j-r she had.v-d made.v-d 

But in any case there is a fundamental problem in the "bypass" connector idea here, as the word by cannot be connected twice to the word cake (Mp and MVp).

So I have no idea how to implement this and at the same time to preserve the link:

   +-Mp-+
   |    |
cake.s by
ampli commented 4 years ago

Or maybe it is wrong and should be omitted in any case? See linkage 2.

    Linkage 1, cost vector = (UNUSED=0 DIS=-0.51 LEN=17)

                                               +------MVp-----+
    +------------>WV------------>+------IV---->+----Os---+    |
    +->Wd--+-Ss-+--PPf--+---Pv---+---TO---+-I*t+   +Ds**c+-Mp-+-Js-+
    |      |    |       |        |        |    |   |     |    |    |
LEFT-WALL he had.v-d been.v allowed.v-d to.r eat.v a  cake.s by Sophy.f

Press RETURN for the next linkage.
linkparser> 
    Linkage 2, cost vector = (UNUSED=0 DIS= 0.10 LEN=17)

                                               +------MVp-----+
    +------------>WV------------>+------IV---->+----Os---+    |
    +->Wd--+-Ss-+--PPf--+---Pv---+---TO---+-I*t+   +Ds**c+    +-Js-+
    |      |    |       |        |        |    |   |     |    |    |
LEFT-WALL he had.v-d been.v allowed.v-d to.r eat.v a  cake.s by Sophy.f

linas commented 4 years ago

preserve the link

Yes, there are some rare cases where one might like to have two different links connecting the same pair of words. This is one of them.

Or maybe it is wrong

It's not wrong, but it's also not exactly right. cake --Mp-- by Sophy implies that Sophy made the cake (which later turns out to be true... but we don't know that yet); in the first half of this sentence, the correct reading is "allowed by Sophy", so really the correct parse has allowed --MV-- by Sophy. So in this case, dropping the Mp link is not wrong, and in fact, it's more correct to kill the Mp link.

ampli commented 4 years ago

In my WIP, I started with: <fxlink>: [MVp+]-1.65 & R+; Comments:

  1. This causes the first connector to be inserted in the expressions before the second one using the indicated cost, and also to be inserted at the opposite jet (with the opposite direction sign, - for this example, but with cost 0).
  2. The negative cost was needed so I will see the crossing links among the first linkages.
  3. I used lowercase <fxlink> to distinguish the macro from a regex label (so we can, for example, write a dict analyzer that will, among other things, warn on regex labels without regexes). Another name, or a totally different label format (e.g. @fxlink), can be used.
  4. For more than one definition, in order to prevent duplicate word error a subscript can be used, as in: <fxlink>.something: [MVp+]-1.65 & R+; <fxlink>.anotherthing: [MVs+]-1.65 & R+; This may help with error messages. However, for now I just allowed multiple <fxlink> labels.
  5. For insertion efficiency, a more complex format may be needed: <fxlink>.something: [MVp+ or MVs+]-1.65 & R+;

However, inserting MVp before any R+ causes a 2x slowdown on the long batches due to the added disjuncts (a big portion of them do not get pruned). Since the observed results only include crossing of both R and Bs, Bsp or Bsw links, it seems this would be faster: <fxlink>: [MVp+]-1.65 & R+ & Bs+; (I haven't yet completed its implementation, which is much more complex than inserting before a single connector, so I don't have a slowdown factor yet, but I guess the result will be faster.)
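To make the rule and its cost concrete, here is a minimal Python sketch of the insertion as described in comments 1 and 2 above. It is a toy model, not the WIP code: the `Disjunct` class, the `apply_fxlink` helper, the "nearest word first" ordering, and the choice to keep the original disjunct alongside the edited one are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Disjunct:
    """Toy disjunct: connector names per direction, plus a cost."""
    left: Tuple[str, ...]   # "-" connectors, nearest word first (assumed order)
    right: Tuple[str, ...]  # "+" connectors, nearest word first (assumed order)
    cost: float = 0.0


def apply_fxlink(d: Disjunct, new: str, anchor: Tuple[str, ...],
                 cost: float) -> List[Disjunct]:
    """Insert `new` before the first occurrence of the `anchor` sequence.

    The mirrored connector (opposite direction sign) is added to the
    opposite jet at cost 0. The unmodified disjunct is kept as well
    (an assumption), so every match adds one more disjunct.
    """
    n = len(anchor)
    for i in range(len(d.right) - n + 1):
        if d.right[i:i + n] == anchor:
            mirrored = new[:-1] + "-"
            crossed = Disjunct(
                left=(mirrored,) + d.left,
                right=d.right[:i] + (new,) + d.right[i:],
                cost=d.cost + cost,
            )
            return [d, crossed]
    return [d]


# "cake.s" as it appears in the diagrams: Ds**c- and Os- on the left, R+ and Bs+ on the right.
cake = Disjunct(left=("Ds**c-", "Os-"), right=("R+", "Bs+"))
# Hypothetical disjunct that has R+ but no Bs+.
r_only = Disjunct(left=("Os-",), right=("R+",))

loose = [x for d in (cake, r_only) for x in apply_fxlink(d, "MVp+", ("R+",), -1.65)]
strict = [x for d in (cake, r_only) for x in apply_fxlink(d, "MVp+", ("R+", "Bs+"), -1.65)]
print(len(loose), len(strict))
```

In this toy, the loose rule (`R+` alone) duplicates every disjunct that contains `R+`, while the stricter `R+ & Bs+` anchor duplicates fewer of them, which is the intuition behind expecting the stricter rule to be faster.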

BTW, this idea of defining link crossing has a problem that it cannot force the regular connector match rules. E.g. if you would like (just for the example) MV to cross, then you can get a bad link like ---MVp---)|(---MVs. I don't know how to overcome that.

Another problem I had to overcome is preventing label crossing in the diagram. For example, in this kind of crossing

                             +-------------------Bs------------------+
     +-------------MVp------)|(---+                                  |
     |             +----Os---+---)|(---R--------+---------CV-------->+----
Pv---+---MVi--+--I-+   +Ds**c+-Mp-+-Js-+        +--Cr-+--Ss-+---PP---+--Ox
     |        |    |   |     |    |    |        |     |     |        |    
allowed.v-d to.r eat.v a  cake.s by Sophy.f that.j-r she had.v-d made.v-d 

the label on a vertical cross link R can be overwritten by `)|(`. The label can be moved, but it is complex to ensure that there will be enough room for that. Instead, I chose another solution: only allow crossing vertical lines, so the current printout in my WIP is:

                                                         +--------------------Bs-------------------+
                                                         +----------R---------+                    |
                                 +----------MVp---------)|(-MVp-+             |                    |
    +------------>WV------------>+             +----Os---+      |             +---------CV-------->+-----MVa----+
    +->Wd--+-Ss-+--PPf--+---Pv---+---MVi--+--I-+   +Ds**c+      +-Js-+        +--Cr-+--Ss-+---PP---+--Ox-+      |
    |      |    |       |        |        |    |   |     |      |    |        |     |     |        |     |      |
LEFT-WALL he had.v-d been.v allowed.v-d to.r eat.v a  cake.s   by Sophy.f that.j-r she had.v-d made.v-d him specially

It is still able to generate horizontal line crossings if there is no other choice, because I didn't remove the code that does it (I don't have such examples for now).

The +----------MVp---------)|(-MVp-+ printout can be modified to be (not implemented) +---------------MVp----)|(-----+ (one label in the center of the link line) but: 1. I'm not sure it is better; 2. A complex logic would be needed (but it is still straightforward). So I leave it as is for now.

ampli commented 4 years ago

(My previous post above has been edited -- as usual -- so it is better to read it on the web.) I got the following linkages, please check if they make sense:

linkparser> Onward went the cavalry, spurred to extraordinary exertion by the fact that provisions began to run short.
                                               +----------------------------------------------------Xc---------------------------------------------------+
                                               |                                 +----------------------------Bsd----------------------------+           |
                                               |                                 +------------R-----------+                                  |           |
                                               +---------------MVp--------------)|(-MVp--+                |                                  |           |
    +------>WV------>+---->SIs----+----MXsp----+        +-----------Ju-----------+       +---Jp---+       +----------CV-------->+-----IV---->+           |
    +-->Wp-->+<-PFb<-+     +-Ds**c+     +--Xd--+---MVp--+         +-------A------+       |  +D*u*c+       +----Cr----+---Sp*t---+---TO--+-I*t+--MVa-+    |
    |        |       |     |      |     |      |        |         |              |       |  |     |       |          |          |       |    |      |    |
LEFT-WALL onward went.v-d the cavalry.n , spurred.v-d to.r extraordinary.a exertion.n-u by the fact.n that.j-r provisions.n began.v-d to.r run.v short.e .

...

                                               +----------------------------------------------------Xc---------------------------------------------------+
                                               |                                 +----------------------------Bsw----------------------------+           |
                                               |                                 +------------R-----------+                                  |           |
                                               +---------------MVp--------------)|(-MVp--+                |                                  |           |
    +------>WV------>+---->SIs----+----MXsp----+        +-----------Ju-----------+       +---Jp---+       +----------CV-------->+-----IV---->+           |
    +-->Wp-->+<-PFb<-+     +-Ds**c+     +--Xd--+---MVp--+         +-------A------+       |  +D*u*c+       +----Cr----+---Sp*t---+---TO--+-I*t+--MVa-+    |
    |        |       |     |      |     |      |        |         |              |       |  |     |       |          |          |       |    |      |    |
LEFT-WALL onward went.v-d the cavalry.n , spurred.v-d to.r extraordinary.a exertion.n-u by the fact.n that.j-r provisions.n began.v-d to.r run.v short.e .

...

                                               +----------------------------------------------------Xc---------------------------------------------------+
                                               |                                 +-----------------------------Mv----------------------------+           |
                                               |                                 +--------------------------Bs--------------------------+    |           |
                                               |                                 +------------R-----------+                             |    |           |
                                               +---------------MVp--------------)|(-MVp--+                |                             |    |           |
                     +---->SIs----+----MXsp----+        +-----------Ju-----------+       +---Jp---+       +----------CV-------->+       |    |           |
    +-->Wp-->+<-PFd<-+     +-Ds**c+     +--Xd--+---MVp--+         +-------A------+       |  +D*u*c+       +----Cr----+---Sp*t---+--MVp--+    +--MVa-+    |
    |        |       |     |      |     |      |        |         |              |       |  |     |       |          |          |       |    |      |    |
LEFT-WALL onward went.v-d the cavalry.n , spurred.v-d to.r extraordinary.a exertion.n-u by the fact.n that.j-r provisions.n began.v-d to.r run.v short.e .

ampli commented 4 years ago

A proper fix for the others requires link-crossing. This is best illustrated by pondering the sentence: "Sophy wondered [up to] whose favorite number she should count to" and then realizing that [up to] needs to modify "number" not "whose". Unfortunately, this is not possible without link-crossing.

Which link should be allowed to cross which other link in that case? I would like to add these sentences as tests to my WIP.

ampli commented 4 years ago

For the sentences: I want to look at and listen to everything. We currently get:

    +-------------------------------Xp-------------------------------+
    |            +----------IV-------->+                             |
    |            |     +------I*t------+                             |
    +---->WV---->+     |     +<-VJlpi<-+-----------Jk----------+     |
    +->Wd--+-Sp*i+--TO-+     +-MVp+-Jj-+->VJrpi>+--MVp-+---Js--+     |
    |      |     |     |     |    |    |        |      |       |     |
LEFT-WALL I.p want.v to.r look.v at and.j-v listen.v to.r everything .

In order to get the cross link, I added:

: [Js-]-1 & hVJlpi-;

And also made this change (note the `@`):

```diff
:
-  (Ss*s+ & ) or SIs- or (Js- & ({Jk-} or {Mf+})) or Os-
+  (Ss*s+ & ) or SIs- or (@Js- & {Mf+}) or Os-
```

I then get:

```text
    +--------------------------------Xp-------------------------------+
    +---------------->WV--------------->+                             |
    |      +------------Sp*i------------+                             |
    |      |     +<--------VJlpi<-------+                             |
    |      |     +-----IV--->+    +--Js)|(----------Js----------+     |
    +->Wd--+     +--TO-+-I*t-+-MVp+     +->VJrpi>+--MVp-+---Js--+     |
    |      |     |     |     |    |     |        |      |       |     |
LEFT-WALL I.p want.v to.r look.v at  and.j-v listen.v to.r everything .
```

But I also get:

```text
    Linkage 2, cost vector = (UNUSED=0 DIS= 0.00 LEN=25)

    +--------------------------------Xp-------------------------------+
    +---------------->WV--------------->+                             |
    |      +------------Sp*i------------+                             |
    |      |     +<--------VJlpi<-------+------MVp------+             |
    |      |     +-----IV--->+    +--Js)|(----------Js-)|(------+     |
    +->Wd--+     +--TO-+-I*t-+-MVp+     +->VJrpi>+      +---Js--+     |
    |      |     |     |     |    |     |        |      |       |     |
LEFT-WALL I.p want.v to.r look.v at  and.j-v listen.v to.r everything .
```

This 2-segment cross-link display is strange because my code was designed to handle 2-segment links only, and it happens that it somehow handles more than planned. **EDIT**: This is still a 2-segment link, and the `MVp` just should have been drawn below the `Js` link (to be fixed). But links with more than 2 segments seem possible to me.

And also the badly printed diagram:

```text
    +--------------------------------Xp-------------------------------+
    |            +----------IV--------->+                             |
    |            |                +--Js)|(----------Js----------+     |
    |            |     +-------I*)|(----+                       |     |
    +---->WV---->+     |     +<--VJlpi<-+------MVp------+       |     |
    +->Wd--+-Sp*i+--TO-+     +-MVp+     +->VJrpi>+      +---Js--+     |
    |      |     |     |     |    |     |        |      |       |     |
LEFT-WALL I.p want.v to.r look.v at  and.j-v listen.v to.r everything .
```

Note that the `I*t` label is overwritten and the `VJlpi` link is crossed in the middle.

I'm not sure it can be arranged to cross only vertical lines (like the case of the "cake" sentence), so I may need to implement the complex code for label relocation (cases like the above `I*t`) and for increasing label spacing (cases like the above `VJlpi`). **EDIT**: This one can be fixed to cross a vertical line only.

Are these linkages correct? If not, how can they be limited to correct ones only?

ampli commented 4 years ago

We also need to think about undesired effects of such cross links on postprocessing, since it would think that each of the fake cross-link segments has a real link label.

ampli commented 4 years ago

Changing Js to @Js turned out to be a bad solution (at least without further changes), since I now get the following parse:

linkparser> A picture of dogs are in the yard
Found 4 linkages (4 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 1.05 LEN=14)

                       +------------Js------------+
    +---->Wa----+      |                +----Js---+
    |     +Ds**c+--Mf--+    +-Spx-+--Pp-+   +Ds**c+
    |     |     |      |    |     |     |   |     |
LEFT-WALL a picture.n of dogs.n are.v in.r the yard.n

ampli commented 4 years ago

Now this sentence parses, but still with a badly drawn diagram (to be fixed):

    +-------------------------------------------------------------------Xp-------------------------------------------------------------------+
    |                                                             +-------------IV----------->+          +--Js-------------Js----------+     |
    +------------------>WV----------------->+                     |       +--------I*t--------+----->VJrpi---->+                       |     |
    +-------->Wd---------+-------Ss*s-------+        +-----CV---->+       |    +<----VJlpi<---+     +<--VJlpi<-+                       |     |
    |        +-----G-----+          +---E---+---TH---+-Cet-+--Ss--+---TO--+    +--MVp--+      |     +-MVp+     +->VJrpi>+--MVp-+---Js--+     |
    |        |           |          |       |        |     |      |       |    |       |      |     |    |     |        |      |       |     |
LEFT-WALL Shel[!] Silverstein[!] once.e said.v-d that.j-c he wanted.v-d to.r go.v everywhere ,.j look.v at  and.j-v listen.v to.r everything .

linas commented 4 years ago

I'm slightly confused by the statements about <fxlink> - presumably, this is something that gets added to individual words, on an as-needed basis, and is not something globally applied, right?

Things like @Js- are difficult; there need to be additional constraints that only allow multiple Js if there are VJ's in the sentences, and only if they're connecting. Forcing stuff like this quickly gets convoluted and tricky.

linas commented 4 years ago

I got the following linkages Onward went the cavalry, spurred to extraordinary exertion by the fact that provisions began to run short.

They are all wrong; there should not be any connection between "exertion" and "that". This is explained in https://www.abisource.com/projects/link-grammar/dict/section-B.html -- so "the dog I had chased was black" -- in this case "I had chased" is modifying "dog" with a B link. But "provisions began to run short" is not modifying "exertion".

In the Sophy example, "she had made" is a B-modifier of "cake", so there, the B is correct.

ampli commented 4 years ago

I'm slightly confused by the statements about <fxlink> - presumably, this is something that gets added to individual words, on an as-needed basis, and is not something globally applied, right?

It is done according to your specification in https://github.com/opencog/link-grammar/issues/1050#issuecomment-557770152:

  1. Some way to write down "link X can cross link Y" in the dictionary.
  2. by analogy to Jj and Jk, your hack to auto generate Xj and Xk and then auto-add Xj- & Y & Xk+

(See the whole specification there.)

So the <fxlink> definitions are globally applied to the whole dictionary. I used: <fxlink>: MVp+ & R+; to say that MVp is allowed to cross every R. But this is needlessly permissive, so I intended to try: <fxlink>: MVp+ & R+ & Bs+; to say that MVp is allowed to cross only both R+ & Bs+ at once. (I'm still in the middle of writing a more complex connector sequence matcher to allow a more flexible <fxlink> syntax.)

Things like @Js- are difficult; there need to be additional constraints that only allow multiple Js if there are VJ's in the sentences, and only if they're connecting.

If you can specify these additional constraints exactly, then maybe I will be able to enforce them automatically.

They are all wrong; there should not be any connection between "exertion" and "that".

So it turns out that <fxlink>: MVp+ & R+ is doing the right thing for "He had been allowed to eat a cake by Sophy that she had made him specially" but not for "Onward went the cavalry, spurred to extraordinary exertion by the fact that provisions began to run short.", even though the exact same links are allowed to cross.

Hence my question is how to limit the cross link specification so it will not generate wrong cross links like that.

linas commented 4 years ago

is doing the right thing for Sophy but not for exertion.

Yes. These two sentences "obviously" differ in that, for the Sophy sentence, cake has O- & R+ & B+ -- that is, a rule which says "direct objects can have relative modifiers". By contrast, exertion has J- & R+ & B+ so maybe this should be disallowed?

Some experimentation shows that this nonsense sentence does get a parse: Onward went the cavalry, spurred to extraordinary exertion that provisions began to run short. - again with the troublesome J- & R+ & B+ disjunct. But this one makes sense: Onward went the cavalry, eating so much that provisions began to run short. and it uses the O- & R+ & B+ disjunct. Also this sentence makes sense, and has a good parse: Onward went the cavalry, showing such extraordinary gluttony that provisions began to run short.

I cannot think of any sentences that require J- & R+ & B+ so maybe removing this from the dict is the correct thing to do? That requires another experiment ... remove it, and see what fails (if anything). I'll try that experiment now.

In general, I don't really like the global context, because, in general, whether something is allowed or not depends on the local context. The above provides an example of local context: was the (R+ & B+) in a disjunct with J- or with O- or with something else?

To answer your other question about VJ: this requires enforcing long-range order, and long-range order can be enforced with additional subscripts. For example, one possible fix is to create J***v+ so that v prevents connections to anything that doesn't have a J***v- & VJ+ -- do this by making all other J have J***x so that the x blocks the connection. (There might be other solutions. Some of the post-processing rules do this kind of enforcement, but in a different way. I tend to not like post-processing)

This is a complicated example; there are simpler examples in the dict, the most obvious being the singular-plural distinction: so Js, Os, Ss, SIs are never mixed into disjuncts that have Jp, Op, Sp, SIp in them, thus forcing long-range agreement on singular-plural, across many different connector types, and thus across longer spans of links.
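The subscript mechanism relied on here can be illustrated with a small Python sketch of connector-name matching. It assumes the usual Link Grammar convention: the uppercase parts must be identical, subscripts are compared position by position, `*` matches anything, and the shorter subscript is treated as padded with `*`. The helper names are hypothetical, and direction signs and head/dependent prefixes (such as the `h` in `hVJlpi-`) are deliberately not handled.

```python
def split_connector(name: str):
    """Split 'J***v+' into its uppercase part 'J' and subscript '***v'."""
    name = name.rstrip("+-")
    i = 0
    while i < len(name) and name[i].isupper():
        i += 1
    return name[:i], name[i:]


def connectors_match(a: str, b: str) -> bool:
    """Return True when two connector names are compatible (direction ignored)."""
    upper_a, sub_a = split_connector(a)
    upper_b, sub_b = split_connector(b)
    if upper_a != upper_b:
        return False
    # zip() stops at the shorter subscript, which has the same effect
    # as padding it with '*' wildcards.
    return all(x == y or "*" in (x, y) for x, y in zip(sub_a, sub_b))


print(connectors_match("Js-", "Jp+"))        # singular vs. plural
print(connectors_match("J***v+", "J***v-"))  # both carry the v subscript
print(connectors_match("J***v+", "J***x-"))  # x blocks the connection
print(connectors_match("J***v+", "Js-"))     # a short subscript is not blocked
```

The last call shows why the suggestion is to give all other J connectors an explicit `x` in that position: a plain `Js-` has nothing in the fourth slot, so under these matching rules it would still connect to `J***v+`.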

linas commented 4 years ago

I'll try that experiment now.

Heh. Already noted as a problem: https://github.com/opencog/link-grammar/blob/f012bfb1ee4111ae784f832daa55e5097cbe378e/data/en/4.0.dict.m4#L132-L141 which suggests that a good fix won't be easy to find...

ampli commented 4 years ago

In general, I don't really like the global context, because, in general, whether something is allowed or not depends on the local context. The above provides an example of local context: was the (R+ & B+) in a disjunct with J- or with O- or with something else?

I used the following rule: <fxlink>: [MVp+]-1.65 & R+ & Bs+; which means "insert MVp before any R+ & Bs+". In principle I can extend this to something like: <fxlink>: MVp+ & J- & R+ & Bs+; where the second term (J-) means "not containing this term in the opposite jet". In my current implementation, adding this would be awkward, since I used "connector stream editing", meaning that I insert the needed connectors on the fly (in this case MVp-) while reading connectors from the dict file (in addition I also use disjunct editing to add MVp+ as a shallow connector on the opposite jet, because this cannot be done, in general, by dict editing).
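As a rough illustration of what such a restricted rule would have to check per disjunct, here is a toy Python predicate. It is only a sketch under stated assumptions: the function names are hypothetical, connectors are compared by their uppercase part only, and the disjuncts are simplified from the diagrams earlier in this thread (cake: Ds**c-, Os-, R+, Bs+; exertion: A-, Ju-, R+, Bsd+).

```python
def uppercase_part(name: str) -> str:
    """Return the leading uppercase letters of a connector name, e.g. 'Bsd+' -> 'B'."""
    name = name.rstrip("+-")
    i = 0
    while i < len(name) and name[i].isupper():
        i += 1
    return name[:i]


def may_cross(left, right, blocked_left=("J",)) -> bool:
    """Allow the crossing only for disjuncts that carry R+ and B+ on the right
    and have no blocked connector type (here J-) on the opposite jet."""
    right_types = [uppercase_part(c) for c in right]
    has_relative = "R" in right_types and "B" in right_types
    is_blocked = any(uppercase_part(c) in blocked_left for c in left)
    return has_relative and not is_blocked


cake = (("Ds**c-", "Os-"), ("R+", "Bs+"))
exertion = (("A-", "Ju-"), ("R+", "Bsd+"))
print(may_cross(*cake), may_cross(*exertion))
```

A check like this needs the whole disjunct in hand, which is why it fits disjunct editing more naturally than editing the connector stream while the dict is being read.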

I used "connector stream editing" because it makes the insertion once per macro, regardless of how many times the macro is used. Manipulating whole expressions needs many times more matching/insertion operations.

So in order to implement more restrictive rules I can change my implementation to disjunct editing only (a simple change, but with more overhead since it is done per sentence and not mostly on dict read).

BTW, using this rule (as the only rule) generates a lot of disjuncts that are not getting pruned. This causes a slowdown of several percent on the basic and fixes batch benchmarks, but ~30% on the failures benchmark! So using several such rules would cause a significant slowdown. Maybe adding a special option for "allow cross links" would be a solution for that. I will also take a second look at my old pruning WIPs, in which I encountered fundamental problems, since I think that by now I have found solutions to these problems. I hope that more aggressive pruning will alleviate the slowdown caused by increased dict complexity.

For now I will just submit the non-related improvements that I did in the code that I touched in this WIP.