opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

echo "storeDiagramString:true, text: But as usual,we couldn't make it stick." | nc localhost 9000 #743

Closed cdmalcl closed 6 years ago

cdmalcl commented 6 years ago
    +----------------------------Xp----------------------------+
    +------------------------->WV------------------------>+    |
    |        +--------------->WV--------------->+---I*j---+    |
    +-->Wc---+------Wdc-----+-----Ss----+---I---+-Osm+    |    +--RW--+
    |        |              |           |       |    |    |    |      |
LEFT-WALL but.ij [as] usual,we[?].n couldn't make.v it stick.v . RIGHT-WALL

usual,we

ampli commented 6 years ago

With the current setup, most punctuation that has no whitespace before or after it is not tokenized, as in `usual,we`.

However, you can experiment with tokenizing them too. Just add them to the MPUNC definition in `en/4.0.affix`.

Current definition:

    -- ‒ – — ― "(" "[": MPUNC+;

You can modify it to (adding `...` and `,`):

    -- ‒ – — ― "(" "[" ... ,: MPUNC+;

Note that some tokens need quoting if you add them there, e.g. `":"`.
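For a quick check (assuming a local build whose `link-parser` picks up the modified `en/4.0.affix`; the command is just an illustration), something like the following should then show `usual,we` split into three tokens:

    echo "But as usual,we couldn't make it stick." | link-parser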

linas commented 6 years ago

I just added ellipses, commas, and semicolons to the MPUNC list; this will be in version 5.5.0 later today.

Closing.

ampli commented 6 years ago

What is the reason for including `[` in MPUNC (a previous change) but not also `]`? (Similarly for `(` and `)`.) See, for example, the following from `en/corpus-fixes.batch`:

We looked for 3-Amino-3-azabicyclo[3.3.0]octane hydrochloride

                                          +--------------------MX--------------------+
    +----->WV---->+                       |             +-------------Xd-------------+
    +->Wd--+--Sp--+--MVp--+-------J-------+             |         +---------A--------+------Xc------+
    |      |      |       |               |             |         |                  |              |
LEFT-WALL we looked.v-d for.p 3-Amino-3-azabicyclo[!].n [ 3.3.0]octane[?].a hydrochloride[!].n RIGHT-WALL

(In this particular case, however, the first parse, which uses the alternative of not splitting the word, may be better.)

ampli commented 6 years ago

Middle-splitting on `,` has implications for numbers with commas, which may be undesired. For example (from `corpus-biolg.batch`): "The enzyme has a weight of 125,000 to 130,000". In addition to the previous parses, like:

    +-------->WV-------->+
    +----->Wd-----+      +----Os---+     +------MVp------+
    |      +-Ds**v+-Ss*s-+   +Ds**c+--Mf-+      +--NIfn--+---NItn--+
    |      |      |      |   |     |     |      |        |         |
LEFT-WALL the enzyme.s has.v a weight.s of 125,000[!] to.j-ru 130,000[!]

we now get also:

    +-------------------------------Xx-------------------------------+
    +-------->WV-------->+                                           |
    +----->Wd-----+      +----Os---+     +------MVp------+           |
    |      +-Ds**v+-Ss*s-+   +Ds**c+--Mf-+      +--NIfn--+--NItn-+   +>Wa-+
    |      |      |      |   |     |     |      |        |       |   |    |
LEFT-WALL the enzyme.s has.v a weight.s of 125,000[!] to.j-ru 130[!] , 000[!]

which is incorrect.

If middle-splitting on `,` is still desired in general, I think this can mostly be solved by treating numbers with commas (and also with spaces!) as variable-length idioms (there is no syntax for that yet, but they can be crafted by hand-written definitions).

The problem of an exponential increase in the number of parses (2**(number of splits), so e.g. 5 such splits already yield 32 alternatives) will still remain even then, unless we find a way to represent parses compactly (you wrote about that need in another context).

Also note that forbidding splits for tokens that match regexes was abandoned long ago in favor of creating alternatives. Returning to the old way would rule out parsing of many useful constructs. What we need is a mechanism for alternative costs that are independent of the sentence cost and have a relative cut-off. I have encountered this problem in many yet-unsolved LG problems, including spell correction of words with capital letters, and recently when I tried to investigate corrections like `it's.#its: its;` that over-correct.

linas commented 6 years ago

Ouch. I added only "[" and not "]" because I was looking at Project Gutenberg texts, which have footnote and reference constructions, e.g. "Studies show that this happens[32]." where the [32] is some reference. I will add closing square brackets now.

The addition of the comma is the law of unintended consequences. Perhaps I should remove the comma, for now.

I don't see any easy, obvious solution. The complex solutions all seem to be problematic. We have distinct needs: bad punctuation, spelling errors, European morphology, Hebrew morphology, other (e.g. Turkish) morphology. One unified system for all this is possibly not enough.

For spelling errors, single-word substitution is a short-term patch; I did it because it's an easy, cheap stunt. The correct long-term approach is some kind of transformative grammar, where we can mutate sentences from unusual forms into "standard" forms. Some of that mutation is related to bad spelling and bad grammar.

ampli commented 6 years ago

> The addition of the comma is the law of unintended consequences. Perhaps I should remove the comma, for now.

Most probably.

> I don't see any easy, obvious solution. The complex solutions all seem to be problematic. We have distinct needs: bad punctuation, spelling errors, European morphology, Hebrew morphology, other (e.g. Turkish) morphology. One unified system for all this is possibly not enough.

Many, if not all, of these things can be done with the current LG, if:

- definition and lookup limitations are removed (e.g. extending idiom definitions to allow `<a>_<b>`),
- constructions and definitions be made orthogonal (so they will not interfere with each other),
- ambiguity get removed (e.g. regex label on input),
- explicit rules be used everywhere instead of implicit hard-coded ones (examples of implicit hard-coded rules are not applying regex to tokens in the dict, stopping on the first matching regex, etc.; all of that always has bad implications),
- token splitting rules be defined (or better, derived from the dict) and not hard coded, etc.

An additional important improvement is a general way to detect "virtual morphemes", like (but not limited to) phonology and capitalization, and to use appropriate disjuncts for them, so that everything can be done by the dict, without a need to separate words into different files etc. (like a/an, which can be very complex for complex phonology) or to add all capitalized words with their own rules. Such "virtual morphemes" do exactly what such separations do; they are just practical shortcuts that are easy to define and modify (I'm working on that).
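As a rough illustration of the idea (hypothetical connector and macro names, not the current dict contents), the a/an case could look roughly like this, with a virtual morpheme supplying the matching phonology connector on the following word:

    % hypothetical sketch, not the actual en/4.0.dict:
    a:  PHc+ & <det-disjuncts>;   % wants a consonant-initial word to its right
    an: PHv+ & <det-disjuncts>;   % wants a vowel-initial word to its right
    % a "virtual morpheme" detected at tokenization time would give the
    % following word a matching PHc- or PHv-, so no per-word lists or
    % separate word files are needed.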

> One unified system for all this is possibly not enough.

I think it is definitely possible.

ampli commented 6 years ago

I fixed some bad formatting in my previous message, so please read it on GitHub...

linas commented 6 years ago

> if definition and lookup limitations are removed (e.g. extending idiom definitions to allow _),

OK, that's done now. It's not obviously terribly useful, unless you have some good example that you haven't told me (or that I've forgotten).

> constructions and definitions be made orthogonal (so they will not interfere with each other),

Can you provide examples? For this, and each of the points below, maybe you could open an issue with a description of the problem, an example, and a proposed fix?

> ambiguity get removed (e.g. regex label on input),

Again, not sure what that means.

> explicit rules be used everywhere instead of implicit hard-coded ones (examples of implicit hard-coded rules are not applying regex to tokens in the dict,

What would that do, and how would that help?

> stopping on the first matching regex, etc., all of that always has bad implications),

OK. Yes, that seemed useful at the time, but perhaps it has outgrown its utility.

> token splitting rules be defined (or better, derived from the dict) and not hard coded, etc.

Yes. Well, REGPARTS/REGMID/etc. was an attempt to do that; I'm not convinced it ever worked very well, but also, we've just barely ever used it. Or are you referring to something else, some replacement, some different way of doing things?

I like the general sentiment; please do this. I'm flat out of bright ideas at this particular moment.

ampli commented 6 years ago

> > if definition and lookup limitations are removed (e.g. extending idiom definitions to allow _),

> OK, that's done now. It's not obviously terribly useful, unless you have some good example that you haven't told me (or that I've forgotten).

I had a markdown problem in my post and thought I had corrected it, but the post still contained it. I tried again to fix it, and now the format is OK.

Anyway, here is the corrected line (I forgot to use backquotes for `<a>_<b>`):

(e.g. extending idiom definitions to allow `<a>_<b>`)

Indeed, I never updated you about a possible use of such constructs. I tried to use them to handle the case of collocations with holes. I will open an issue to describe the test I did.

> > ambiguity get removed (e.g. regex label on input),

> Again, not sure what that means.

This doesn't parse:

UNITS is a regex label used for units.
LEFT-WALL [UNITS] is.v a regex.n label.n used.v-d for.p units.n .

This parses fine:

UNITSX (X removed)  is a regex label used for units.
LEFT-WALL UNITSX[!] ( x.n removed.v-d ) is.v a regex.n label.n used.v-d for.p units.n .
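Presumably the collision is that a regex class name like UNITS is itself an ordinary dictionary entry; roughly (an illustrative sketch, not the actual `en/4.0.dict` contents):

    % the class name UNITS has its own dict entry, carrying the disjuncts
    % meant for unit suffixes of numbers, e.g. something like:
    UNITS: <unit-suffix-disjuncts>;
    % so an input word spelled exactly "UNITS" is looked up as this entry,
    % gets only unit-suffix disjuncts, cannot link as an ordinary noun, and
    % ends up as the null-linked [UNITS] in the first diagram above.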

I can open an issue with my proposal for solving it.

> > stopping on the first matching regex, etc., all of that always has bad implications),

> OK. Yes, that seemed useful at the time, but perhaps it has outgrown its utility.

This is one of the many things that are not "orthogonal": if you want to add a regex, as I did in my test, it interferes with the rest of the regexes.

> > token splitting rules be defined (or better, derived from the dict) and not hard coded, etc.

> Yes. Well, REGPARTS/REGMID/etc. was an attempt to do that; I'm not convinced it ever worked very well, but also, we've just barely ever used it.

Not exactly. Its only intended use is for `amy` and similar languages, to denote tokens that should not be produced by random splitting.

> Or are you referring to something else, some replacement, some different way of doing things?

Yes. Say we add the ability to denote regex tokens in the xPUNC definitions. For example: `-- ‒ – — ― "(" "[" ... "," ";" /(,)[^0-9]/: MPUNC+;` (a POSIX regex can return the matching group, which can serve as the actual split.)
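To make the intended semantics concrete (a sketch of the proposal, not existing syntax): the captured group is the token that gets split off, so a comma followed by a non-digit splits, while a comma inside a number does not:

    % proposed, illustrative only:
    -- ‒ – — ― "(" "[" "]" ... ";" /(,)[^0-9]/: MPUNC+;
    %   "usual,we"  ->  usual , we    (comma followed by a letter: split)
    %   "125,000"   ->  125,000       (comma followed by a digit: no split)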

> I like the general sentiment; please do this. I'm flat out of bright ideas at this particular moment.

I have a list of things to implement. I can try to open issues on each of them. But I will need your input on each...

linas commented 6 years ago

> UNITS is a regex label used for units.

Yes, please.

> `/(,)[^0-9]/: MPUNC+;`

Oh, that's interesting! Yeah, I guess I like that.

> But I will need your input on each...

I can try to give it. I only have a finite amount of ability to read, focus and respond, and am already operating at the limit :-)

linas commented 6 years ago

p.s. when should I publish version 5.5.0?

ampli commented 6 years ago

Just now, if you are willing to have the rand fix in the next release only; this is because I will not be able to look at it until tomorrow. Or maybe just wait 24 hours and include this fix too (I suppose I will be able to fix the problem).