opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Parsing numbers with space-delimited digit groups #754

Open ampli opened 6 years ago

ampli commented 6 years ago

The following test works (though it needs a definition modification before it can actually be used):

4.0.regex:

NUMBER-GRP-g0: /^[1-9][0-9]?$/
NUMBER-GRP:    /^[0-9]{3}$/

4.0.dict:

1.g0 NUMBER-GRP-g0: ZZNS+ or NUMBERS;
NUMBER-GRP: {ZZNS- or ZZN-} & ZZN+;
NUMBER-GRP.num: (ZZNS- or ZZN-) & NUMBERS;

% Adding NUMBER-GRP-g0 to existing definitions.

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31:
NUMBERS or TM- or [[G+]] or NUMBER-GRP-g0;

32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99:
  NUMBERS or <date-id> or [[G+]] or NUMBER-GRP-g0;

Example:

linkparser> It costs 10 100 000 dollars
Found 12 linkages (6 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=10)

    +---->WV--->+------------------------Op-----------------------+
    +->Wd--+-Ss-+     +---ZZNS--+--------ZZN-------+------NIn-----+
    |      |    |     |         |                  |              |
LEFT-WALL it costs.v 10 100[!NUMBER-GRP] 000[!NUMBER-GRP].num dollars.c

It cannot be used as is because the regexes above appear before YEAR-DATE. This can be fixed by merging YEAR-DATE into the definitions above.

BTW, I don't like this YEAR-DATE usage:

Found 12 linkages (6 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)

    +---->WV--->+-----------Op-----------+
    +->Wd--+-Ss-+           +-----NIn----+
    |      |    |           |            |
LEFT-WALL it costs.v 300[!YEAR-DATE] dollars.c

I think this should be fixed. It is not trivial (a library infrastructure change is needed), but it would also solve many other problems (I have started compiling a list of them).

Regarding the test that is the subject of this post, I first tried to implement the following shortcut, but it didn't work (note the use of <2-31>):

1.g0 <2-31>.g0 NUMBER-GRP-g0: ZZNS+ or NUMBERS;

<2-31>: NUMBERS or TM- or [[G+]] ;
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31: <2-31>;

The problem seems to be that a lookup of <2-31> while reading the dict, for some reason, doesn't also fetch <2-31>.g0 (OTOH, using !!<2-31> does fetch <2-31>.g0 as well). I don't know yet whether this can be fixed.

This also doesn't work:

NUMBER-GRP.g0: /^[1-9][0-9]?$/

The reason is that lookup of a subscripted RegEx label is not supported. (RegEx labels have another problem: they cannot occur in sentences. Of course, this too can be fixed.)

I think limitations like the above unnecessarily restrict what can be done. I would like to remove such limitations (there are more) while keeping compatibility.

ampli commented 6 years ago

It turns out this currently has other problems, such as producing unneeded near-duplicate parses, whose number grows exponentially with the number of words that are duplicated this way:

      +------Op-----+
 +-Ss-+      +--NIn-+
 |    |      |      |
it costs.v 1.g0 dollar.c

Press RETURN for the next linkage.
linkparser> 
    Linkage 6, cost vector = (UNUSED=0 DIS= 3.00 LEN=5)

      +----Op----+
 +-Ss-+    +-NIm-+
 |    |    |     |
it costs.v 1 dollar.c

This can of course be fixed, but an extension is needed (e.g. token priority postprocessing).
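
For concreteness, here is a rough sketch (in Python, purely hypothetical, not an existing library feature) of what such a token-priority postprocessing step could look like: among linkages that differ only in which variant of a word was used, keep the one that uses the preferred variant (here, the .g0 subscript).

# Hypothetical sketch of "token priority postprocessing": given the word
# sequences of the produced linkages, prefer the ".g0" token variants and
# drop duplicate linkages that use the plain variants of the same words.

def priority(word):
    return 1 if word.endswith('.g0') else 0    # higher = preferred

def strip_subscript(word):
    return word.split('.')[0]

def filter_by_token_priority(linkages):
    best = {}
    for words in linkages:
        key = tuple(strip_subscript(w) for w in words)
        score = sum(priority(w) for w in words)
        if key not in best or score > best[key][0]:
            best[key] = (score, words)
    return [words for _, words in best.values()]

linkages = [
    ['it', 'costs.v', '1.g0', 'dollar.c'],
    ['it', 'costs.v', '1',    'dollar.c'],     # unneeded duplicate
]
print(filter_by_token_priority(linkages))      # keeps only the '.g0' linkage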

linas commented 6 years ago

The problem with an exponential number of alternatives occurs in a large variety of places. It is very tedious to fix those cases by hand; this is one reason why unsupervised language learning is interesting.

I don't understand what the point of NUMBER-GRP-g0 is. You seem to be using it inconsistently in the above.

There is nothing special about the use of the angle-brackets. I don't understand why you want to have them subscripted. Is it to allow multiple definitions?

In the language-learning code, it seems very unlikely that we will be using subscripts in any meaningful way. I don't see much point in investing a lot of time in them.

linas commented 6 years ago

It does seem useful, however, to have some kind of formalized re-writing or transformation system. Thus, for example, 10 000 could be re-written to not have a space, or 10,000 could be re-written to not have a comma. This resembles the automatic spelling-correction effort, except that it would now be applied across multiple tokens.
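
As a rough illustration of the idea (a sketch only, in Python, outside the library; the regexes are my own guesses at reasonable patterns), such a multi-token rewrite might do something like this before the tokens ever reach the parser:

import re

# Hypothetical pre-parser rewrite: "10 000" and "10,000" both become "10000".
def normalize_number(text):
    text = re.sub(r'(?<=\d),(?=\d{3}\b)', '', text)   # drop thousands commas
    text = re.sub(r'(?<=\d) (?=\d{3}\b)', '', text)   # drop thousands spaces
    return text

print(normalize_number("It costs 10 000 dollars"))    # It costs 10000 dollars
print(normalize_number("It costs 10,000 dollars"))    # It costs 10000 dollars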

ampli commented 6 years ago

It is very tedious to fix those cases by hand;

Yes. For my example above I have already started thinking about how to prevent these duplicates...

I don't understand what the point of NUMBER-GRP-g0 is. You seem to be using it inconsistently in the above.

The first digit group may consist of only 1 or 2 digits; this is what NUMBER-GRP-g0 matches. NUMBER-GRP matches any 3-digit group, including a 3-digit first group.
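
For illustration, a small Python check (not part of the library) of what the two regexes from 4.0.regex above accept:

import re

NUMBER_GRP_G0 = re.compile(r'^[1-9][0-9]?$')   # first group: 1 or 2 digits, no leading zero
NUMBER_GRP    = re.compile(r'^[0-9]{3}$')      # any other group: exactly 3 digits

for tok in ['4', '10', '100', '000', '1000']:
    print(tok, bool(NUMBER_GRP_G0.match(tok)), bool(NUMBER_GRP.match(tok)))
# 4    True  False
# 10   True  False
# 100  False True
# 000  False True
# 1000 False False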

There is nothing special about the use of the angle-brackets. I don't understand why you want to have them subscripted. Is it to allow multiple definitions?

Yes. Otherwise, multiple definitions are not allowed. Such multiple definitions are needed in order to simplify the dict and make it more readable and manageable, as I demonstrated here (currently not working as intended):

1.g0 <2-31>.g0 NUMBER-GRP-g0: ZZNS+ or NUMBERS;

<2-31>: NUMBERS or TM- or [[G+]] ;
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31: <2-31>;

The alternatives are not so appealing; here is a particularly bad one:

1.g0 2.g0 3.g0 ... 30.g0 31.g0 NUMBER-GRP-g0: ZZNS+ or NUMBERS;

(where all the remaining numbers would have to be written out in place of the ... above).

In the language-learning code, it seems very unlikely that we will be using subscripts in any meaningful way. I don't see much point in investing a lot of time in them.

If you are sufficiently confident that the language-learning code will soon generate a dict that outperforms a manually written one for arbitrary languages (including a dict manually written in a more expressive syntax than now, and including word morphology, numbers, symbol and punctuation splitting, etc.), then maybe there is not much point in making the current library more sophisticated.

However, I cannot see how a more sophisticated, yet still compatible, library could cause any damage, especially since it is hard to know how useful things are before they are implemented. For example, a feedback infrastructure can still be useful, and tokenization only according to the dict info seems to me essential (both are on my to-do list).

ampli commented 6 years ago

Thus, for example, 10 000 could be re-written to not have a space, or 10,000 could be re-written to not have a comma.

Input preprocessing can always be done outside the library, and in much more sophisticated ways. But what you suggest would allow reconstructing the original tokens in the parsed output, or getting the output token positions in the unmodified original sentence.
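
To illustrate the difference, a minimal sketch (Python, hypothetical, outside the library) of what an external rewrite has to track in order not to lose the original tokens and positions:

import re

# Collapse space-delimited digit groups, but record each collapsed token's
# character span in the original sentence so that the original wording and
# positions can still be reported after parsing.
GROUPED_NUMBER = re.compile(r'\b[1-9]\d{0,2}(?: \d{3})+\b')

def collapse_with_spans(sentence):
    spans = []                        # (start, end) in the original sentence
    def repl(m):
        spans.append((m.start(), m.end()))
        return m.group(0).replace(' ', '')
    return GROUPED_NUMBER.sub(repl, sentence), spans

original = "It costs 10 100 000 dollars"
rewritten, spans = collapse_with_spans(original)
print(rewritten)                      # It costs 10100000 dollars
print(original[slice(*spans[0])])     # 10 100 000

Doing the rewrite inside the library would make this kind of bookkeeping unnecessary for every caller.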

linas commented 6 years ago

If you are sufficiently confident that the language-learning code will soon generate a dict that outperforms a manually written one for arbitrary languages

I wish. Not soon. I am being interrupted at far too high a rate to be able to concentrate and get anything done. I have not been able to get anything done for half a year now :-(

(including a dict manually written in a more expressive syntax than now,

No, the syntax is the same as now. I'm using the sqlite3 DB, which is more or less the same as the ASCII version.

and including word morphology, numbers, symbol and punctuation splitting, etc.)

Morphology of words is very different from automatically learning what a generic number is. I've started with morphology, but have given no thought to numbers.

then maybe there is not much point in making the current library more sophisticated.

Maybe there is a point. Sorry. I'm constantly struggling to figure out how to avoid wasting my time on things that are not important.

tokenization only according to the dict info seems to me essential.

Yeah. Open an issue here that describes what you are thinking.

linas commented 6 years ago

Thus, for example, 10 000 could be re-written to not have a space, or 10,000 could be re-written to not have a comma.

Input preprocessing can always be done outside the library, and in much more sophisticated ways. But what you suggest would allow reconstructing the original tokens in the parsed output, or getting the output token positions in the unmodified original sentence.

I'm not sure what I am suggesting. Let me try a different example. See https://en.wikipedia.org/wiki/Operator_grammar and the example therein:

 1.   John wears boots; the boots are of leather (two sentences joined by semicolon operator) →
 2.   John wears boots which are of leather (reduction of repeated noun to relative pronoun) →
 3.   John wears boots of leather (omission of high likelihood phrase "which are") →
 4.   John wears leather boots (omission of high likelihood operator "of", transposition of short modifier to left of noun)

I would like to be able to automatically move both backwards and forwards between 1, 2, 3, and 4. The hope is to do this by first creating some very abstract representation of the general idea, and then being able to generate various different sentences that say the same thing.

My "plan" (intention) for accomplishing the above involves doing all this abstraction in the atomspace, somehow, which provides very generic (although very heavy-weight) tools for working with generic graph representations. I have not thought through the details of how to do this, other than having a general feeling for how to do it.

The numbers-with-spaces, with-commas, and no-punctuation example is similar, and yet somehow much easier. But it also forces a sequence of unpleasant questions: how much "normalization" of numbers should be done before parsing, vs. how much should be done after? Or should it somehow be done "during parsing"? But if we can do this for numbers, can't we also do this for John's leather boots? How should it fit together?

I don't have any particular answers to any of this. I know what I would do, if I had five lifetimes to spend on this problem, but I don't, so...

ampli commented 6 years ago

(including a dict manually written in a more expressive syntax than now,

I mean that if the dict syntax becomes more expressive (as I want to make it), it will be much easier to hand-define various languages, and much harder to outperform that with automatic language learning.

Morphology of words is very different from automatically learning what a generic number is. I've started with morphology, but have given no thought to numbers.

Can the current learning code deduce the need to strip punctuation? Or that lowercase letters are very similar to uppercase ones?

linas commented 6 years ago

(including a dict manually written in a more expressive syntax than now,

I mean that if the dict syntax becomes more expressive (as I want to make it), it will be much easier to hand-define various languages, and much harder to outperform that with automatic language learning.

I can't tell what you are thinking of, and I don't see any particular way of making this possible. The current English dicts are quite the mess, and it's hard to see how that can be cleaned up by making the dict syntax "more expressive".

Anyway, I am also convinced that every human understands language differently; what 4.0.dict encodes is some distorted average of what other humans think that language is.

Morphology of words is very different from automatically learning what a generic number is. I've started with morphology, but have given no thought to numbers.

Can the current learning code deduce the need to strip punctuation?

No. That would be hand-coded in the affix dicts, as it is today.

Or that lowercase letters are very similar to uppercase ones?

It already assigns lower-case and upper-case words to the same classes. It does NOT do any kind of string processing to detect similarities among the strings.

ampli commented 6 years ago

I don't have any particular answers to any of this. I know what I would do, if I had five lifetimes to spend on this problem, but I don't, so...

Of course, help from many people is needed for all of that. To that end, many programmers need to get involved.

My hope is that if we succeed in making the current library much more sophisticated (so that, for example, it is useful for English grammar error correction and for translation), and I write a vim plugin for it, and provide good Hebrew support as proof that it is useful for complex languages, and more, then many more people will get involved.