opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Idioms #1200

Open ampli opened 3 years ago

ampli commented 3 years ago

In PR #1194 I said:

There are some more things to discuss on idioms, to be done here or in a new issue.

Currently, a word with underbars is considered to be an idiom. Because I once wanted to enable debugging of idioms using !!word*, I modified the dictionary reading to also add the idiom dictionary strings as they are.

I added code to provide a way to define a non-idiom word with underbars, so that, for example, word1_word2 would be recognized only in this form (and not with whitespace instead of underbars). The idea was to allow backslash-escaping of underbars (in the dict) so they would not be recognized as idiom word delimiters. But this code is incomplete, as it doesn't remove the backslashes, so it is not useful.

I have a fix for that (~20 LOCs). It can load a very big list of words with escaped underbars in a fraction of the time it takes to load the same words as idioms. This is not very important: the method of backslash-escaping has limitations, and the idiom code needs a revision in any case. Still, I would like to send a PR for it.
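The escaping rule described above can be illustrated with a minimal sketch. This is not the link-grammar dictionary reader (which is written in C); `split_dict_word` and `is_idiom` are hypothetical names, and the sketch assumes that a backslash makes the following character literal and is itself dropped.

```python
def split_dict_word(word: str) -> list[str]:
    """Split a dictionary word on unescaped underbars (hypothetical helper).

    A backslash makes the next character literal; the backslash itself is
    removed, which is the step the incomplete code was missing.
    """
    parts = [""]
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\" and i + 1 < len(word):
            parts[-1] += word[i + 1]  # keep the escaped character, drop the backslash
            i += 2
        elif ch == "_":
            parts.append("")          # unescaped underbar: idiom word delimiter
            i += 1
        else:
            parts[-1] += ch
            i += 1
    return parts


def is_idiom(word: str) -> bool:
    """Assumption: a word is an idiom if it has more than one part."""
    return len(split_dict_word(word)) > 1
```

With this rule, `same_as` splits into two idiom words, while `word1\_word2` stays a single word containing a literal underbar.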

EDIT: Please see my 3rd post here for maybe a better proposal.

ampli commented 3 years ago

Another problem that is related to idioms is their dict notation: Connectors that start with ID are reserved for idioms. This causes a problem for automatic dict generation in which you need to skip using these letters when generating connector names. In addition, the subscript .I is reserved for idioms (previously only when numbered, but now only when unnumbered...).

To solve both of these problems and yet another one, I propose to use an initial underbar. The idea is to allow it in a connector base name (in addition to uppercase letters). The connectors for idiom words would then start with an underbar, e.g. _ID... or maybe better _I.... As with variable symbols, an initial underbar will be reserved for "system use". A similar solution can be applied for idiom word subscripts, i.e. to use ._I.

This proposal would solve two other things:

  1. In language learning dictionaries, when there is a need to compose connector names, there is no natural character to serve as a separator. An underbar can serve as such a separator.
  2. In the vn dict an underbar was used in the connector base names as a prefix separator. In order to make this dict readable, I have replaced it with the letter U, which made the names somewhat unreadable. With this proposal, the original underbars could be restored.

The only drawback of this proposal (that I can think of) is that it is not backward compatible for someone who uses a program that analyzes link names (and searches for ID), or who tries to understand old code using new link-grammar documents (unless they mention the old conventions).
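A minimal sketch of the naming collision and of how a reserved leading underbar would avoid it. None of this is link-grammar code: `generated_names`, `is_idiom_connector` and `compose` are hypothetical helpers, and the `_I` prefix is just one of the spellings suggested above.

```python
from itertools import count, islice


def generated_names():
    """Yield A, B, ..., Z, AA, AB, ... as an automatic dict generator might."""
    for n in count(1):
        s = ""
        while n:
            n, r = divmod(n - 1, 26)
            s = chr(ord("A") + r) + s
        yield s


def is_idiom_connector(name: str, proposed: bool) -> bool:
    """Current convention: 'ID' prefix. Proposed (assumed): '_I' prefix."""
    return name.startswith("_I" if proposed else "ID")


def compose(*pieces: str) -> str:
    """Join connector name pieces with an underbar separator (proposed use)."""
    return "_".join(pieces)


names = list(islice(generated_names(), 300))
# Generated names that would collide with idiom connectors today.
clashes_now = [n for n in names if is_idiom_connector(n, proposed=False)]
# Purely alphabetic names can never start with an underbar.
clashes_proposed = [n for n in names if is_idiom_connector(n, proposed=True)]
```

Under the current convention a generator has to skip names such as `ID`; with a reserved leading underbar, names built only from uppercase letters cannot collide, and `compose("AB", "CD")` gives an unambiguous `AB_CD`.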

ampli commented 3 years ago

The idea was to allow backslash-escaping of underbars (in the dict) so they would not be recognized as idiom word delimiters

The dict syntax already contains a way to escape special-meaning characters: enclosing them in double quotes. However, the idiom code doesn't make use of it. Instead of my proposal to enable backslash escapes, it is possible to modify the dict handling code to honor double quoting also for underbars. The implementation detail is that unquoted underbars would be internally converted to '\x02' (^B). This would solve problems that my first proposal cannot solve.
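The quoting idea can be sketched as follows, assuming that only whole-token double quoting is recognized. `encode_dict_token` and `idiom_parts` are hypothetical names for illustration, not functions of the dictionary reader.

```python
IDIOM_SEP = "\x02"  # ^B, used internally as the idiom word delimiter


def encode_dict_token(token: str) -> str:
    """Convert a dict token to its internal form (hypothetical helper).

    A fully double-quoted token keeps its underbars as literal characters;
    in an unquoted token every underbar becomes the internal delimiter.
    """
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]
    return token.replace("_", IDIOM_SEP)


def idiom_parts(encoded: str) -> list[str]:
    """Split an internal-form word into its idiom words."""
    return encoded.split(IDIOM_SEP)
```

After encoding, a literal underbar and an idiom delimiter are different characters, so the idiom code no longer needs to know anything about quoting.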

linas commented 3 years ago

double quotes

Yes, I guess that is better.

The connectors for idiom words would then start with an underbar, e.g. _I...

Yeah, I think I like this idea.

ampli commented 3 years ago

double quotes

I started to implement general character quoting using double quotes, but then realized it is not good enough because only full-word double-quoting is allowed. Allowing double-quoting of parts of words would be a mess. So I will return to backslash quoting.

ampli commented 3 years ago

The benefits of allowing escaping special characters in the dict:

  1. It allows defining any string as a word (even a\.b when .b is not a subscript).
  2. If implemented correctly, it allows any input. For example, <...> in the input will never match a macro.
  3. When I started to extend the !! command, I changed the dict reading code to insert idioms into the dict. But this has the unintended side effect of accepting the idiom in the input also in its underbar form (and now it seems somebody uses this misfeature). Implementing underbar escaping will allow defining words with underbars without confusing them with idioms, even if idioms are inserted into the dict.

Since such escaping is not needed just now, I will stop the work on it until it is needed for something.

I started to think about that when I tried to handle this line in the dict:

    % as.#same-as: [[the_same_as]0.05]colloquial;

The word combination same_as is defined as an idiom, but I guess that the string the_same_as consists of the 3 regular words and not the idiom and the word as. I changed its syntax to:

    as.#same-as: [[the,same,as]0.05]colloquial;

This way it is also possible to specify idioms if needed.

I have implemented the dict reading code of such constructs, and the next step would be to add it as an alternative. However, I stopped the work on this branch because I didn't know how to represent/display such a "fix", as it consists of multiple words and thus I cannot attach a subscript to it.

linas commented 3 years ago

The conventional way of representing such a substitution is to place square brackets around the added words:

This is the convention used in books and papers on linguistics. It's perhaps not quite suitable for LG, as it does not indicate the scope of the substitution. Thus, perhaps this would be better:

The coffee tastes {the same as.e-c}.#same-as it did last week

i.e.


    +--------->WV--------->+--------MVz-------+
    +----->Wd-----+        +----MVy-----+     +---------CV------>+-----MVpn-----+
    |      +-Ds**c+--Ss*s--+      +IDBWM+     +----Cc-------+-Ss-+       +--DTi-+
    |      |      |        |      |     |     |             |    |       |      |
LEFT-WALL the coffee.s tastes.v {the  same as.e-c}.#same-as it did.v-d last.a week.r

I picked curly braces because they are not otherwise commonly used in English text. There are also a bunch of oddball utf-8 parenthesis-like things, commonly seen only in Chinese and Japanese text. These: « 〈 ( 〔 《 【 [『 「 ... well, the first one is the quote mark in French...

So printing is not a problem. More difficult is "what's the API for this?" I don't think we ever came up with an API for tokenization, to say seasand was split into seas and .. so this is kind of like that, but at a different level in the processing.