opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Tokenization and morphology #700

Open linas opened 6 years ago

linas commented 6 years ago

An ongoing discussion of tokenization and its relation to morphology is happening in issue #42. It really should be happening here.

ampli commented 6 years ago

[Continuing the discussion from issue #42.]

> Hebrew: `LPUNC (WORD=+ | WORD= WORD)? RPUNC*`

First I need to make clear what these = marks mean there. Originally (in the Russian dict) they marked on which side you can join the tokens into which a word (sentence string) can be broken. In particular, you cannot join them on the side that has no = mark, and there cannot be white space after/before the = mark.

Now we would like to extend breaking words to more than two tokens.

One "natural" way to enable a word ABC to break into A B C is to have 'A=,=B=and=C'.

However, I found it more useful (for tokenization) to extend it in another way, of which the original one is a special case: let the = denote where you can cut the word. This process is done iteratively, one token at a time. And the above whitespace rule still applies.

Due to that, I marked the Hebrew "prefix tokens" as x= even though they can appear in the middle of a sentence string, because during the tokenization process they are "stripped off" one by one at the marked cut points.

(BTW, any character that cannot appear in a word may be used instead of this =, as long as the same one is used in the dict, and it may be useful to use several different such notations.)

In my posts I mentioned the idea of defining in the affix file something like PRENUM_MAX (I most probably used another name for it, but I would like to save time by not looking in old posts now). For Russian it is 1, for Hebrew it is 4 or 5 (now defined in the source code as 5), and for Hungarian maybe 2. The same idea applies to =x (at the end of words). Note that PRE in PRENUM is not short for a linguistic prefix, even though for some languages it can be the same.
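
A minimal sketch of how this iterative cut-point stripping might work (Python; the names strip_prefixes, dict_tokens and max_prefixes are hypothetical, and max_prefixes plays the role of the PRENUM_MAX-like limit just described):

```python
# Hypothetical sketch: iteratively strip marked prefix tokens from a word.
# A dict token "x=" means a cut is allowed right after "x"; the remainder is
# then processed the same way, up to max_prefixes cuts (a PRENUM_MAX-like limit).
# Checking the final remainder against the dict is left to the caller.

def strip_prefixes(word, dict_tokens, max_prefixes):
    """Yield candidate splits of `word` into [prefix..., remainder]."""
    yield [word]                          # the unsplit word is always a candidate
    if max_prefixes == 0:
        return
    for i in range(1, len(word)):
        prefix, rest = word[:i], word[i:]
        if prefix + "=" in dict_tokens:   # "prefix=" marks an allowed cut point
            for tail in strip_prefixes(rest, dict_tokens, max_prefixes - 1):
                yield [prefix + "="] + tail

# Toy example with made-up (transliterated) prefix tokens:
prefixes = {"w=", "b=", "h="}
for split in strip_prefixes("wbbit", prefixes, max_prefixes=5):
    print(split)
# -> ['wbbit'], ['w=', 'bbit'], ['w=', 'b=', 'bit'], ['w=', 'b=', 'b=', 'it']
```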

In particular, I mentioned that I don't like the stem mark .= because it seems to me like a kind of linguistic mark, while we are dealing with tokenization. In my scheme, stems can be denoted by x=.POS, where POS may be a subscript that denotes a stem.

It still may be that some tokens should be marked as =x= (just show me an example in which x= and =x are not enough, and we can think of the needed extension).

We have never talked about derivational vs. inflectional use of morphemes. For the purposes of the current LG, it seems to me there is no need to support breaking into derivational morphemes, because the dict (and even the current unsupervised language-learning idea) cannot deal with that, as their combinations don't necessarily have an exact predefined usage/meaning, by definition.

ampli commented 6 years ago

But it occurred to me (and I also posted on that in detail) that even these terms are not needed if we just mark the possible internal links somehow, because the tokenizer can then infer everything from the dict. As usual, for many of the things that I said, including this, I also wrote (before I said them...) a demo program to check them.

I actually tested that, using a "trivial" tokenizer, basically as follows (a rough sketch is given after the list):

  1. Call anysplit(), with a big number of possible splits (say >10).
  2. Discard splits for which not all the tokens are in the dict.
     - Handle the current = marks as expected (maybe not implemented in my demo; need to check).
     - Don't allow more than one token that doesn't include punctuation.
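
A rough sketch of that filter (Python; any_split, in_dict and has_punct are hypothetical stand-ins, not the real anysplit() API, and the handling of the = marks is omitted here):

```python
from itertools import combinations

# Hypothetical stand-in for anysplit(): enumerate ways of cutting a word
# into contiguous pieces, capped at max_splits cut points.
def any_split(word, max_splits=10):
    n = len(word)
    for k in range(min(max_splits, n - 1) + 1):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def plausible_splits(word, in_dict, has_punct):
    """Keep only splits whose tokens are all in the dict, with at most one
    token that does not include punctuation (step 2 above)."""
    for split in any_split(word):
        if not all(in_dict(tok) for tok in split):
            continue                              # some token is unknown
        if sum(1 for tok in split if not has_punct(tok)) > 1:
            continue                              # too many "word-like" tokens
        yield split
```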

The problem with this tokenizer was "extra splitting". For example, 10$ got split into 10 $. As a result, unparsable sentences got strange parses. If $ were marked as, e.g., $=, or the link between $ and NUMBER were marked as an "internal link", then this extra split could be avoided. Note that $ and NUMBER may have a link that is marked as both external and internal (or an OR of two link types) if supporting $ 10 (with a blank) is desired. When reading the dict, the tokens can be marked according to their possible type of linkage (but I have never tried that). Using link types is more flexible, but reading the dict and marking the words for tokenization can be tricky.
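
For illustration only (made-up dict fragments, not real dictionary syntax): if $ were listed only in a glued form such as $=, the free-standing split would already fail the "all tokens are in the dict" test from the sketch above:

```python
# Hypothetical illustration of the effect of marking "$" as "$=": a split
# that produces a bare "$" no longer passes the "all tokens in the dict"
# test, so the extra split of "10$" is rejected, while the glued form is
# still recognized as coming from a single sentence string.
dict_plain  = {"10", "$"}    # "$" allowed as an independent word
dict_marked = {"10", "$="}   # "$" only usable glued to a neighboring token

def all_in_dict(tokens, d):
    return all(tok in d for tok in tokens)

print(all_in_dict(["10", "$"], dict_plain))    # True  -> extra split slips through
print(all_in_dict(["10", "$"], dict_marked))   # False -> extra split rejected
print(all_in_dict(["10", "$="], dict_marked))  # True  -> glued split still allowed
```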

Such a simple implementation has a heavy overhead. An efficient implementation is much more complex.

EDIT: Fix a markdown error that caused a problem.

linas commented 6 years ago

I just added the following text to the README, it seems relevant:

Handling zero/phantom words as re-write rules.

A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, e.g. "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (e.g. "let's" vs. "let us", or "it's" vs. "it is").

Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move from SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (e.g. provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words, and, simultaneously guide the deep syntactic rewrites.

Another interesting possibility arises with regards to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single-words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.
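
As a very rough illustration of the pre-parse re-write step sketched in the README text above (toy regex rules, purely hypothetical, not something that exists in the parser):

```python
import re

# Toy pre-parse re-write pass: each rule maps a surface pattern to its
# "proper form" before the sentence reaches the tokenizer/parser.
# The rules below are illustrative only and are applied in order.
REWRITE_RULES = [
    (re.compile(r"\blets\b"), "let's"),     # missing-apostrophe typo
    (re.compile(r"\blet's\b"), "let us"),   # multi-word expansion
    (re.compile(r"\bit's\b"), "it is"),     # contraction expansion
]

def rewrite(sentence):
    for pattern, replacement in REWRITE_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(rewrite("lets go, it's late"))        # -> let us go, it is late
```

Whether such rules should run strictly before parsing, or be interleaved with it, is exactly the open question raised above.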

ampli commented 6 years ago

> This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words

There is a problem with that and the classic parsing: classic parsing is done by attempting to parse ranges of words. Even for perfectly correct sentences, the vast majority of these ranges are unparsable, but often they can be corrected by inserting some words...
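
A back-of-the-envelope sketch of why that matters (the n(n+1)/2 figure counts only contiguous word ranges; in LG's counting stage each range is typically revisited with many different connector combinations, so the actual numbers are far larger):

```python
# Number of contiguous word ranges in an n-word sentence: n*(n+1)//2.
# Most of them are not constituents of the final parse, so offering an
# "insert a word here" fix inside each unparsable range would generate
# spurious suggestions even for a perfectly correct sentence.
def num_ranges(n):
    return n * (n + 1) // 2

for n in (10, 20, 40):
    print(n, "words:", num_ranges(n), "contiguous ranges")
# 10 words: 55, 20 words: 210, 40 words: 820
```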

> The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.

On my to-do list is providing a feedback infrastructure.

linas commented 6 years ago

> often they can be corrected by inserting some words...

This has to somehow be done probabilistically -- one has to recognize likely words to insert, and not waste time trying all possibilities, most of which would be absurdly unlikely. Hmm. Like the punch-line of a joke is unexpected....

Solving this will require some kind of deep theoretical insight or inspiration. I've got only the vaguest glimmer right now.

ampli commented 6 years ago

Say we have a perfectly correct sentence. Suppose that the parser explores 10M ranges, and 99% of them are unparsable. So we have about 10M unparsable ranges, and in many of them a "likely word" will be able to make them parsable. We will get many fix suggestions for a sentence that is not even incorrect. The problem is that each parse of a limited word range doesn't see the whole picture.