opencog / link-grammar

The CMU Link Grammar natural language parser

splitting hyphenated, underlined words #560

Closed: linas closed this issue 6 years ago

linas commented 7 years ago

I'm having trouble configuring the any language to ... do what it does today, but also split up hyphenated words: e.g. to split this aaa-bbb ccc into five words: this aaa - bbb ccc. I set REGPRE, REGALT, and so on in various ways, but nothing would quite match correctly ...

ampli commented 7 years ago

Note that the current any/4.0.regex is broken (it is missing the enclosing [] needed for a proper character class in 2 places, and it requires at least 2 letters for ANY-WORD instead of also allowing one).

I propose restoring it to the previous version (before this last change), so that any is functional again:

ANY-WORD:  /^[[:alnum:]_'-]+$/
ANY-PUNCT:  /^[[:punct:]]+$/

The current splitter cannot split at the desired characters, since it doesn't have the concept of splitting on middle characters. Adding that ability has been proposed in #42:

... I would like to introduce an affix class WORDSEP, which will be a list of characters to be used as word separators, with blank being the default. Characters listed there will still be able to be listed in other affix classes and thus serve as tokens.

(BTW, it seems blank will still need special handling, to collapse it to one character and to keep it from being considered a token, unless additional syntax is added.)

I can implement this if desired.

For amy, splitting at "'" and "-" into 3 components seems to work, because REGMID is currently ".*", which accepts "'" and "-":

linkparser> I-m
...
    +------------ANY------------+
    |            +------LL------+
    |            |              |
LEFT-WALL I[!MOR-STEM].= =-m[!MOR-SUFF] 

Press RETURN for the next linkage.
linkparser> 
    Linkage 3, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

    +------------ANY------------+
    |            +------LL------+
    |            |              |
LEFT-WALL I-[!MOR-STEM].= =m[!MOR-SUFF] 

Press RETURN for the next linkage.
linkparser> 
    Linkage 4, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

    +-------------------ANY------------------+
    |           +------PL------+------LL-----+
    |           |              |             |
LEFT-WALL I=[!MOR-PREF] -[!MOR-STEM].= =m[!MOR-SUFF] 

(The same holds for "'".) However, the INFIX_MARK marking is now strange. What is the desired marking? (I guess that if WORDSEP is implemented, this marking problem will be solved automatically.) (The MOR-STEM in -[!MOR-STEM].= is not a problem, since it can be set to SEP, for example, by regex definitions; see the example for ady below.)

For ady this doesn't work, due to its limit of two splits (aaa-bbb requires a split into 3 parts). A partial workaround can be implemented without a source-code change by allowing 3 splits only if the middle morpheme is "'" or "-", as outlined here:

SIMPLE-STEM: /^[[:alnum:]_'-]+.=/
SEP: /^(-|').=$/
SIMPLE-SUFF: /[[:alnum:]_'-]+$/
"w|ts|wpts" : SANEMORPHISM+; # wpts added
"^-$" "^'$": REGMID+;

However, the result is even stranger than for amy.

To sum up: I can implement WORDSEP if this concept sounds fine. It will also solve the -- separation problem (see defect 28 in #50) and similar things.

linas commented 7 years ago

I tried this, and it almost works, but not quite: in 4.0.affix:

2: REGPARTS+;
2 2: REGALTS+;
"w|ts" : SANEMORPHISM+;

"[[:alnum:]]+-" : REGPRE+;
".+": REGSUF+;

in 4.0.regex:

ANY-WORD:  /^[[:alnum:]']+$/
ANY-PUNCT:  /^[[:punct:]]+$/
SIMPLE-STEM: /^[[:alnum:]]+-/
SIMPLE-SUFF: /[[:alnum:]]+$/

in 4.0.dict:

ANY-WORD SIMPLE-STEM SIMPLE-SUF: {@ANY-} & {@ANY+};

The hope was that SIMPLE-STEM aka REGPRE would always end with a dash. It does sometimes, but not always. Changing the dash to [_-] so that I can match a dash or an underbar results in regexes that fail to compile, and I don't understand why.

I was hoping that having the stem end in a dash would then allow regular affix processing to strip off the dash.

linas commented 7 years ago

Regarding WORDSEP: with regular whitespace, it's OK to collapse multiple whitespace characters down to one. Also, whitespace is never a token in itself. But for dashes, that would not be the case: multiple dashes cannot be collapsed, and dashes are tokens. So I don't understand the concept.

ampli commented 7 years ago

Regarding WORDSEP: with regular whitespace, it's OK to collapse multiple whitespace characters down to one. Also, whitespace is never a token in itself. But for dashes, that would not be the case: multiple dashes cannot be collapsed, and dashes are tokens. So I don't understand the concept.

The idea for WORDSEP is to be like LPUNC or RPUNC (maybe it can be called MPUNC instead).

(BTW, it seems blank will still need special handling, to collapse it to one character and to keep it from being considered a token, unless additional syntax is added.)

Hence blanks cannot be handled by it. While adding WORDSEP (or MPUNC), I can also add a WHITESPACE definition.

linas commented 7 years ago

Yes, could you implement MPUNC, if it's easy? To work like LPUNC, it would be a space-separated list, with some of them being multi-character, e.g. the double-dash.
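
Something like this sketch, assuming MPUNC reuses the quoted, space-separated token-list syntax of LPUNC/RPUNC (the token list here is only an illustration):

% illustrative only: middle punctuation to split at, including the
% multi-character double-dash and the underbar
"--" "_" "-" : MPUNC+;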

ampli commented 7 years ago

SIMPLE-STEM: /^[[:alnum:]]+-/

The problem is that part1-part2 is inserted into the wordgraph as part1-.= =part2, so part1-.= doesn't actually end with a - (the . is actually SUBSCRIPT_MARK).

Maybe there is a way to right-strip -^C=, but this will leave an unmarked "stem" and a bogus separator.

Yes, could you implement MPUNC, if it's easy? To work like LPUNC, it would be a space-separated list, with some of them being multi-character, e.g. the double-dash.

I will try to do that.

ampli commented 7 years ago

Changing the dash to [_-] so that I can match a dash or an underbar results in regexes that fail to compile, and I don't understand why.

The reason is the idiom checking in dictionary.c:78. (The said regex error causes a SEGFAULT in the regex library due to regfree(NULL), which I fixed.)

I didn't find a trivial fix for the "underbar in regex" problem. A similar problem of "dot in regex" exists too (the dot is patched to SUBSCRIPT_MARK), but it was easy to bypass (anysplit.c contains a hack for that).

Possible solutions for the affix file "underbar in regex" problem:

  1. Disregard '_' in quoted strings. But because quotes are removed very early, we currently cannot detect this in contains_underbar(). A solution may be to add a flag field in Dict_node_struct to signify "quoted string". But this would significantly enlarge it, because it is currently exactly 64 bytes. (A flag field there may have another use: to support a new dictionary syntax #define MACRO value instead of the current hacky use of dict entries for version etc.)
  2. Use the connector direction to indicate whether the string needs processing. Normally it would be '-' (instead of the current '+'), but for units it would be '+' to indicate underbar processing. However, this is a change in the file format.
  3. Support using a backslash before the underbar, to be used only for regexes. This is easy to implement, and can be extended to quote other characters (not needed for now).

I will implement number (3) for now, unless you have a better idea for that.
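
For illustration, with option (3) the REGPRE pattern from your earlier attempt would presumably be written with the underbar escaped (just a sketch; the exact escaping details are still open):

% sketch: the backslash keeps '_' from being treated as a dict underbar
"[[:alnum:]]+[\_-]" : REGPRE+;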

(BTW, the MPUNC change is almost ready.)

ampli commented 7 years ago

Which of any/ady/amy should handle LPUNC/MPUNC/RPUNC separation? (Currently RPUNC/LPUNC are defined only for any.)

linas commented 7 years ago

Which of any/ady/amy should handle LPUNC/MPUNC/RPUNC separation?

Not sure. I am using any very heavily, and I need it there, mostly because my input texts love to use crazy punctuation. For example:

PLATE X.—_Bedroom, by De Vries “Cubiculum.”_ ]

Beats me where the open square bracket went. The thing after X. is some Unicode long dash. The underscores are supposed to be some kind of quoting-like device, because the quotes are already in use for a different purpose. So I think I was seeing X.—_Bedroom come out as a single word, when clearly it should not.

It's all very confusing. In the long term, maybe ady/amy could discover punctuation on their own, but this is still far off in the future.

ampli commented 6 years ago

I essentially finished adding MPUNC, but there are still small details that need handling, especially repeated tokenizations due to punctuation stripping (an old problem that becomes more severe). I have just sent PR #564. I would like to try to use its token position tracking infrastructure to prevent repeated tokenization of the same word regions.

ampli commented 6 years ago

I am continuing to work on MPUNC, and I have encountered a need to change a current behaviour. Currently, all tokens are classified by regexes, with JUNK as the last resort. However, if we keep the general working idea of the current tokenizer, tokens that match a regex are not handled further, which would prevent further splitting of certain strings that contain punctuation (since the splitter repeatedly splits tokens), e.g. X.—_Bedroom.

Just as an example, here is what may happen with X.— (a substring of the example sentence) when — is defined as MPUNC (but not as LPUNC/RPUNC):

linkparser> X.—
...
Found 1 linkage (1 had no P.P. violations)
    Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=0)

    +---ANY--+
    |        |
LEFT-WALL X.—[?] 

(Of course, we could minimize such cases, if needed, by defining more tokens as LPUNC and RPUNC, so that they get split off too.)

[BTW, a general tokenizer could just split on any punctuation, and have a configurable list of punctuation on which to emit a split alternative (if those characters can also be treated as letters). In regular languages the dict would decide which combination is valid. Currently the English dict cannot cope with that, as it accepts some nonsense punctuation combinations as correct.]

linas commented 6 years ago

Treating X.— as a single word is acceptable.

Here's another example, for Chinese. Chinese is normally written without any spaces at all between words. There are end-of-sentence markers. Words may be one, two or three (or, rarely, more) hanzi (kanji) in length. A single sentence can be anywhere from half a dozen to several dozen hanzi in length. There are three strategies for dealing with this:

1) Run some other, outside program that splits sentences into words. Such programs exist, even open-source ones, but they are in the 85% accuracy range, and so seem to be a source of error.

2) Split between all hanzi, always (i.e. place a space between all hanzi, which could be done trivially by an external tool), and allow the LG dictionary to reassemble words based on linkages between single hanzi (i.e. the LG dict contains ONLY single hanzi).

3) Split into words, using the LG dict to guide splitting. That is, the LG dict will contain words (which might be 1, 2 or 3 hanzi long), and the splitter uses this as guidance.

I don't like 1. I'm not clear on whether 2 or 3 is preferable.

Option 2 pushes all the complexity into the dictionary, and depends on the parser for performance.

Option 3 pushes some of the complexity into the splitter. It makes the splitter more complex, more cpu-hungry, while making the parser run faster.

Option 2 has a dictionary with some 20K hanzi, but with lots and lots of morpheme-style disjuncts, so requires the parser to work through many combinations.

Option 3 has a dictionary with some 100K or 200K words, but each has far fewer disjuncts on it, making parsing faster (but splitting slower).

I don't know which approach is better, either performance-wise, or philosophically-theoretically-speaking.
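
For concreteness, here is a purely illustrative sketch of the kind of dict entries option 2 implies; the hanzi, the GLUE connector, and the S/O connectors standing in for word-external links are all invented for this example:

% illustration only: the dict holds single hanzi; an intra-word GLUE link
% reassembles a two-hanzi word, and the word-external links hang off the
% last hanzi.
中: GLUE+;
国: GLUE- & (S+ or O-);

Option 3 would instead list the assembled word itself (e.g. 中国) as a dict entry, carrying only the external links.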

Back to English: So, the options for handling X.— are kind of like 2 vs 3 above: we can either split it into many parts (option 2), or we can have a dict entry for it (option 3), with the dict entry being a regex.

What's your opinion? I think you are starting to realize how complex splitting can be; is that a good thing, or a bad thing? How can we balance between 2 & 3? You can ponder what might happen if we did Hebrew with 2 vs 3 (i.e. splitting on every letter in Hebrew), and what might happen if we did English/French/Russian using option 2.

ampli commented 6 years ago

A clarification regarding the X.— example: currently, any/ady/amy use the JUNK regex to classify anything which is not a suffix/stem/prefix. If we would like to split at punctuation, we cannot match words containing punctuation as JUNK. If such words remain unsplit (due to inadequate definitions of LPUNC/MPUNC/RPUNC), they will then be classified as unknown-word (marked with [?]). But we can always fix that (if at all needed) by adding more punctuation to LPUNC/MPUNC/RPUNC. Another way is to add code for splitting by regex, so we can split on [[:punct:]] if desired (exceptions can use the negative regex notation).

For the rest I will start at the end. I already tested a single-letter-linkage English dictionary. To that end, I have a Perl script that translates en/4.0.dict to a single-letter-linkage dictionary. It can generate two kinds of such dictionaries:

The second kind of dictionary can take a sentence without whitespace and infer the words when creating a linkage. It was extremely slow due to the extremely large number of disjuncts (so I ran the actual tests on the tiny dict). However, a special purge algorithm for that case (as we once discussed) may solve much of the slowness.

BTW, the regex tokenizer (still in the code, but with no test hook any more) can do the same using regular dictionaries, i.e. infer the word boundaries of sentences without whitespace. Specially tailored tokenizer code could do it very fast.

Of course, every ordinary dictionary (including the current Hebrew, Russian etc.) can be translated to a single-letter dictionary. In addition, with slight extensions, even 4.0.regex can be translated to a single-letter dictionary!

Option 3 has a dictionary with some 100K or 200K words, but each has far fewer disjuncts on it, making parsing faster (but splitting slower).

I think an efficient tokenizer can be written easily enough for option 3. Depending on the internal (in-memory) representation of the dict, it can even be extremely efficient.

ampli commented 6 years ago

Here are the results of my current any/ady/amy code:

$ link-parser any -m
...
linkparser> PLATE X.—_Bedroom, by De Vries “Cubiculum.”_ ]
link-grammar: Warning: Combinatorial explosion! nulls=0 cnt=2147483647
Consider retrying the parse with the max allowed disjunct cost set lower.
At the command line, use !cost-max
Found 2147483647 linkages (1000 of 1000 random linkages had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=5)

    +-------------ANY------------+------------ANY-----------+--------------ANY----
    +-----ANY-----+------ANY-----+            +-----ANY-----+-----ANY-----+       
    |             |              |            |             |             |       
LEFT-WALL PLATE[!ANY-WORD] X[!ANY-WORD] .[!ANY-PUNCT] —[!ANY-PUNCT] _[!ANY-PUNCT] 

---------+               +----------------------------ANY---------------------------+-------
         +------ANY------+-----ANY-----+-----ANY-----+------ANY------+              |       
         |               |             |             |               |              |       
Bedroom[!ANY-WORD] ,[!ANY-PUNCT] by[!ANY-WORD] De[!ANY-WORD] Vries[!ANY-WORD] “[!ANY-PUNCT] 

---------ANY---------------+             +------------ANY------------+
          +-------ANY------+-----ANY-----+-----ANY-----+-----ANY-----+
          |                |             |             |             |
Cubiculum[!ANY-WORD] .[!ANY-PUNCT] ”[!ANY-PUNCT] _[!ANY-PUNCT] ][!ANY-PUNCT] 

I also modified the definitions for ady/amy in the same way, and added MPUNC to en/4.0.affix. Now I need to do the final tests and polishing.

ampli commented 6 years ago

Handled in #575.