opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

AMY language difficulties #1315

Closed. linas closed this 2 years ago.

linas commented 2 years ago

This recaps posts made on pull request #1312 -- basically, AMY works well with some character sets, but not others.

Amy works well on the following:

These fail, giving only one parse:

Anything that uses the Sanskrit or a related writing system has a count overflow, e.g.

The Thai phrase นี่คือการทดสอบประโยคที่ยาวขึ้น does come up [!JUNK] on the first linkage, and it never uses the LL link type, whereas the prior three do. On very rare occasions, the = sign prints weirdly:

linkparser> 
    Linkage 30, cost vector = (UNUSED=0 DIS= 0.00 LEN=1)

            +--------------ANY-------------+
    +--ANY--+              +------ANY------+
    |       |              |               |
LEFT-WALL นี่=[?] คือการทดสอบประโยคท[?].= =ี่ยาวขึ้น[?]

... although the above is not how it prints in my terminal window. Heh. I'm guessing that amy chose to insert a break where breaks are not allowed. The =ี่ fragment seems to be some kind of character modifier (Thai combining vowel and tone marks), and I think there must never be a break there.

The Sanskrit variant इदं दीर्घतरवाक्यस्य परीक्षा अस्ति gives a combinatorial explosion, which is unexpected:

linkparser> इदं दीर्घतरवाक्यस्य परीक्षा अस्ति
link-grammar: Warning: Combinatorial explosion! nulls=0 cnt=2147483647
Consider retrying the parse with the max allowed disjunct cost set lower.
At the command line, use !cost-max
Found 2147483647 linkages (301 of 301 random linkages had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=1)

    +-----------ANY----------+
    |         +------ANY-----+-------ANY------+-----ANY----+
    |         |              |                |            |
LEFT-WALL इदं[!JUNK] दीर्घतरवाक्यस्य[!JUNK] परीक्षा[!JUNK] अस्ति[!JUNK]

Press RETURN for the next linkage.
linkparser> 
    Linkage 2, cost vector = (UNUSED=0 DIS= 0.00 LEN=2)

    +---------ANY---------+----------ANY---------+
    |         +----ANY----+----ANY----+          +-----ANY----+
    |         |           |           |          |            |
LEFT-WALL इदं[!JUNK] दीर्घतरवा[?].= =क्यस्य[?] परीक्षा[!JUNK] अस्ति[!JUNK]

and again, the LL link is never used. Surprising.

linas commented 2 years ago

Oh, duhh, this is a regex issue. But of course! On my system, configure prints

Regex library:                  -lpcre2-8

In data/amy/4.0.regex, I tried replacing ANY-WORD: /^[[:alnum:]']+$/ with the following

but none of these work.

The file dict-common/regex-morph.c is using PCRE2_UTF|PCRE2_UCP, so that seems correct ...

ampli commented 2 years ago

It seems there is a bug in anysplit.c. I will investigate that.

ampli commented 2 years ago

The problem is caused by breaking grapheme clusters when doing the splits. As you can see, the rules are not simple. Since PCRE2 knows these rules, I'm trying to define REGPRE, REGMID, and REGSUF to match only valid sequences (without success so far; the reason is still unknown).
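
For illustration only (this is not code from anysplit.c): a minimal PCRE2 sketch that walks a UTF-8 string one \X grapheme cluster at a time. The boundaries it reports are the only byte offsets where a split is safe; the Thai syllable นี่, for example, is a single cluster (base consonant น plus two combining marks).

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* นี่คือ: each base consonant plus its combining marks is one cluster. */
        PCRE2_SPTR subject = (PCRE2_SPTR)"นี่คือ";
        int errcode;
        PCRE2_SIZE erroffset;

        /* \X matches exactly one extended grapheme cluster. */
        pcre2_code *re = pcre2_compile((PCRE2_SPTR)"\\X", PCRE2_ZERO_TERMINATED,
                                       PCRE2_UTF | PCRE2_UCP,
                                       &errcode, &erroffset, NULL);
        if (re == NULL) return 1;

        pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
        PCRE2_SIZE off = 0, len = strlen((const char *)subject);

        while (off < len && pcre2_match(re, subject, len, off, 0, md, NULL) >= 0)
        {
            PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
            /* A split is valid only at the cluster boundary ov[1]. */
            printf("cluster at bytes %zu..%zu: %.*s\n",
                   (size_t)ov[0], (size_t)ov[1],
                   (int)(ov[1] - ov[0]), (const char *)subject + ov[0]);
            off = ov[1];
        }

        pcre2_match_data_free(md);
        pcre2_code_free(re);
        return 0;
    }

Build with something like cc grapheme.c $(pcre2-config --libs8).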

linas commented 2 years ago

Huh! Graphemes! But of course! If/when you get this working, can you put that URL into an appropriate README or into a code comment?

ampli commented 2 years ago

If/when you get this working, can you put that URL into an appropriate README or into a code comment?

OK.

I'm writing tests now, so I guess the PR will only be ready tomorrow.

Until then, here are the regexes, so you can test them too if you like.

4.0.affix:

"^(?:(?=\pL)\X|[[:alnum]]|')+$" : REGPRE+;
"^(?:(?=\pL)\X|[[:alnum]]|')+$" : REGMID+;
"^(?:(?=\pL)\X|[[:alnum]]|')+$" : REGSUF+;

4.0.regex:

ANY-WORD:  /^(?:(?=\pL)\X|[[:alnum:]]|')+$/
ANY-PUNCT:  /^[[:punct:]]+$/

MOR-PREF: /^(?:(?=\pL)\X|[[:alnum:]]|')+=$/
MOR-STEM: /^(?:(?=\pL)\X|[[:alnum:]]|')+\x03=$/
MOR-SUFF: /^=(?:(?=\pL)\X|[[:alnum:]]|')+$/

JUNK: !/[[:punct:]]/
JUNK: /^/

\X matches a grapheme cluster, but it is similar to a dot for ASCII. Since we would like to match only the equivalent of [[:alnum:]], I added a lookahead for a Unicode letter. To stay similar to the original regex, I added [[:alnum:]] and ' as alternatives (maybe replacing the lookahead with [[:alnum:]] is better, as there would then be no need to add it as an alternative - I will test that).
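
As a quick sanity check outside of link-parser (a shell sketch, not output quoted from the actual build), pcre2grep can exercise the same pattern; the inline (*UTF)(*UCP) directives avoid depending on command-line flag defaults:

    $ printf 'นี่\n' | pcre2grep "(*UTF)(*UCP)^(?:(?=\pL)\X|[[:alnum:]]|')+$"
    นี่

The syllable matches as a whole: the lookahead sees the base letter น, and \X then consumes its combining marks along with it. A bare combining mark, by contrast, matches neither alternative.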

Also note the \x03: issue_word_alternative() uses SUBSCRIPT_MARK. You also encountered this problem and used . instead of \. (which depends on MOR-PREF not matching a dot).

This makes amy depend on PCRE2, so I added a comment about that in the files.

I don't know if it is needed, but we can leave the old 4.0.regex and 4.0.affix (renamed) in the distribution for languages that don't need grapheme detection (this is also faster). In any case, we can recommend configuring with PCRE2 (it is also faster than the alternatives).

BTW, here is how to debug it: link-parser amy -m -v=10 -de=anysplit.c,separate_word,regex-morph.c

linas commented 2 years ago

Can you add the explanation above into the regex file? including the how-to-debug part?

ampli commented 2 years ago

Can you add the explanation above into the regex file? including the how-to-debug part?

I added it. BTW, I didn't change the configuration files of the any language for handling graphemes; instead, I added a reference to amy/4.0.regex.

I said above:

for languages that don't need grapheme detection (this is also faster)

Instead of that, I have left the original regexes in a commented-out section, with an appropriate comment.

Huh! Graphemes! But of course! If/when you get this working, can you put that URL into an appropriate README or into a code comment?

I have not found an appropriate README so I added it in a comment in anysplit().

I will also add a trivial test to validate graphemes are not getting broken in amy.

(To be sent after the next PR, which fixes problems related to Python.)

linas commented 2 years ago

Looks like #1320 fixes most of the issues. I'm still seeing:

Also, just as before,

these give only one parse, and there are no morpheme splits. However ... by cutting away the double-dot, colon-like character, I do get multiple parses (17 of them).

Likewise, for the Lao, cutting away the initial two characters gives 17 parses:

Note that doing this changes the rendering of the final characters as well! So I assume that the initial characters are some kind of grapheme-controlling/modulating characters. Perhaps quote marks of some kind, which inhibit the grapheme processing?

All the other examples above look reasonable. The combinatorial explosions went away, but they still have an unexpectedly large number of parses.

To conclude: Burmese is still broken. However, this might be a pcre2 issue. Also, the way Burmese renders on my terminal looks wrong; it's very different from what the browser shows. Perhaps it's due to some missing packages?

ampli commented 2 years ago
  • Burmese ဤသည်မှာ ပိုရှည်သောဝါကျတစ်ခု၏ စမ်းသပ်မှုတစ်ခုဖြစ်သည်။

I'm investigating this... The first word ဤသည်မှ (\u1024\u101e\u100a\u103a\u1019\u103e\u102c) causes the problem: \u102c doesn't match in the ANY-WORD regex (the previous ones match).

  • Khmer: នេះគឺជាការសាកល្បងនៃប្រយោគវែងជាងនេះ។
  • Lao: ນີ້ແມ່ນການທົດສອບຂອງປະໂຫຍກທີ່ຍາວກວ່າ

Note that anysplit.c can handle up to 31 Unicode characters. It would be reasonable to change that to 31 graphemes (a change in anysplit.c). If needed, this number can be somewhat enlarged, but not by very much.
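
A hypothetical sketch of that change (this helper does not exist in anysplit.c; it assumes a \X pattern precompiled with PCRE2_UTF|PCRE2_UCP, as in the earlier example): count clusters rather than codepoints, and apply the 31-unit limit to the result.

    /* Hypothetical, not current anysplit.c code: count grapheme clusters
     * by repeatedly matching a precompiled "\X" pattern. */
    static size_t grapheme_count(const pcre2_code *re, PCRE2_SPTR s, PCRE2_SIZE len)
    {
        pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
        size_t n = 0;
        PCRE2_SIZE off = 0;

        while (off < len && pcre2_match(re, s, len, off, 0, md, NULL) >= 0)
        {
            off = pcre2_get_ovector_pointer(md)[1];
            n++;
        }
        pcre2_match_data_free(md);
        return n;  /* compare this, not the codepoint count, against 31 */
    }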

ampli commented 2 years ago

This site is very useful for Unicode analysis and more. Here are the graphemes of ဤသည်မှ.

The last one (no. 5) doesn't match my amy Unicode regexes. The best fix may be to make anysplit.c grapheme-aware (not hard). But for now, I will think of better regexes in order to fix this problem.

ampli commented 2 years ago

While I make changes to the regexes, I would like to know whether it is still needed to include an apostrophe in them.

https://github.com/opencog/link-grammar/blob/38887f69064c439b7d6c7c0b928c4a2d4c8fc8e4/data/amy/4.0.regex#L18-L19

It is now possible to auto-split apostrophes using MPUNC. The original word is left as an alternative. If ' is included in MPUNC, including an apostrophe in ANY-WORD is needed only if you would like to also get the original word (i.e., including the apostrophe) as a morpheme.

linas commented 2 years ago

get rid of the special-case treatment for the apostrophe ...

linas commented 2 years ago

For your amusement: gmail has started flagging the emails containing the grapheme examples as possible-spam/possibly-harmful ("are you sure?").

ampli commented 2 years ago
  • Burmese ဤသည်မှာ ပိုရှည်သောဝါကျတစ်ခု၏ စမ်းသပ်မှုတစ်ခုဖြစ်သည်။
  1. The phrase ဤသည်မှ ends with a grapheme that consists of only a mark character. See number 5 here. So I added this possibility to the regexes.

  2. The phrase ာ ပိုရှည်သောဝါကျတစ်ခု၏ ends with the punctuation character ၏, named MYANMAR SYMBOL GENITIVE. See here for more details about Burmese punctuation (including this one). Since we don't split on script-specific punctuation, I added "other punctuation" (the Unicode category of this particular character) to the word/part regexes (\p{Po}). Maybe I should have added additional categories for this or other scripts - I just don't know.

  3. There are many additional Unicode categories. I don't know which of them are valid in words. Maybe we can just accept any non-punctuation character in a word, and split on any punctuation. Such a split will need code to accept something like /\pP/: LPUNC (see the sketch just below).
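
A hypothetical 4.0.affix fragment illustrating that idea; regex-valued affix entries of this form are not part of the current syntax, so this is a sketch of the proposal, not working configuration:

    % Hypothetical syntax: strip any Unicode punctuation (category P).
    "/\pP/": LPUNC+;
    % Similar entries could cover RPUNC and MPUNC.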


A comment regarding the functionality of "amy": Burmese is written in space-separated phrases, not words, and Lao and Khmer (like some other languages) use no spaces at all. So "amy" splits such languages into pieces that include several words, often starting and ending with half-words. At first glance at least, this does not seem useful.

linas commented 2 years ago

Maybe I should have added additional categories for this or other scripts - I just don't know. ... So "amy" splits such languages into pieces which include several words, often starting and ending with half-words. At first glance at least, this seems unuseful.

The long-term goal is to automate tokenization, of which #1311 "scratches the surface".

The strategic goal is to also perform this kind of tokenization on audio streams, 2D photos, and other types of data (and I have some vague ideas how to accomplish this, written up elsewhere).

The strategic vision is to be able to perceive entities in arbitrary sensory environments.

So, for the short and medium term, relying on explicit knowledge of Unicode in UTF-8 byte streams seems just fine for now. It will be a while before a learning pipeline can be unleashed on Lao.

More narrowly: if some of the morphemes are actually two-and-a-half words, presumably the statistics will notice that and unweight them. The point of AMY is to provide a kind of uniform statistical sampling (uniform in the space of trees). The actual structure cannot be determined until there have been enough samples.

linas commented 2 years ago

I'm going to close this now, since #1321 appears to resolve all remaining issues.

If you want to re-open, and/or have other things to document or discuss -- please, go ahead!

Many thanks for fixing this; @ampli, you're awesome! This stuff wouldn't work without you!