linas closed this issue 2 years ago.
Oh, duhh, this is a regex issue. But of course! On my system, configure
prints
Regex library: -lpcre2-8
In data/amy/4.0.regex I tried replacing
ANY-WORD: /^[[:alnum:]']+$/
with the following:
ANY-WORD: /^\pL+$/
ANY-WORD: /^\p{Xan}+$/
ANY-WORD: /^\p{Khmer}+$/
but none of these work.
The file dict-common/regex-morph.c uses PCRE2_UTF|PCRE2_UCP, so that seems correct ... It seems there is a bug in anysplit.c. I will investigate that.
The problem is caused by breaking grapheme clusters when doing the splits. As you can see, the rules are not too simple. Since PCRE2 knows these rules, I'm trying to define REGPRE, REGMID, and REGSUF to match only valid sequences (without success so far; the reason is still unknown).
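The breakage is easy to demonstrate without PCRE2: any split point that lands immediately before a Unicode combining mark falls inside a grapheme cluster. Here is a minimal sketch in stdlib Python (the function name is mine, and checking only for the mark categories Mn/Mc/Me is a rough approximation of the full UAX #29 rules):

```python
import unicodedata

def split_breaks_grapheme(word, pos):
    """Return True if splitting `word` before index `pos` would detach
    a combining mark (category Mn/Mc/Me) from its base character.
    A crude approximation of UAX #29 grapheme-cluster boundaries."""
    return 0 < pos < len(word) and \
        unicodedata.category(word[pos]).startswith("M")

# Burmese MA + MEDIAL HA (Mn) + VOWEL SIGN AA (Mc): one grapheme cluster
word = "\u1019\u103e\u102c"
print([split_breaks_grapheme(word, i) for i in range(len(word))])
# [False, True, True] -- both interior split points break the cluster
```

Since every interior split point of such a syllable breaks the cluster, anysplit has to treat the three codepoints as a single unit.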
Huh! Graphemes! But of course! If/when you get this working, can you put that URL into an appropriate README or into a code comment?
OK.
I'm now making tests, so I guess the PR will be ready only tomorrow.
Until then, here are the regexes, so you can test them too if you like:
4.0.affix:
"^(?:(?=\pL)\X|[[:alnum:]]|')+$" : REGPRE+;
"^(?:(?=\pL)\X|[[:alnum:]]|')+$" : REGMID+;
"^(?:(?=\pL)\X|[[:alnum:]]|')+$" : REGSUF+;
4.0.regex:
ANY-WORD: /^(?:(?=\pL)\X|[[:alnum:]]|')+$/
ANY-PUNCT: /^[[:punct:]]+$/
MOR-PREF: /^(?:(?=\pL)\X|[[:alnum:]]|')+=$/
MOR-STEM: /^(?:(?=\pL)\X|[[:alnum:]]|')+\x03=$/
MOR-SUFF: /^=(?:(?=\pL)\X|[[:alnum:]]|')+$/
JUNK: !/[[:punct:]]/
JUNK: /^/
\X matches a grapheme cluster, but it is the grapheme analog of an ASCII dot: it matches any cluster. Since we would like to match only the equivalent of [[:alnum:]], I added a lookahead for a Unicode letter. To stay close to the original regex, I added [[:alnum:]] and ' as alternatives (maybe replacing the lookahead with [[:alnum:]] is better, as there would then be no need to add it as an alternative; I will test it).
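Python's stdlib re has neither \X nor \pL, so as a cross-check here is a hand-rolled approximation of what the pattern accepts (the function name is mine; this mimics, rather than reproduces, the PCRE2 semantics):

```python
import unicodedata

def any_word_match(s):
    r"""Approximate /^(?:(?=\pL)\X|[[:alnum:]]|')+$/ by hand:
    accept letters, digits and apostrophes, and allow combining
    marks (category M*) only after a base character, mimicking the
    (?=\pL)\X alternative that keeps grapheme clusters whole."""
    if not s:
        return False
    have_base = False
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            if not have_base:      # a bare mark has no base to attach to
                return False
        elif cat.startswith("L") or cat == "Nd" or ch == "'":
            have_base = True
        else:
            return False
    return True

print(any_word_match("don't"))               # True
print(any_word_match("\u1019\u103e\u102c"))  # True: letter + two marks
print(any_word_match("\u102c"))              # False: a lone mark
```

Note that a piece beginning with a bare mark is rejected, which is exactly the Burmese failure mode discussed later in this thread.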
Also note the \x03: issue_word_alternative() uses SUBSCRIPT_MARK. You also encountered this problem and used . instead of \. (which depends on MOR-PREF not matching a dot).
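The difference between escaping the marker and using a bare . can be shown with a simplified stand-in pattern (stdlib Python re, with \w replacing the grapheme-aware part; the pattern here is only an illustration of MOR-STEM, not the real one):

```python
import re

SUBSCRIPT_MARK = "\x03"  # internal stem marker, per the discussion above

# Simplified MOR-STEM with the marker written out explicitly:
strict = re.compile(r"^\w+\x03=$")
# The shortcut of using a bare dot instead of the marker:
loose = re.compile(r"^\w+.=$")

print(bool(strict.match("walk\x03=")))  # True
print(bool(loose.match("walk\x03=")))   # True
print(bool(strict.match("walk.=")))     # False: only the real marker
print(bool(loose.match("walk.=")))      # True: the dot matches too
```

The bare dot works only as long as no other single character (such as a literal dot) can appear in that position, which is the dependency mentioned above.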
This makes amy depend on PCRE2, so I added a comment about that in the files.
I don't know if it is needed, but we can leave the old 4.0.regex and 4.0.affix (renamed) in the distribution for languages that don't need grapheme detection (this is also faster). In any case, we can recommend configuring with PCRE2 (it is also faster than the alternatives).
BTW, here is how to debug it:
link-parser amy -m -v=10 -de=anysplit.c,separate_word,regex-morph.c
Can you add the explanation above into the regex file? including the how-to-debug part?
I added it.
BTW, I didn't change the any configuration files for handling graphemes, but instead added a reference to amy/4.0.regex.
I said above:
for languages that don't need grapheme detection (this is also faster)
Instead of that, I have left the original regexes in a commented-out section, with an appropriate comment.
Huh! Graphemes! But of course! If/when you get this working, can you put that URL into an appropriate README or into a code comment?
I have not found an appropriate README, so I added it in a comment in anysplit().
I will also add a trivial test to validate that graphemes don't get broken in amy.
(To be sent after the next PR, which fixes problems related to Python.)
Looks like #1320 fixes most of the issues. I'm still seeing [!JUNK] appearing sometimes, and there are never any morpheme splits. Also, just as before, I get one parse only. However, by cutting away the double-dot colon-like character, I do get multiple parses (17 of them).
Likewise, for the Lao, cutting away the initial two characters gives 17 parses:
Note that doing this changes the rendering of the final characters as well! So I assume that the initial characters are some kind of grapheme-controlling/modulating characters. Perhaps quote marks of some kind, that inhibit the grapheme processing?
All the other examples above look reasonable. The combinatorial explosions went away, but they still have an unexpectedly large number of parses.
To conclude: Burmese is still broken. However, this might be a pcre2 issue. Also, the way Burmese renders in my terminal looks wrong; it's very different from what the browser shows. Perhaps it's due to some missing packages?
- Burmese ဤသည်မှာ ပိုရှည်သောဝါကျတစ်ခု၏ စမ်းသပ်မှုတစ်ခုဖြစ်သည်။
I'm investigating this...
The first word ဤသည်မှ (\u1024\u101e\u100a\u103a\u1019\u103e\u102c) causes the problem: \u102c doesn't match in the ANY-WORD regex (the previous ones match).
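The reason \u102c fails is its general category: it is a spacing combining mark, not a letter, so when a split piece begins with it, the (?=\pL) lookahead finds no letter and ANY-WORD rejects the piece. This can be confirmed from Python's Unicode database:

```python
import unicodedata

ch = "\u102c"  # the final codepoint of the failing word
print(unicodedata.name(ch))      # MYANMAR VOWEL SIGN AA
print(unicodedata.category(ch))  # Mc: spacing combining mark, not a letter
```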
- Khmer: នេះគឺជាការសាកល្បងនៃប្រយោគវែងជាងនេះ។
- Lao: ນີ້ແມ່ນການທົດສອບຂອງປະໂຫຍກທີ່ຍາວກວ່າ
Note that anysplit.c can handle up to 31 Unicode characters. It would be reasonable to change that to 31 graphemes (a change in anysplit.c). If needed, this number can be somewhat enlarged, but not by very much.
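The gap between the two limits can be estimated with a crude cluster counter (every non-mark codepoint starts a new cluster; full UAX #29 segmentation has more rules, so treat this only as an approximation):

```python
import unicodedata

def approx_grapheme_count(s):
    """Approximate grapheme-cluster count: every codepoint that is
    not a combining mark (category M*) starts a new cluster.
    Full UAX #29 segmentation has more rules than this."""
    return sum(1 for ch in s
               if not unicodedata.category(ch).startswith("M"))

word = "\u1024\u101e\u100a\u103a\u1019\u103e\u102c"  # Burmese, 7 codepoints
print(len(word), approx_grapheme_count(word))  # 7 4
```

So for scripts like Burmese, a 31-codepoint budget buys noticeably fewer graphemes than it would for Latin text.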
While I make changes in the regexes, I would like to know whether it is still necessary to include an apostrophe in them.
https://github.com/opencog/link-grammar/blob/38887f69064c439b7d6c7c0b928c4a2d4c8fc8e4/data/amy/4.0.regex#L18-L19
It is now possible to auto-split apostrophes using MPUNC. The original word is left as an alternative. If ' is included in MPUNC, including an apostrophe in ANY-WORD is needed only if you would also like to get the original word (i.e. including the apostrophe) as a morpheme.
get rid of the special-case treatment for the apostrophe ...
For your amusement: gmail has started flagging the emails containing the grapheme examples as possible-spam/possibly-harmful ("are you sure?").
- Burmese ဤသည်မှာ ပိုရှည်သောဝါကျတစ်ခု၏ စမ်းသပ်မှုတစ်ခုဖြစ်သည်။
The phrase ဤသည်မှ ends with a grapheme which consists of only a mark character. See number 5 here. So I added this possibility to the regexes.
The phrase ာ ပိုရှည်သောဝါကျတစ်ခု၏ ends with punctuation ၏, named MYANMAR SYMBOL GENITIVE. See here for more details about Burmese punctuation (including this one).
Since we don't split on script-specific punctuation, I added "other punctuation" (the Unicode category of this particular punctuation character) to the word/part regexes (\pPo). Maybe I should have added additional categories for this or other scripts; I just don't know.
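For reference, the category of this character, which is what \pPo keys on, can be checked with Python's Unicode database:

```python
import unicodedata

ch = "\u104f"  # the Burmese genitive symbol
print(unicodedata.name(ch))      # MYANMAR SYMBOL GENITIVE
print(unicodedata.category(ch))  # Po: "other punctuation"
```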
There are many additional Unicode categories. I don't know which of them are valid in words.
Maybe we can just accept any non-punctuation character in a word, and split on any punctuation. Such a split would need code to accept something like /\pP/: LPUNC.
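Such a splitter is straightforward to sketch; here is a stdlib-Python illustration of splitting on any character whose general category is in the punctuation (P*) group (an idea sketch only, not the link-grammar tokenizer):

```python
import unicodedata

def split_on_punct(text):
    """Split on any Unicode punctuation character (category P*),
    keeping the punctuation as separate tokens."""
    tokens, cur = [], ""
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            if cur:
                tokens.append(cur)
                cur = ""
            tokens.append(ch)  # punctuation becomes its own token
        else:
            cur += ch
    if cur:
        tokens.append(cur)
    return tokens

print(split_on_punct("don't"))      # ['don', "'", 't']
print(split_on_punct("a\u104fb"))   # ['a', '\u104f', 'b']
```

It would peel off ၏ from a Burmese piece in the same way it peels the apostrophe out of don't.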
A comment regarding the functionality of "amy": The Burmese language is written in space-separated phrases, not words, and there are no spaces at all in Lao and Khmer (as in some other languages). So "amy" splits such languages into pieces that include several words, often starting and ending with half-words. At first glance at least, this seems not very useful.
Maybe I should have added additional categories for this or other scripts; I just don't know. ... So "amy" splits such languages into pieces that include several words, often starting and ending with half-words. At first glance at least, this seems not very useful.
The long-term goal is to automate tokenization, of which #1311 "scratches the surface".
The strategic goal is to also perform this kind of tokenization on audio streams, 2D photos, and other types of data (and I have some vague ideas how to accomplish this, written up elsewhere).
The strategic vision is to be able to perceive entities in arbitrary sensory environments.
So, for the short and medium-term, relying on explicit knowledge of unicode in utf8 byte streams seems just fine, for now. It will be a while before a learning pipeline can be unleashed on Lao.
More narrowly: if some of the morphemes are actually two-and-a-half words, presumably the statistics will notice that, and unweight them. The point of AMY is to provide a kind of uniform statistical sampling (uniform in the space of trees) The actual structure cannot be determined until there have been enough samples.
I'm going to close this now, since #1321 appears to resolve all remaining issues.
If you want to re-open, and/or have other things to document, discuss -- please, go ahead!
Many thanks for fixing this; @ampli, you're awesome! This stuff wouldn't work without you!
This recaps posts made on pull req #1312 -- basically AMY works great with some character sets, but not others.
Amy works great on the following
Fails giving one parse only:
Anything that uses the Sanskrit or a related writing system has a count overflow, e.g.
The Thai phrase นี่คือการทดสอบประโยคที่ยาวขึ้น does come up with [!JUNK] on the first linkage, and it never uses the LL link type, whereas the prior three do use the LL link type. On very rare occasions, the = sign prints weird... although the above is not how it prints in my terminal window. Heh. I'm guessing that amy chose to insert a break where breaks are not allowed. The =ี่ seems to be some kind of character modifier, and I think there must never be a break there. The Sanskrit variant इदं दीर्घतरवाक्यस्य परीक्षा अस्त gives a combinatorial explosion, which is unexpected:
and again, the LL link is never used. Surprising.