opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

anysplit issues. #482

Open linas opened 7 years ago

linas commented 7 years ago

consider this:

link-parser amy
linkparser> !morp
linkparser> adsfasdfasdfasdf

This attempts the case with one part (no splits) and with 3 parts (two splits); it never outputs any attempts to split into two parts.

Editing data/amy/4.0.affix and changing 3: REGPARTS+; to 4: REGPARTS+; never generates splits into 4 parts.

I tried setting "w|ts|pts|pms|pmms|ptts|ps|ts": SANEMORPHISM+; but this seemed to have no effect.

ampli commented 7 years ago

In addition to fixing anysplit.c, I fixed only the amy dict... I didn't look at the any dict. I will try to fix it too.

But note that 4 parts currently get split as: pref= stem.= =suf1 =suf2. To mark them in another way, I need exact directions on how to do it.

linas commented 7 years ago

sorry, I meant "amy".

ampli commented 7 years ago

I took a glance at any. Its dict is not designed for morphology at all. So maybe no change is needed, and you need to test the amy dict with more than 3 parts.

linas commented 7 years ago

I just tried "w|ts|pts|pms|pmss|ptss|ps|ts": SANEMORPHISM+; i.e. using two suffixes, together with 4: REGPARTS+; and that does not seem to fix anything.

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

ampli commented 7 years ago

The problem is that the current amy/4.0.affix in the repository is not the version that I included in PR #481. If you replace it with the version I provided, then it gets split fine as intended.

EDIT: My error. The file at master now also works fine with 3 parts. The bug is now with > 3 parts...

ampli commented 7 years ago

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

In the common case it is up to 4 pieces at the start of a word. As a demonstration of what is possible, people have also constructed a 5-piece prefix. So 5 is the answer for prefixes. Only certain letters can be included in such a prefix. Each such piece consists of 1-3 characters. There are about 12 such strings (depending on how you count them). Of course it is very common that what can be viewed as a prefix is actually an integral part of the word, and an isolated word may commonly have several meanings, depending on how many pieces you consider as a prefix and how many as an integral part of the word (creating a vast ambiguity). These start pieces have concatenative morphology (with a slight twist that I have not mentioned).

The end of a regular word can also include some (entirely different) morphemes (usually 1-2 letters each). I think up to 2.

Verb inflections have their own different prefixes/suffixes. Their morphology is not concatenative but, interestingly, there is a concatenative approximation for them (if you use a kind of artificial "base").

(Note also that you have a hard time concluding anything from derivational morphology - there are no definite rules in it.)

But how are you going to handle languages which have different codepoints for the same letter, depending on its position in the word? At first glance, this seems to ruin morphology conclusions unless you account for that fact (e.g. by preprocessing, the equivalent of lowercasing the first letter in English).

ampli commented 7 years ago

I have just said:

The problem is that the current amy/4.0.affix in the repository is not the version that I included in PR #481. If you replace it with the version I provided, then it gets split fine as intended.

This is true for 3 parts, as in PR #481. When using the correct amy/4.0.affix, indeed all seems to be fine.

But when I change it to 4, I also get a problem. I'll send a PR to correct it...

EDIT: The file at master now also works fine with 3 parts. The bug is now with > 3 parts...

ampli commented 7 years ago

The actual problem is that 4 or more parts are currently translated to "multi suffix". I.e., adsfasdfasdfasdf can be broken as: adsf= asdfa.= =sdfa =sdf. But amy/4.0.dict doesn't provide a way for that to have a linkage!

It can be fixed in several ways, all of which need a dict modification:

  1. Provide me with another scheme to mark more than 3 parts.
  2. You can use a marking for middle morphemes, done solely in the dict: =SUF.= These middle morphemes can be linked to stems (if they have {@LL+}), or to previous middle morphemes, or to both, as you like (the same applies to "real" suffixes, i.e. the last token).

I think option (2) is reasonable.

linas commented 7 years ago

On Wed, Jan 25, 2017 at 4:12 PM, Amir Plivatsky notifications@github.com wrote:

Side question: for Hebrew, if I had to split a word into all of its morphological components, how many pieces might it have (in the common cases)? I get the impression that almost every letter could be a morpheme by itself; is 6 enough, or would more be needed?

In the common case it is up to 4 pieces at the start of a word. As a demonstration of what is possible, people have also constructed a 5-piece prefix. So 5 is the answer for prefixes. Only certain letters can be included in such a prefix. Each such piece consists of 1-3 characters. There are about 12 such strings (depending on how you count them). Of course it is very common that what can be viewed as a prefix is actually an integral part of the word, and an isolated word may commonly have several meanings, depending on how many pieces you consider as a prefix and how many as an integral part of the word (creating a vast ambiguity). These start pieces have concatenative morphology (with a slight twist that I have not mentioned).

The end of a regular word can also include some (entirely different) morphemes (usually 1-2 letters each). I think up to 2.

Grand total sounds like maybe 7-8, plus maybe 3 more for verbs. Whew.

Verb inflections have their own different prefixes/suffixes. Their morphology is not concatenative but, interestingly, there is a concatenative approximation for them (if you use a kind of artificial "base").

I understand it's not concatenative; I'm hoping that the more complex syntactic structures are enough to get this done right.

But how are you going to handle languages which have different codepoints for the same letter, depending on its position in the word? At first glance, this seems to ruin morphology conclusions unless you account for that fact (e.g. by preprocessing, the equivalent of lowercasing the first letter in English).

Don't know. I'm also planning on not downcasing, and just seeing what happens. Ask again in a few months. I'm still in very early stages, and just trying to get a map for what the software needs to support.

--linas

linas commented 7 years ago

It can be fixed in several ways, all of which need a dict modification:

  1. Provide me with another scheme to mark more than 3 parts.
  2. You can use a marking for middle morphemes, done solely in the dict: =SUF.= These middle morphemes can be linked to stems (if they have {@LL+}), or to previous middle morphemes, or to both, as you like (the same applies to "real" suffixes, i.e. the last token).

I think option (2) is reasonable.

Ah! Yes. Clearly, I did not try very hard. My current plan is to use only two link types at the initial stages: one type between morphemes in the same word, and another between words. Later stages will create more link types, including appropriate ones for suffixes, prefixes, etc.

linas commented 7 years ago

OK, I just fixed something, and now a new issue arises (really, the original issue): due to the interaction between the morphology/wordgraph and the parser, the vast majority of parses are not sane morphisms. Thus, in pull reqs #486 and #485, I re-wrote an algo that keeps looping until it finds more sane morphisms. This works ... sort of. For 4-word sentences, the sane morphisms can be one in a thousand. For 8-12 word sentences, the result can be one sane morphism in a million, which is far, far too many to examine, just to find one that works.

So, at the moment, splitting words into three parts kind-of-ish works, on shorter sentences, but clearly, splitting into even more parts will not work, even on the shortest sentences.

linas commented 7 years ago

And there is another, new strange issue: a single word of 8 letters now gets split into 1 or 2 or 3 parts, more or less as it should.

A single word of 11 letters is never split: 54 linkages are reported, and all but 3 of them are the same, and completely unsplit! This did work after pull req #481, but something later broke it. Bisecting now.

ampli commented 7 years ago

For 8-12 word sentences, the result can be one sane morphism in a million, which is far, far too many to examine, just to find one that works.

It is possible to add an alternatives-position-hierarchy comparison to expression_prune(), power_prune(), and even to the fast-matcher (in which it can even be cached), so matching mixed alternatives will be pruned ahead of time. Maybe even adding it only to expression_prune() will drastically increase the density of good results.

Also note that there is a memory leak in partial_init_linkage(). (Strangely, there is also a big memory leak with amy+sat - I will look at that.)

linas commented 7 years ago

Ah, sort-of found it. amy/4.0.affix had 4: REGPARTS+; and when I set it back to 3, it behaves more reasonably... but there are still issues. Words with 20 letters will often not split at all, and when they do split, they often split exactly the same way. Test case: a single "sentence" consisting of one long word. Somehow, the randomness is poorly distributed.

ampli commented 7 years ago

With more than 3 parts you need the dict change described above...

With a >20-letter word it looks fine for me.

To see the sampling, use the following: link-parser amy -m -v=9 -debug=anysplit.c

When I tried it with the word abcdefghijklmnopqrstuvwxyz the results looked "reasonable".

ampli commented 7 years ago

The whole word is issued only once, as you can see by: link-parser amy -m -v=7 -debug=anysplit.c,issue_word_alternative,flatten_wordgraph,print.c

The fact that many linkages include only it seems to be an artifact of the classic parser. With the sat-parser this doesn't happen on my test word abcdefghijklmnopqrstuvwxyz.

linas commented 7 years ago

It is possible to add an alternatives-position-hierarchy comparison to expression_prune(), power_prune(), and even to the fast-matcher (in which it can even be cached), so matching mixed alternatives will be pruned ahead of time. Maybe even adding it only to expression_prune() will drastically increase the density of good results.

Should I ask you to do this? It's kind of low-priority, but it's a blocker for more complex morphology work.

Also note that there is a memory leak in partial_init_linkage().

OK, thanks, I think I fixed it in #487

linas commented 7 years ago

When I tried it with the word abcdefghijklmnopqrstuvwxyz the results looked "reasonable".

I get this:


linkparser> abcdefghijklmnopqrstuvwxyz
Found 1766 linkages (162 of 162 random linkages had no P.P. violations)
Linkage 1, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)
+------------------ANY------------------+
|             +------------LL-----------+
|             |                         |

LEFT-WALL abc[!MOR-STEM].= =defghijklmnopqrstuvwxyz[!MOR-SUFF]

Press RETURN for the next linkage. linkparser> Linkage 2, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

+----------ANY----------+
|                       |

LEFT-WALL abcdefghijklmnopqrstuvwxyz[!ANY-WORD]


and then linkages 3, 7, 12 are the same as 1,
and linkages 4, 5, 6, 8, 9, 10, 11 are the same as 2,

and so on; the first one that's different is linkage 28:

linkparser> Linkage 28, cost vector = (CORP=0.0000 UNUSED=0 DIS= 0.00 LEN=0)

+---------------------------ANY--------------------------+
|               +--------PL--------+----------LL---------+
|               |                  |                     |

LEFT-WALL abcdefgh=[!MOR-PREF] ijk[!MOR-STEM].= =lmnopqrstuvwxyz[!MOR-SUFF]


and then it's back to case 1 or 2 until linkage 53 ...
linas commented 7 years ago

Ah, indeed, with SAT, that repeated-linkage issue goes away. Maybe with the classic algo, the random selector keeps hitting the same combination, over and over. I think I can kind-of guess why; it's a side-effect of the sparsity.

linas commented 7 years ago

Flip side: I tried the SAT parser on "Le taux de liaison du ciclésonide aux protéines plasmatiques humaines est en moyenne de 99 %." and after 8+ minutes of CPU, it's still thinking about it. Clearly, there's a combinatorial explosion, so even here, expression_prune() and power_prune() will be needed. Although I'm confused ... if I think about it naively, adding a sane-morphism check to expression-prune won't fix anything, will it?

What would fix things would be to have a different, unique link type for each different splitting, although that then makes the counting algorithm a bit trickier. I'd have to think about that some more.

ampli commented 7 years ago

If you test this sentence using: link-parser amy -u -m -v=7 -debug=sane_linkage_morphism,print_with_subscript_dot you will see that it generates linkages very quickly. However, you will also see that due to the extremely low density of good linkages, all of them are rejected by the sane-morphism check...

In the sat-parser this can be solved by using sane-morphism constraints (thus making the use of the sane_linkage_morphism() function there unnecessary). Theoretically this is said to make it faster for linkages with potential mixing.

Maybe with the classic algo, the random selector keeps hitting the same combination, over and over.

I tend to think so - this needs checking.

I think I can kind-of guess why; it's a side-effect of the sparsity.

Is there any sparsity left after the fix that deletes the bad (insane-morphism) linkages?

if I think about it naively, adding a sane-morphism check to expression-prune won't fix anything, will it?

You are right. The related fix should be applied to power_prune().

What would fix things would be to have a different, unique link type for each different splitting, although that then makes the counting algorithm a bit trickier. I'd have to think about that some more.

Maybe a kind of "checksum" can be computed for each linkage and hashed, enabling rejection of identical linkages.
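For concreteness, here is a minimal sketch of that checksum idea, under the assumption of a simplified Link record and a naive seen-table; this is not the real LG data structure, just an illustration of hashing the link set (FNV-1a) so that repeated identical linkages could be skipped:

    /* Illustrative sketch only: a simplified Link record, not the real LG
     * structures.  Hash the links of a linkage (left word, right word,
     * label) with FNV-1a; identical linkages then hash to the same value
     * and can be skipped.  Assumes the links come in a canonical order
     * (otherwise sort them first). */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { int lword, rword; const char *label; } Link;

    static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t n)
    {
        while (n--) { h ^= *p++; h *= 1099511628211ULL; }
        return h;
    }

    static uint64_t linkage_checksum(const Link *links, size_t nlinks)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
        for (size_t i = 0; i < nlinks; i++)
        {
            h = fnv1a(h, (const unsigned char *)&links[i].lword, sizeof(int));
            h = fnv1a(h, (const unsigned char *)&links[i].rword, sizeof(int));
            for (const char *s = links[i].label; *s != '\0'; s++)
                h = fnv1a(h, (const unsigned char *)s, 1);
        }
        return h;
    }

    /* Linear-scan "set" of already-seen checksums; true on first sighting. */
    static bool linkage_is_new(uint64_t *seen, size_t *nseen, uint64_t h)
    {
        for (size_t i = 0; i < *nseen; i++)
            if (seen[i] == h) return false;
        seen[(*nseen)++] = h;
        return true;
    }

A real implementation would of course use a proper hash table, and would have to decide whether duplicate rejection happens before or after the sane-morphism check.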

ampli commented 7 years ago

Here is my plan for mixed-alternatives pruning:

  1. Put in each connector the word member from its disjunct.
  2. Use it in possible_connection() to reject matches between connectors of different alternatives.

In order not to increase the size of the Connector struct, I thought of sharing tableNext:

struct Connector_struct
{
    ...
    union
    {
        Connector * tableNext;
        const Gword **word;
    };
};

This way no changes are needed in the usage of tableNext. Just before the call to power_prune() this word can be assigned.

I have no idea how much overhead this may add to sentences with a few alternatives. For sentences with no alternatives at all this can be skipped, and for sentences with many alternatives I guess it may significantly reduce the linkage time.

linas commented 7 years ago

The sparsity is still there, in the classic algo. For the abcdefghijklmnopqrstuvwxyz test, it counts 1766 linkages, but then, with random sampling, finds only 162 out of 1000 random samples.

If I increase the limit to 2000, then it counts as before, but then later revises that count to 17, because it can exhaustively enumerate all of them. It's sort of a surprising behavior, that the exhaustive attempt revises the count; it's kind of a feature-bug, I guess.

Hashing the linkage sounds like a good idea. But fixing the sparsity at an earlier stage seems more important.

Playing with unions is kind-of like playing with fire. I know that performance is sensitive to the connector size, but I don't recall any specifics. At one point, I recall measuring that 2 or 3 or 5 connectors would fit onto one cache line, which at the time seemed like a good thing. Now I'm less clear on this.

There is a paper on link-grammar describing "multi-colored" LG: connectors would be sorted into different "colored" categories that would work independently of each other. This allowed the authors to solve some not-uncommon linguistic problem, although I don't recall quite what. Because they're independent, there are no link-crossing constraints between different colors -- there are no constraints at all between different colors.

Given how Bruce Can was describing Turkish, it seems like it might be a language that would need multi-colored connectors.

Of course, I'm talking about this because perhaps, instead of thinking "this gword/morpheme can only connect to this other gword/morpheme", and enforcing it in possible_connection() -- perhaps a better "mindset" would be to think: "this gword/morpheme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-". The end result is the same, but the change of viewpoint might make it more natural and naturally extensible... (clearly, it's islands_ok for these blue connectors)

ampli commented 7 years ago

Playing with unions is kind-of like playing with fire.

In that particular case there is no problem, as tableNext is not in use after expression_prune().

"this gword/morpheme has a blue-colored connector GM67485+ that can only connect to this other blue-colored connector GM67485-"

The problem is that these color labels are scalars, while a token hierarchy position is a vector. See wordgraph_hier_position() and in_same_alternative().

linas commented 7 years ago

Hmm. Well, but couldn't the vector be turned into a hash? Comparing hashes would in any case be faster than comparing vectors. You don't even need to use hashes -- just some unique way of turning that vector into an integer, e.g. just by enumerating all possibilities for that given word.

ampli commented 7 years ago

Hmm. Well, but couldn't the vector be turned into a hash? Comparing hashes would in any case be faster than comparing vectors. You don't even need to use hashes -- just some unique way of turning that vector into an integer, e.g. just by enumerating all possibilities for that given word.

Note that in_same_alternative() doesn't compare whole vectors, but parts of vectors, and the number of vector components that get compared is not known in advance. The comparison stops when two vector components are not equal.

Say you have 3 tokens, A, B and C. A can connect to B and to C, but B cannot connect to C. How do you assign numbers that can resolve that via comparisons?

To see how complex the connectivity rules between tokens can become, consider that every token can split again, and the result can split again, creating a hierarchy of splits (the word-graph). But even a sentence with one level of alternatives (like Russian or Hebrew without spell-correction) has this kind of relation - tokens of an alternative can connect to sentence words, but tokens of one alternative cannot connect to tokens of another alternative if both alternatives belong to the same word, while they can connect to the alternatives of another word. (If this is not clear, try to look at it deeply, and especially think of spell correction that separates words and also gives alternatives, and then all of these tokens get broken into morphemes, each in more than one way.)

To find out if tokens a and b are from the same alternative, the algo looks at their hierarchy position vectors Va and Vb. It compares their components one by one, until they are not equal. If an even number of components are equal, then the tokens can connect; otherwise they cannot.

Usually the position hierarchy vectors have 0 to 4 elements (corresponding to hierarchy depth 0 to 2), so there is not much overhead in their "comparisons". For sentences without alternatives, all the position hierarchy vectors are of length 0. One shortcut that I thought of is to use the same pointer for all equal vectors (like a string-set), because tokens with equal vectors can connect - these vectors always contain an even number of components (tokens with unequal vectors can connect or not, as mentioned above).
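A minimal sketch of that rule, assuming the hierarchy-position components are reduced to plain integers (the real word-graph code walks Gword pointers; this is not the actual in_same_alternative() implementation):

    /* Sketch of the rule described above, not the real in_same_alternative().
     * Two tokens may connect iff the common prefix of their hierarchy
     * position vectors has even length.  With no alternatives both vectors
     * are empty, the common prefix has length 0 (even), and everything may
     * connect, as stated above. */
    #include <stdbool.h>
    #include <stddef.h>

    static bool tokens_may_connect(const int *va, size_t la,
                                   const int *vb, size_t lb)
    {
        size_t common = 0;
        while (common < la && common < lb && va[common] == vb[common])
            common++;
        return (common % 2) == 0;
    }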

linas commented 7 years ago

It's really late at night and I'm about to go to bed, so my reply might be off-the-wall, 'cause I'm not reading or thinking about your code ... but ... hey: the disjunct is a vector. Each connector is a single item in the vector.

I mean, that's kind-of the whole point of all that blather about category theory -- the category of Hilbert spaces allows you to define tensors, where you can contract upper and lower (co- and contra-variant) indexes on vectors and tensors. The LG grammar, and most/all categorial grammars, are quite similar, just enriching the possibilities of how to contract (match) indexes (connectors): you can force contraction to the left or right (Hilbert spaces make no left-right distinction) and you can mark some connectors as optional.

Now, the classic LG connectors and connection rules were fairly specific, but I've enriched them with more stuff, and we can enrich or elaborate further. So, step back, and think in abstract terms: instead of calling them vectors, call them a new-weird-disjunct-thing; each vector-component can be called a new-weird-connector-thing, and then some ground rules: can we have more than one link between two morphemes? what else might be allowed or prohibited, generically, for these new things?

My gut intuition is that developing this abstract understanding clarifies the problem, and the solution, even if the resulting C code ends up being almost the same... and the new abstract understanding might give an idea of a better C implementation.

I'll try a more down-to-earth reply sometime later.

ampli commented 7 years ago

I implemented a not-same-alternative prune in prune.c. I didn't think about efficiency when doing it, so the change is small.

In possible_connection() before easy_match():

    bool same_alternative = false;
    for (Gword **lg = (Gword **)lc->word; NULL != (*lg); lg++) {
        for (Gword **rg = (Gword **)rc->word; NULL != (*rg); rg++) {
            if (in_same_alternative(*lg, *rg)) {
                same_alternative = true;
            }
        }
    }
    if (!same_alternative) return false;

To support that, the word member of Connector (see a previous post) is initialized at the start of power_prune():

    for (w = 0; w < sent->length; w++) {
        for (d = sent->word[w].d; d != NULL; d = d->next) {
            for (c = d->right; NULL != c; c = c->next)
                c->word = d->word;
            for (c = d->left; NULL != c; c = c->next)
                c->word = d->word;
        }
    }

The same linkages are produced, so I guess this indeed doesn't remove anything that is needed. However, surprisingly, batch run times are not reduced. However, debugging shows that the added check returns false on mismatched alternatives (only). Also, the amy test of abcdefghijklmnopqrstuvwxyz got improved. Before:

Found 1766 linkages (162 of 162 random linkages had no P.P. violations)

After:

Found 1496 linkages (289 of 289 random linkages had no P.P. violations)

If this improvement seems worthwhile, I can add efficiency changes to it and send a PR.

ampli commented 7 years ago

The next thing to try is maybe to add such checks to the fast-matcher.

ampli commented 7 years ago

In the previous post I said:

The next thing to try is maybe to add such checks to the fast-matcher.

Ok, I tested this too. It doesn't do anything more than has already been done in prune.c.

There is another constraint on possible connections between alternatives, which is maybe the reason for most of the insane-morphism connections:

A word cannot connect to tokens from different alternatives of another token.

Consider this situation: We have 2 words - A and B. Word B has two alternatives - C and "D E".

A   B
     alt1: C
     alt2: D E

A cannot connect to C and E at the same time.

However, I don't know how to implement an a-priori check for that (in power_prune() or elsewhere) ...

ampli commented 7 years ago

I wrote above:

A word cannot connect to tokens from different alternatives of another token.

This is also just a special case...

It turned out that the complete rule covers the two cases and more.

Here is the complete rule: Two (or more) tokens from different alternatives of the same token cannot have a connection (between themselves or to other tokens) at the same time.

The special case of a connection between tokens in different alternatives is easy to check (and it is what I forbid in possible_connection()). What I don't know is how to implement the check when the connection is not directly between the checked tokens.

ampli commented 7 years ago

I found a way to prune the following too (quoting from my post):

A word cannot connect to tokens from different alternatives of another token.

However, there appear to be no such cases in the current ady/amy dicts.

The third special case, which complements the two special cases mentioned above, is when two tokens from different alternatives each have a connection to a different token. This is the hardest case to test, and I think it can only be tested (in the classic parser) during the counting stage.

So for now I have only implemented the alternatives-compatibility test between two connectors (the quoted code). Here are its results for this sentence:

$ link-parser ady -m
linkparser> Le taux de liaison du ciclésonide aux protéines plasmatiques humaines est en moyenne de 99 %.

Before:

Found 2147483647 linkages (118 of 118 random linkages had no P.P. violations)

After:

Found 2147483647 linkages (345 of 345 random linkages had no P.P. violations)

EDIT: ~(No chance under "amy"...)~

ampli commented 7 years ago

I was too pessimistic. More later.

linas commented 7 years ago

On Sat, Jan 28, 2017 at 5:18 PM, Amir Plivatsky notifications@github.com wrote:

I implemented a not-same-alternative prune in prune.c. I didn't think about efficiency when doing it, so the change is small.

The same linkages are produced, so I guess this indeed doesn't remove anything that is needed.

However, surprisingly, batch run times are not reduced.

Should not be a surprise -- the English batch files generate very few alternatives. The Russian ones generate more, but not overwhelmingly more.

However, debugging shows that the added check returns false on mismatched alternatives (only). Also, the amy test of abcdefghijklmnopqrstuvwxyz got improved. Before:

Found 1766 linkages (162 of 162 random linkages had no P.P. violations)

After:

Found 1496 linkages (289 of 289 random linkages had no P.P. violations)

If this improvement seems worthwhile, I can add efficiency changes to it and send a PR.

Sounds good; I do expect that an 'amy' batch (run against, say, the English batch) would run a lot faster.

linas commented 7 years ago

On Sat, Jan 28, 2017 at 7:17 PM, Amir Plivatsky notifications@github.com wrote:

A word cannot connect to tokens from different alternatives of another token.

Consider this situation: We have 2 words - A and B. Word B has two alternatives - C and "D E".

A   B
     alt1: C
     alt2: D E

A cannot connect to C and E at the same time.

I suspect that a common case is to have an 'ANY' link from A to C and an LL from C to E .. (and then the rest of the sentence connecting to E...)

linas commented 7 years ago

I just tried a quick performance test: standard LG, amy applied to the first 45 or so sentences from en/corpus-basic: 1m26.627s, 1m28.194s

With the patches mentioned above: 1m35.844s, 1m36.712s

So the same-alternative check makes it run more slowly -- that is surprising. Sort of. Maybe. It suggests that the check isn't actually pruning anything at all -- and perhaps that is not a surprise, because the pruning stage is too early.

However, doing the in_same_alternative() check during the actual parsing (i.e. in count.c where the matching is done ... actually, I guess in fast-match.c) -- now there, I expect a big difference, because invalid links will be ruled out.

ampli commented 7 years ago

I said:

I tested this too. It doesn't do anything more than has already been done in prune.c.

I was too pessimistic. More later.

I had a slight bug in doing it... After fixing it, the amy insane morphisms are prevented in advance!

The first-45-sentence test then runs more than 4 times faster (with the fast-matcher patch w/o the power-prune patch). The run times were 9.3 seconds vs 41.7 seconds.

However, doing the in_same_alternative() check during the actual parsing (i.e. in count.c where the matching is done ... actually, I guess in fast-match.c) -- now there, I expect a big difference, because invalid links will be ruled out.

As we see, you are indeed right.

I will implement efficiency changes and will send a PR soon.

ampli commented 7 years ago

can we have more than one link between two morphemes?

This could be helpful in any case. The current limitation seems to me like insisting that numbers always be written as "1+1+..." just because that is enough to express any number.

Now, the classic LG connectors and connection rules were fairly specific, but I've enriched them with more stuff, and we can enrich or elaborate further. So, step back, and think in abstract terms: instead of calling them vectors, call them a new-weird-disjunct-thing; each vector-component can be called a new-weird-connector-thing, and then some ground rules: can we have more than one link between two morphemes? what else might be allowed or prohibited, generically, for these new things?

I'm still thinking on that....

ampli commented 7 years ago

I noted that many words are UNKNOWN_WORD, and when an affix is marked so, this causes null words.

One class of such words is words with punctuation after them. How should this be fixed? Shouldn't everything just be considered an integral part of the word, with punctuation being a kind of "morpheme" with its own rules?

linas commented 7 years ago

From the point of view of the splitter, it would be more convenient if punctuation behaved like morphemes. But from linguistics, we know that punctuation really does not behave like that. I believe we could prove this statistically: the mutual information of observing a word-punctuation pair would be quite low; in other words, there is little correlation between punctuation and the word immediately before (or after) it. The correlation is between the punctuation and phrases as a whole.
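(For concreteness, the quantity meant here is the ordinary pointwise mutual information of a word w and an adjacent punctuation token p; nothing LG-specific is assumed:

    \mathrm{MI}(w, p) \;=\; \log_2 \frac{P(w, p)}{P(w)\, P(p)}

which is near zero when the pair co-occurs about as often as independence would predict.)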

The problem is that ady and amy both insist on placing an LL link between morphemes: by definition, morphemes must have a link between each other, although they can also have links to other words. Treating punctuation like morphemes would make a link between the punctuation and the word mandatory, which is something that statistics will not support.

So, for now, treating punctuation as distinct seems like the best ad-hoc solution. Later, as the learning system becomes more capable, we might be able to do something different: an extreme example would be to ignore all spaces completely... and discover words in some other way. Note that most ancient languages were written without spaces between words, without punctuation, and without upper-lower distinctions: these were typographical inventions, meant to make reading easier. Commas help denote pauses for breathing, or bracket idea-expressions; ? and ! are used to indicate rising/falling tones, surprise.

We continue to innovate typographically :-) For example, emoticons are a markup used to convey emotional state, "out of band": like punctuation, the emoticons aren't really a part of the text: they are a side channel, telling you how the author feels, or how the author wants you to feel. Think musical notation: the musical notes run out-of-band, in parallel to the words that are sung.

ampli commented 7 years ago

Currently words with punctuation after them are "lost" (marked as UNKNOWN_WORD). If it is desired to split off punctuation, I can implement that (using ispunct()). The question is what to do with "words" with internal punctuation, like http://example.com. I think it may be better not to split them. This may include initials that use dots.

Dashes and apostrophes will need to be ignored as punctuation (i.e. considered as part of the word). The same goes for languages that include other punctuation characters as an integral part of words. Maybe a list of such characters can be given in the affix file of amy/ady (but then they will be input-language specific, unless you make a way to select the affix file via an argument, or something similar).

BTW, there is a problem in the current definitions of the affix regexes: words get split to pref=, stem.= and =suf, but one (or more) of the parts is not recognized by the regexes as an affix and is thus classified as UNKNOWN_WORD, leading to null words. This is very common, and it increases the processing time considerably due to the repeated need to parse with nulls. A fix can be implemented to handle punctuation as proposed above, and/or to classify morphemes in the regex file only by their marks (infixmark and stem mark), disregarding their other letters.

linas commented 7 years ago

Currently words with punctuation after them are "lost"

?

The question is what to do with "words" with internal punctuation, like http://example.com. I think it may be better not to split them.

a) There aren't supposed to be any URLs in the text I'm parsing. Unfortunately, the scrubber scripts are imperfect.

b) If you let them split, then basic statistics should very quickly discover that http:// is a "morpheme", i.e. these 7 bytes always co-occur as a unit. Always. If you allow a 3-way split, then the discovery of .com and .org should be straightforward as well. So, yes, URLs have a morphological structure, and actually it is very regular, far more regular than almost any natural language. The structure should be easy to find by statistical analysis. So, split them. ...

For the next few months, I mostly don't care, because performing a morphological analysis of URLs seems a bit pointless, right now.

This may include initials that use dots.

Beats me, random splitting should auto-discover these boundaries. We'll find out in a few months.

Dashes and apostrophes will need to be ignored as punctuation (i.e. considered as part of the word).

That depends on what the morpheme analysis discovers. Random splits+statistics will tell us how.

BTW, there is a problem in the current definitions of the affix regexes: words get split to pref=, stem.= and =suf, but one (or more) of the parts is not recognized by the regexes as an affix and is thus classified as UNKNOWN_WORD, leading to null words.

That sounds like a bug. Do you have an example? I'm not clear on what is happening here.

A fix can be implemented to handle punctuation as proposed above,

No -- for now, it would be best to ignore most punctuation, and handle only a small set of special cases: words that end with a period, comma, semi-colon, colon, question-mark, exclamation point. I think that's it. Everything else should be treated as if it was an ordinary letter, an ordinary part of the word.

The only reason I want to treat these terminal punctuation characters as special is to simplify the discovery process; I know, a priori, that these puncts behave like words, not like morphemes. I'd rather not waste compute power right now on discovering this.

By contrast, the $ in $50 really does behave like a morpheme: it is a modifier of 50, there would be a discoverable link connecting $ to 50, and the automatic random splitting of the string $50 should allow this discovery to happen automatically.
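A minimal sketch of that rule, purely illustrative (the real fix would presumably go through the RPUNC machinery in the affix file, as ampli proposes later in this thread): split off a single terminal character from the set . , ; : ? ! and leave everything else, including strings like $50, as one token.

    /* Illustrative sketch only -- not the actual LG tokenizer.  Split one
     * trailing terminal-punctuation character off a token; anything else
     * (dashes, apostrophes, "$50", ...) stays intact. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool is_terminal_punct(char c)
    {
        return c != '\0' && strchr(".,;:?!", c) != NULL;
    }

    /* Writes the word part into 'word' and the split-off character into
     * 'punct' (empty string if nothing was split). */
    static void split_terminal_punct(const char *tok, char *word, char *punct)
    {
        size_t len = strlen(tok);
        if (len > 1 && is_terminal_punct(tok[len - 1]))
        {
            memcpy(word, tok, len - 1);
            word[len - 1] = '\0';
            punct[0] = tok[len - 1];
            punct[1] = '\0';
        }
        else
        {
            strcpy(word, tok);
            punct[0] = '\0';
        }
    }

    int main(void)
    {
        char word[64], punct[2];
        split_terminal_punct("time;", word, punct);
        printf("'%s' '%s'\n", word, punct);   /* prints: 'time' ';' */
        split_terminal_punct("$50", word, punct);
        printf("'%s' '%s'\n", word, punct);   /* prints: '$50' '' */
        return 0;
    }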

classify morphemes in the regex file only by their marks (infixmark and stem mark) disregarding their other letters.

I guess that sounds reasonable. (except for the fact that we talked about getting rid of these marks?)

ampli commented 7 years ago

Currently words with punctuation after them are "lost"

?

See in this example what happens to "time;" and "however,":

    +------------ANY------------+                                                                             +----------------------ANY----------------------+                               |       
    |            +------LL------+-----ANY----+------ANY-----+-----ANY-----+-------ANY-------+-------ANY-------+             +-------PL------+--------LL-------+                +------LL------+       
    |            |              |            |              |             |                 |                 |             |               |                 |                |              |       
LEFT-WALL s[!MOR-STEM].= =ame[!MOR-SUFF] time;[?] [ho] =wever,[?] after[!ANY-WORD] building[!ANY-WORD] and[!ANY-WORD] i=[!MOR-PREF] nst[!MOR-STEM].= =alling[!MOR-SUFF] th[!MOR-STEM].= =e[!MOR-SUFF] 

BTW, there is a problem in the current definitions of the affix regexes: words get split to pref=, stem.= and =suf, but one (or more) of the parts is not recognized by the regexes as an affix and is thus classified as UNKNOWN_WORD, leading to null words.

That sounds like a bug. Do you have an example? I'm not clear on what is happening here.

In the example above, =wever,[?] is not classified as a suffix, creating a null word ho.

Dashes and apostrophes will need to be ignored as punctuation (i.e. considered as part of the word).

That depends on what the morpheme analysis discovers. Random splits+statistics will tell us how.

But you need to accept the punctuation that you allow as part of words (such as $ - ' etc.) in the MOR- regexes (currently you accept only "-").

classify morphemes in the regex file only by their marks (infixmark and stem mark) disregarding their other letters.

My proposal:

  1. Add RPUNC with comma, period, etc.
  2. Since RPUNC will get separated and will not exist any more in marked morphemes, accept every character in MOR-, as in (ady/4.0.regex):
    SIMPLE-STEM: /=$/;

(except for the fact that we talked about getting rid of these marks?)

I can implement that, but you haven't commented on my proposal for that.

linas commented 7 years ago

See in this example what happens

.. ahh! Yeah, that's a bug. My apologies -- 'amy' was quickly thrown together 3 years ago, and just barely revisited since. So it's buggy/suboptimal.

FWIW, it's taken me the entire last month just to get the counting pipeline running in a stable fashion: it's now running for 24 hours+ without crashing, hanging, blowing up RAM, etc. It might still be generating poor data, but at last it seems stable, so that's forward progress.

My proposal:

  1. Add RPUNC with comma, period, etc.
  2. Since RPUNC will get separated and will not exist any more in marked morphemes, accept every character in MOR-, as in (ady/4.0.regex):

SIMPLE-STEM: /=$/;

yes, sounds good!

(except for the fact that we talked about getting rid of these marks?)

I can implement that, but you haven't commented on my proposal for that.

Ah, I lost track. Which issue #? I guess it's in my email box somewhere...

linas commented 7 years ago

Care to create an "aqy" which does 4-way splits (or 3 or 2...)?

ampli commented 7 years ago

(except for the fact that we talked about getting rid of these marks?)

I can implement that, but you haven't commented on my proposal for that.

Ah, I lost track. Which issue #? I guess it's in my email box somewhere...

It was very recently, but I myself cannot locate it now. Here it is again (maybe in other words, and with additions/omissions... I also combine it with another proposal that you have not commented on):

Tokenize words to just strings, and look them up without adding marks to them. In the dictionary there is more than one option. One of them is to still use a marking (for dictionary readability), where these markings, as said above, will be ignored for lookup (but not for token rendering with !morphology=1).

Inter-word connections will use connectors with a special character to denote that. There may be a need to have a way to specify a null affix, which will get used if an inter-word connector may connect to a null-affix.

Also, more info about tokens can be added outside of their strings, and be used to denote the affix type. This will allow implementing Unification Link Grammar or context-sensitive parsing (which we have never discussed), which in any case need some different dictionary somewhere. To avoid introducing too many changes, I once proposed a two-step lookup, where you first look up the token in a special dictionary to find some info about it, and consult the link-grammar dictionary (possibly using another string or even several strings) only to find the disjuncts.

ampli commented 7 years ago

Care to create an "aqy" which does 4-way splits (or 3 or 2...)?

OK.

Here is a summary of my proposal for that (assuming the current morpheme marking): For 2 or 3 parts: the same marking as now. For 4 and up: pref= stem.= =midd1.= =midd2.= ... =suff

Is this fine for links?

    +-----------------------------ANY------------------------------+
    |                                                              |
    |             +-------PL------+-------LL------+-------LL-------+
    |             |               |               |                |
LEFT-WALL abc=[!MOR-PREF] de[!MOR-STEM].= =ef[!MOR-MIDD].= =gh[!MOR-SUFF]
linas commented 7 years ago

Tokenize words to just strings, and look them up without adding marks to them. In the dictionary there is more than one option. One of them is to still use a marking (for dictionary readability), where these markings, as said above, will be ignored for lookup (but not for token rendering with !morphology=1).

Sure, at this level, sounds reasonable.

Inter-word connections will use connectors with a special character to denote that.

I think you mean "intra". Intra-word would be "within the same word"; inter-word would mean "between two different words."

So right now, in Russian, LLXXX is an intra-word connector.

There may be a need to have a way to specify a null affix, which will get used if an inter-word connector may connect to a null-affix.

Yes, probably needed, based on experience with Russian. I have no clue yet how that will be auto-learned.

Also, more info about tokens can be added outside of their strings, and be used to denote the affix type. This will allow implementing Unification Link Grammar (https://github.com/opencog/link-grammar/issues/280)

At the naive level, this seems reasonable, but a moment of reflection suggests that there's a long and winding road in that direction.

or context-sensitive parsing (which we have never discussed),

Heh.

which in any case need some different dictionary somewhere. To avoid introducing too many changes, I once proposed a two-step lookup, where you first look up the token in a special dictionary to find some info about it, and consult the link-grammar dictionary (possibly using another string or even several strings) only to find the disjuncts.

what sort of additional info is needed for a token? What do you envision?

linas commented 7 years ago

On Fri, Feb 3, 2017 at 10:27 PM, Amir Plivatsky notifications@github.com wrote:

Care to create an "aqy" which does 4-way splits (or 3 or 2...)?

OK.

Here is a summary of my proposal for that (assuming the current morpheme marking): For 2 or 3 parts: the same marking as now. For 4 and up: pref= stem.= =midd1.= =midd2.= ... =suff

Is this fine for links?

+-----------------------------ANY------------------------------+
|                                                              |
|             +-------PL------+-------LL------+-------LL-------+
|             |               |               |                |

LEFT-WALL abc=[!MOR-PREF] de[!MOR-STEM].= =ef[!MOR-MIDD].= =gh[!MOR-SUFF]

Yes, I think so. It might be better to pick more neutral, less suggestive names, like "MOR-1st" "MOR-2nd" "MOR-3rd", or maybe MORA MORB MORC, or something like that. The problem is that "STEM" has a distinct connotation, and we don't yet know if the second component will actually be a stem, or perhaps just a second prefix.

--linas

ampli commented 7 years ago

I think you mean "intra". Intra-word would be "within the same word"; inter-word would mean "between two different words."

Yes, I meant "intra".

what sort of additional info is needed for a token? What do you envision?

First, I admit that everything (even whole dictionaries...) can be encoded into the subscript. Moreover, everything can even be encoded into the token strings (the infix marks are such a kind of thing). But I don't find any reason why this should be considered good practice.

In addition, everything needed for readability can be done using % comments. In my proposal not to use infix marks for dict lookups, I mentioned the option to leave them in the dict tokens for readability but ignore them in the dict read. Even this is not needed, as it can be mentioned in a comment such as:

% xxx=
xxx: ...

But this, for example, prevents consistency checking at dict read time (e.g. such a token must have a right-going connector, or even (if implemented) an "intra-word" right-going connector). So this is a bad idea too.

Examples:

  1. Marking regex words in the dict (like NUMBER) so they will not get matched by sentence tokens. Again, this can be hacked in the current dict format by, e.g., starting it with a dot (.NUMBER) or using a special first subscript letter.
  2. An indication that says a regex should be applied to pop up a "vocal" morpheme, in order to implement phonetic agreement (like a/an) in a trivial way (an old proposal of mine). However, this exact thing can be implemented using a different regex-file syntax.
  3. An indication that a token needs one or more null-morphemes to be popped up before or after it. This indication can also contain more info on these null-morphemes (e.g. for display purposes, but also which kind of null morphemes to fetch from the dict in case there are several kinds).
  4. An indication that a dict word is not a string that is part of a sentence token, but is a morpheme template (to be used for non-concatenative languages).
  5. Info on corrections.
  6. Info on how to split words on rare occasions. For illustration:

an: a n;
a: ...;
n.an: ...;

  7. More things that I cannot recall just now.

Yes, I think so. It might be better to pick more neutral, less suggestive names, like "MOR-1st" "MOR-2nd" "MOR-3rd", or maybe MORA MORB MORC, or something like that. The problem is that "STEM" has a distinct connotation, and we don't yet know if the second component will actually be a stem, or perhaps just a second prefix.

A totally different alternative is not to use infix marking, but to encode it in the subscript (by anyfix.c). For the example I gave: abc.1 de.2 ef.3 gh.4. It has the benefit of changing only REGPARTS, without a need to change anything else (same dict/regex/affix files).