opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

parsing performance is getting worse... #250

Closed linas closed 4 years ago

linas commented 8 years ago

Amir: consider this:

time echo "In vivo studies of the activity of four of the kinases, KinA, KinC, KinD (ykvD) and KinE (ykrQ), using abrB transcription as an indicator of Spo0A~P level, revealed that KinC and KinD were responsible for Spo0A~P production during the exponential phase of growth in the absence of KinA and KinB." | link-parser ./data/en -timeout=300 

How long does this take? With today's 5.3.3 parser and 5.3.3 dicts: 48 seconds. Some other times:

5.3.3 parser, 5.3.3 dict: 48 sec
              5.2.5 dict: 48 sec
              5.1.3 dict: 44 sec
              4.8.6 dict: 22 sec

i.e. all of the above with today's parser, but other dicts.

But the 4.8.6 parser with the 4.8.6 dicts parsed this sentence in 8 seconds. So, despite performance improvements, things have gotten a lot worse.

Some of this I understand: changes to the dicts do add more possibilities. Some I don't: why is the 5.3.3 parser 4x slower than the 4.8.6 parser, when both use the same dict? According to my notes, I once parsed the above in under 4 seconds, using 5.1.0 ... so .. wtf

ampli commented 8 years ago

Hi Linas,

Consider that the default constraint settings are in favor of the old version. In 4.8.6 we have: cost-max 2.00, short 10. While in 5.3.3 we have: cost-max 2.70, short 16.

Besides this, we must disable spelling in such tests, since different versions handle it differently (different alternatives may be generated).

Also, the latest improvements concentrated on batch processing.

So I suggest running the benchmark with similar parameters: link-parser DICT -spell=0 -short=10 -max-cost=2 -echo -batch -timeout=300 (Or: link-parser DICT -spell=0 -short=16 -max-cost=2.7 -echo -batch -timeout=300)

But even with this (-short=10 -max-cost=2) we get a large performance difference in favor of the old versions (using the 4.8.6 dict): 4.8.6: 1.31s 5.3.3: 6.11s

Using -verbosity=5 reveals a possible problem in 5.3.3:

power prune cost: 9427707
++++power pruned                            4.76 seconds

While in 4.8.6:

2858193 power prune cost:
++++power pruned (ruthless)                 0.63 seconds

We need to do bisections to find out when this apparent pruning problem happened.

Also note that in 5.3.3 sentences are often split into more tokens, and as we know, the parsing algorithm is very sensitive to the number of tokens. This extra splitting happens because the unit-split tokenizing is very aggressive: if a word consists entirely of units, it breaks it up even without a root number. In our case it breaks (in 2 places) the word KinD into 3 tokens K in D. Even words like "As" and "sin" (not found in this sentence) get split into A s and s in, respectively. This may also allow more "valid" linkages, even though I don't know of any such examples. I also don't recall having examples in our corpus that need such an aggressive split. We may try to ifdef it out.

ampli commented 8 years ago

I forgot to add: Before doing the check above, I added EMPTY-WORD.zzz: ZZZ-; to the 4.8.6 dict. This is a must for library versions 5.3.x.

ampli commented 8 years ago

More info on the problem: This is a good version: 908507b54ba44879de18945d219836c66f1c8a53

Here an inefficiency of the pruning (for this test sentence) begins: 3bce1cf0bf67b1225ef7779eb984631424db7c6f The reason is a more aggressive unit split (still with no EMPTY-WORD). On our test sentence, the said power prune step time increased from ~0.7 sec to ~1.05 sec.

Here the pruning of this sentence becomes a total time hog (~5s): f73fa61cb720830d1da3de37c702e62f0888a9dd

For some reason the power pruning misbehaves when it encounters empty words.

linas commented 8 years ago

Here's why power-pruning misbehaves with empty words: it's the length limit. We know that an empty word might connect to its neighbor, but no farther. The pruner has to assume it may connect far away, i.e. hop over words. This suggests that we need a way to tell the pruner that there are links -- the morphology links, the empty-word link, and the phonology links -- that can ONLY connect to nearest neighbors. They're like "short links", but super-short links.

linas commented 8 years ago

Looks like it's easy to hack a test: go into preparation.c and, if link-type=ZZZ, set c->length_limit = 1;

ampli commented 8 years ago

Having a flag to enable/disable aggressive splitting is OK. If a word is already in the dict, it should not be split. What we need is an indicator: "this word is a common word; don't split it" vs. "this word is an unusual word; do split it".

I also think that splitting words into 1-letter unit-name parts, especially when there is no root number, is not constructive and can be eliminated. This would prevent most of these splits.

Regarding the length limit: As a test I tried just now to set it to 1 for all ZZZ links. This didn't help even a little...

I validated with the following printout that this setting is used by the power prune:

static bool possible_connection(prune_context *pc,
    ...
    dist = rword - lword;
    if (lc->length_limit == 1 || rc->length_limit == 1) printf("ZZZ %d\n", dist); <---------------
    if (dist > lc->length_limit || dist > rc->length_limit) return false;

What can I try next?

Profiling shows the 5 sec of the power pruning for our test sentence is inside the function right_table_search().

(Regarding the morphology links: several days ago I tested setting their distance to 1. It doesn't reduce the pruning time in any measurable way. I also validated that even without that the parser never actually tries to match morphology links with distance > 1.)

linas commented 8 years ago

I did this and it didn't help:

--- a/link-grammar/preparation.c
+++ b/link-grammar/preparation.c
@@ -35,6 +35,7 @@ set_connector_list_length_limit(Connector *c,
                {
                        c->length_limit = short_len;
                }
+if (0 == strncmp(c->string, "ZZZ", 3)) { c->length_limit = 1; }
        }
 }

linas commented 8 years ago

Maybe c->length_limit = 1; didn't make a difference, because prune would mark these as "shallow" anyway.

ampli commented 8 years ago

Maybe I found the problem. It seems the pruning length_limit check is done at too deep a level (actually at the maximum depth) to be of any use in saving CPU time.

I tried doing it one level higher, and this seems to eliminate most of the overhead. It seems it can be done at an even higher level.

static bool
right_table_search(prune_context *pc, int w, Connector *c,
                   bool shallow, int word_c)
{
    unsigned int size, h;
    C_list *cl;
    power_table *pt;

    if (word_c-w > c->length_limit) return false; // <--------------

    pt = pc->pt;
    size = pt->r_table_size[w];
    h = connector_hash(c) & (size-1);
    for (cl = pt->r_table[w][h]; cl != NULL; cl = cl->next)
    {
        if (possible_connection(pc, cl->c, c, cl->shallow, shallow, w, word_c))
        {
            return true;
        }
    }
    return false;
}

/**
 * This returns TRUE if the left table of word w contains
 * a connector that can match to c.  shallow tells if c is shallow
 */
static bool
left_table_search(prune_context *pc, int w, Connector *c,
                  bool shallow, int word_c)
{
    unsigned int size, h;
    C_list *cl;
    power_table *pt;

    if (w-word_c > c->length_limit) return false; // <--------------

    pt = pc->pt;
    size = pt->l_table_size[w];
    h = connector_hash(c) & (size-1);
    for (cl = pt->l_table[w][h]; cl != NULL; cl = cl->next)
    {
        if (possible_connection(pc, c, cl->c, shallow, cl->shallow, word_c, w))
            return true;
    }
    return false;
}

Another observation is that all of this pruning stuff has vast optimization opportunities.

linas commented 8 years ago

well, it has easy_match still in it... :-)

linas commented 8 years ago

OK, I just pushed the length-limit test .. however, there still is a question -- why did the empty word cause the size of those tables to blow up?

ampli commented 8 years ago

Regarding length-limit, it seems it would be even more effective to make this check in left_connector_update() / right_connector_update(). And then we can repeat the test of setting the LL* links to 1.

Regarding the big initial table size, my guess for now is that the expression prune is not effective at all in the presence of ZZZ-. I still need to check why.

BTW, the tables in the last pruning step, with and without ZZZ- in the sentence, are about the same size (to eliminate ZZZ- I changed both occurrences of KinD to KinxD).

[I will continue with these checks in about 2 hours.]

linas commented 8 years ago

OK. I just moved the check to right_connector_update() etc

This is bizarre/unexpected: adding if (0 == strncmp(c->string, "ZZZ", 3)) { c->length_limit = 1; } back into preparation.c now causes: No complete linkages found.

ampli commented 8 years ago

The check there should be different. More on that later.

linas commented 8 years ago

OK, I'm done for now; I have other things I'm supposed to be doing. BTW, these changes make an almost imperceptible difference to the run-time of corpus-fixes.batch.

linas commented 8 years ago

Oh, BTW, here's one that used to run in 145 seconds (not sure which version, maybe 4.8.6???) but now takes 7+ minutes.

Cortes in his various letters again and again claims the Emperor's patronage of his bold defiance of the Emperor's officers on the ground that the latter in their action were moved solely by considerations of their personal gain, whereas he, Cortes, was striving to endow his sovereign with a rich new empire and boundless treasure whilst carrying into the dark pagan land, at the sword's point, the gentle creed of the Christian God.

linas commented 8 years ago

also, this statement does not make sense:

BTW, the tables in the last pruning step, with and without ZZZ- in the sentence, are about the same size

When I make the KinD -> KinxD substitution, the run-time gets 2x faster. That means these tables must be shorter, right?

linas commented 8 years ago

-v=3 prints this message (with ZZZ links):

Info: sane_morphism(): 828 of 1000 linkages had invalid morphology construction

which is very unexpected to me -- this suggests the ZZZ links are failing to be length=1

ampli commented 8 years ago

Oh, BTW, here's one that used to run in 145 seconds (not sure which version, maybe 4.8.6???) but now takes 7+ minutes.

This is due to the last commit (moving the check to right_connector_update() etc.). It also takes a long time for me with that change. Without this last commit it takes ~8 seconds for me (5.3.3 with the 4.8.6 dict), which is also not good enough. I will try to find the problem with this last commit, and in general.

also, this statement does not make sense:

BTW, the tables in the last pruning step, with and without ZZZ- in the sentence, are about the same size

When I make the KinD -> KinxD substitution, the run-time gets 2x faster. That means these tables must be shorter, right?

I printed the table sizes, and they are about the same in their final form. However, with ZZZ the prune step takes much more CPU time.

I guess that fixing the prune slowness due to ZZZ will improve the speed of Russian too, which has always been relatively slow.

ampli commented 8 years ago

As a test, I set the length_limit of ZZZ and LL* to 1 before expression_prune(), and modified it to take that into account (a small modification).

As usual, I produced detailed linkage output (with -links=10000) of "en" basic and fixes and "ru" basic corpuses. Here are the results:

Corpus               Diffs
en basic             No
en fixes             One sentence only [*]
ru basic             Only in some sentences with no complete linkages
test sentence [**]   Cannot compare - combinatorial explosion with randomly selected linkages

I then ran a benchmark (30 runs, as 6 times set of best-of-5) using the current dict. The results are astonishing.

Corpus               Before   After   % Faster
en basic             4.61     4.5     3
en fixes             17.2     16.7    3
ru basic             13.9     7.52    46
test sentence [**]   10.6     7.69    60

[*] Mr. Johnson, who was working in his field that morning, said, "The alien spaceship appeared right before my own two eyes."
[**] In vivo studies ... (using -short=20, as in fix-long, and of course -spell=0).

Questions:

  1. Why is power_prune() apparently so ineffective? Do these results hint that it has a problem?
  2. Why is length_limit currently set only before power_prune()? It seems that pruning the connectors that cannot match due to length_limit earlier - in expression_prune() - has the benefit of converting much shorter expressions to disjuncts, so that a lighter power_prune() may result. Did I miss something?
  3. It may be interesting to investigate the difference in sentence [*]. This sentence has double-quotes (has a ZZZ- connector). It also has ZZZ- at word 2: word1: Mr. alt0: Mr. ∅ alt1: Mr .

How can we proceed from here?

BTW, some efficiency improvements can be done in expression_prune(), but I have a more radical idea regarding it and power_prune() that may be worth investigating - for another issue.

linas commented 8 years ago

With the current dicts, "alien spaceship" should not use ZZZ for the quotes: I get this:

Found 20996 linkages (137 of 493 random linkages had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 6.10 LEN=106)

    +------------------------------------------>WV--------------------------------------
    |               +-----------------------------------Ss------------------------------
    |               |        +---------------------------Xc---------------------------+ 
    |               |        |     +--------------------MVpn--------------------+     | 
    +-------Wd------+--MX*r--+     +-------MVp------+----Js---+                 |     | 
    |       +---G---+     +Xd+-Ss*w+---Ost--+       |   +Ds**c+        +---DTn--+     | 
    |       |       |     |  |     |        |       |   |     |        |        |     | 
LEFT-WALL Mr..x Johnson.m , who was.v-d working.g in.r his field.n that.j-r morning.r , 

--->+-----------------------------------------QUc----------------------------------------+
----+----------------------------------------Xi----------------------------------------+ |
    +----------------->WV----------------->+                                           | |
    +------------Wd-----------+            |                 +----------Jp---------+   | |
    +--QUd-+  +-----Ds**x-----+            +-------MVp-------+     +----DD---+     |   | |
    +-Xc-+ |  |     +----AN---+-----Ss-----+---MVa---+       |     +-La-+    +-Dmcn+   | |
    |    | |  |     |         |            |         |       |     |    |    |     |   | |
said.q-d , " the alien.n spaceship.n appeared.v-d right.e before my.p own.a two eyes.n . " 

Note the QUd and QUc links to the quotes.

linas commented 8 years ago

re strange power-prune: back in the bad old days of fat links, these routines were much more complicated, as they had to prune portions of a sentence but not all of it, in strange ways. Now that the code is simpler, you are probably seeing left-overs from that more complicated design. For example, length pruning would not be possible in expressions, due to fat links, because you didn't know how long a link was until after you hopped over the fat parts of the sentence. Now, you can do this ... and should.

linas commented 8 years ago

Due to the simplification of removing the fat links, there may be some fair amount of other simplification that could be done.

p.s. I plan to publish 4.3.3 in a few days or a week, for the OSX build bug

ampli commented 8 years ago

With the current dicts, "alien spaceship" should not use ZZZ for the quotes: I get this:

But most of these links in this sentence are currently still ZZZ. I used -links=30000 and got 10550 linkages (the rest have "invalid morphology"). However, due to DISPLAY_MAX 1024, only 1024 linkages are displayed. Of those, in 894 the quotes appear with ZZZ, and only in 130 do they appear with QUd.

Even though there is a diff in the output of these 1024 linkages after my expression_prune() length_limit modification, for a reason I still need to investigate, the number of sentences with ZZZ and with QUd is exactly the same.

Regarding possible speed-ups in expression_prune() - there are two that are easy to implement (a sketch of the first one follows the list):

  1. The current code allocates a "dummy" connector struct before trying to insert it into the set, even though most of the time the connector is already in the set - it can instead be allocated only when not already in the set.
  2. The connector structs can be recycled instead of deallocated/allocated.
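
A minimal sketch of the first idea. The toy linked-list set, the entry struct, and the function names below are illustrative stand-ins, not the actual expression_prune() internals; the point is only the order of lookup vs. allocation:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the connector set built during expression pruning;
 * the real code uses a hash table, this only shows the allocation order. */
typedef struct entry { const char *string; struct entry *next; } entry;

static bool set_contains(const entry *head, const char *s)
{
    for (const entry *e = head; e != NULL; e = e->next)
        if (strcmp(e->string, s) == 0) return true;
    return false;
}

/* Check membership first; allocate only for genuinely new connector strings.
 * The current order (allocate a dummy, try to insert, discard on duplicate)
 * pays for an allocation on every lookup, which is the common case. */
static void set_add(entry **head, const char *s)
{
    if (set_contains(*head, s)) return;   /* common case: no allocation at all */

    entry *e = malloc(sizeof(*e));
    e->string = s;
    e->next = *head;
    *head = e;
}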

Now, you can do this ... and should.

I can clean up my addition of that for a pull request. However, I'm not sure how to cleanly set the length_limit. It is not possible to do it the way it is done before power_prune(), because at that point we still have no connectors in which to set the length_limits. So I implemented it in insert_connectors():

if (string_set_cmp(ZZZ, e->u.string) || 0 == strncmp(e->u.string, "LL", 2))
    dummy->length_limit = 1;

However, adding the short lengths this way will have much more overhead. Please advise...

Just a note: It could be that in some languages LL links will not be able to have a length of 1, so we may need some way to indicate their maximum supported length_limit. A problem can happen even in languages in which the LL links are between a stem and suffixes, if the word can also be broken into alternatives: empty words can be inserted between the stem and the suffix, and this is "uncontrolled" since it may be done by the flattening code. This needs to be investigated, maybe also by adding code to check for it. (If this happens, maybe the pruning code can be changed to ignore empty words when checking the length limit - this needs to be done only if a very short length_limit is violated, so it is not much overhead.)
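
A sketch of that last fallback idea, assuming a hypothetical is_empty_word() predicate (not an actual library call): only when a very short length_limit appears to be violated, recount the distance while skipping empty words.

#include <stdbool.h>

/* Hypothetical predicate: true if word w of the sentence is an empty word. */
extern bool is_empty_word(const void *sent, int w);

/* Accept the connection if the words are within the limit outright, or if
 * they would be once intervening empty words are ignored.  The recount runs
 * only on apparent violations, so the extra overhead is small. */
static bool within_length_limit(const void *sent, int lword, int rword, int limit)
{
    int dist = rword - lword;
    if (dist <= limit) return true;             /* fast path: plainly within limit */

    int effective = 1;                          /* the link itself spans at least 1 */
    for (int w = lword + 1; w < rword; w++)     /* intervening words only */
        if (!is_empty_word(sent, w)) effective++;

    return effective <= limit;
}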

A better way may be to do that just after reading the dictionary, and create "connector descriptors" with all sorts of precomputed information, like the hash value, whether they are in the unlimited set, the connector length-limit type (all of which is now computed per sentence), and connector enumeration values. The expressions will then use a pointer to a connector descriptor instead of to a connector string. Similarly, the connector structures will have pointers to the connector descriptors. Is this a good idea?

BTW, it will be possible to reduce the connector structs to only 16 bytes if, instead of this connector descriptor pointer, they include a uint16_t index into the connector descriptor table, and tableNext is eliminated (moved to the connector table, or eliminated altogether due to the use of arrays).
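
For concreteness, here is a rough sketch of what such a per-dictionary descriptor table and a shrunken connector might look like; the names and field choices are mine, not the library's actual layout:

#include <stdbool.h>
#include <stdint.h>

/* One entry per connector type, built once after reading the dictionary. */
typedef struct
{
    const char *string;           /* connector name, from the string-set */
    uint32_t    hash;             /* precomputed hash value */
    int         length_limit;     /* e.g. 1 for ZZZ and LL*, otherwise "unlimited" */
    bool        in_unlimited_set; /* member of the unlimited-connectors set? */
    uint16_t    enum_value;       /* connector enumeration value */
} condesc_t;

/* A per-sentence connector could then drop the string pointer and tableNext,
 * referring to its descriptor by a small index instead. */
typedef struct
{
    uint16_t desc_idx;            /* index into the condesc_t table */
    uint8_t  multi;               /* multi-connector flag */
    uint8_t  nearest_word;        /* ...remaining small per-connector fields */
} small_connector_t;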

Alternatives are currently not considered while pruning (will open another issue).

Last thing for now (maybe need to open an issue for that): Why does the SAT parser use its own prune code?

linas commented 8 years ago

However, adding this way the short lengths will have much more overhead.

Sounds like a bad idea, then, we should not do it.

just after reading the dictionary, and create "connector descriptors"

Ohh! that sounds like a good idea! I like it!

uint16_t index of the connector descriptor table,

Sure, why not. How many types of connectors does Russian have? If connectors are auto-genned for Hebrew, might there ever be more than 65K of them?

Why does the SAT parser use its own prune code?

Don't know. It was written before fat links were removed; the fat-link stuff just made everything excessively complex.

ampli commented 8 years ago

just after reading the dictionary, and create "connector descriptors"

Ohh! that sounds like a good idea! I like it!

I will start a branch to implement it (it seems straight forward).

However, adding this way the short lengths will have much more overhead.

Sounds like a bad idea, then, we should not do it.

I will then be able to implement the above in a good way.

How many types of connectors does russian have?

About 4200.

If connectors are auto-genned for Hebrew, might there ever be more than 65K of them?

I haven't thought yet about auto-generating links for Hebrew, but in any case I don't expect any large quantity of connector types.

linas commented 8 years ago

There might be more than 65K link types for the language-learning code. Not sure. Initially, each unique word-pair gets a unique link type. These are then supposed to be clustered together into much, much smaller sets, but I'm not sure at which point they get pushed back into LG. Whatever; it's in the future.

ampli commented 8 years ago

Why does the SAT parser use its own prune code?

Don't know. It was written before fat links were removed; the fat-link stuff just made everything excessively complex.

If the prune code in the main library gets improved, it would be a good idea to remove the prune code from the SAT parser (maybe it would be a good idea to do that even before then...).

linas commented 8 years ago

Twiddling SAT seems like a low priority. I glanced at the code; it wasn't a cut-n-paste, but was doing some kind of SAT-specific stuff.

ampli commented 8 years ago

The variable holding the connector number can be defined as a typedef, so it can be int if needed.
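
For example (an illustrative name, not the one used in the library):

#include <stdint.h>

/* Width of a connector-descriptor index; widen to uint32_t or int if a
 * dictionary ever needs more than 64K connector types. */
typedef uint16_t condesc_idx_t;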

linas commented 8 years ago

yes, please!

linas commented 7 years ago

FWIW: 30 Jan 2017: on "fanny" (I believe the same machine as in the above posts), the top-most parse takes 9 seconds with version pre-5.3.15, where the EMPTY_WORD device has been removed.

Version 5.3.14 gives about 10.5 seconds. Note that the above pre-5.3.15 also has the string-set hashing fix, which was measured to give a 10% or 15% performance improvement on the batch files, so removal of the empty-word seems to have had no effect on performance!(?)

ampli commented 7 years ago

so remove of empty-word seems to have had no effect on performance!(?)

Indeed there is no performance gain in the classic parser from eliminating EMPTY_WORD. (The added code for that in the sat-parser even has a negative effect on its performance, but I haven't yet tried to investigate that.)

ampli commented 6 years ago

@linas, Here is a summary of this issue.

The apparently longer times for the demo sentence:

  1. The long times are due to spelling differences. When they cause null links, the parse time becomes much longer because the parse is tried several times (with an increased null-count each time).
  2. The dict also got more complicated over time (which causes parses to take slightly longer). So the comparisons I made just now were using the 4.8.6 dictionary with the newest LG library.
  3. I also used -short=10 -cost-max=2, the defaults of 4.8.6 (in order to compare to its time).
  4. The remaining time difference is due to the aggressive unit split, which adds more tokens in this sentence. Fixing it (unpublished) significantly reduces the parse time.

The result is that in 5.4.3 this sentence is actually parsed in less time than in 4.8.6 (less than 1 second for me, whereas in 4.8.6 it is slightly more than 1.1 seconds), with no null links. If you get much more than that (as indicated in your last post), I believe it is due to not disabling spelling, which in this case causes linkages with null links that take much time to parse.

Other things that are mentioned in this issue:

  1. The aggressive unit separation that causes redundant tokens (as in issue #591).
  2. The connector enumeration idea, which I have already implemented; it indeed makes the parsing much faster, partly through its ability to reduce the connectors to 32 bytes (I intend to try a greater reduction too).
  3. The setting of the ZZZ and morphology links' length_limit to 1. It didn't seem to have an impact on the linkages because we always tried it on linkages with null_count=0. Setting ZZZ to length_limit=1 (as is done now) causes bogus linkages when there are null links. And setting morphology links to length_limit=1 makes linkages with null links much faster.

So I think this issue can be closed after the connector-enumeration version (which also fixes (3) above) is merged, supposing (1) is fixed by then (e.g. according to my proposal in issue #591) or deemed not to be worth fixing.

linas commented 6 years ago

Yes, OK. "fanny" is an old machine. For the version 5.4.3 parser and dicts, after subtracting 0.3 seconds needed to load the dictionary itself, I get:

12.9 seconds, no flags
 8.1 seconds for -spell=0
 4.4 seconds for -short=10 -cost-max=2
 3.4 seconds for -spell=0 -short=10 -cost-max=2
 2.3 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts with the 5.4.3 parser
 2.5 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts + 4.8.6 parser

So yes, the 5.4.3 parser is now faster than the 4.8.6 parser, when both use the same dict version.

ampli commented 6 years ago

2.3 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts with 5.4.3 parser

You will get even less CPU time if you replace the D in KinD with e.g. X (in both its occurrences), to avoid the aggressive unit splitting of this word (into K in D).

linas commented 6 years ago

KinD

Well, that's interesting. I replaced KinA by A, KinB by B, and so on, and it prints "No complete linkages found." very quickly, and then takes 3+ minutes -- the null count gets up to 10 or something like that. We seem not to have a max null-count option.

ampli commented 6 years ago

You removed it some time ago. I guess its main usefulness is to limit the CPU waste on badly-unparsable sentences. If we add it again, I suggest adding a message like "No linkages can be found with %d nulls or less." if the max null-count limit is exceeded.

linas commented 6 years ago

You have removed it some time ago.

Oops! See, I make mistakes! Just in case you weren't sure.

ampli commented 6 years ago

At least here it is my mistake: looking at the initial source code, this option was missing there too. (What you actually removed is the null-block option, which is fine and not related.)

I can add support in link-parser for a max null count as follows:

  1. Make null an `int` option that will denote the maximum number of null words.
  2. Give it a reasonable default (in link-parser), e.g. 7.
  3. The error message in case of no linkages with a limited null count will be "No linkages found within %d null links." (with an unlimited null count it will be like now: "No linkages found."). Maybe the following can be added too: "Try to set the "null" variable to a higher value."
  4. The upper value will be truncated to 255, which will denote "unlimited".

However, I'm not sure about its usefulness, since for parsing a large number of sentences I guess people would use their own program, and the library already supports max_null.
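
For reference, a minimal sketch of how a client program can already cap the null count through the public library API (the sentence text and the limit of 7 are just placeholders):

#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Dictionary dict = dictionary_create_lang("en");
    Parse_Options opts = parse_options_create();

    /* Allow parses with up to 7 null words; anything needing more is given up on. */
    parse_options_set_min_null_count(opts, 0);
    parse_options_set_max_null_count(opts, 7);

    Sentence sent = sentence_create("This is a test sentence.", dict);
    sentence_split(sent, opts);
    int num_linkages = sentence_parse(sent, opts);
    printf("Found %d linkages within the null-count limit.\n", num_linkages);

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}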

linas commented 4 years ago

closing