[Closed] linas closed this issue 4 years ago.
Hi Linas,
Consider that the default constraints setup is in favor of the old version. In 4.8.6 we have `cost-max 2.00`, `short 10`, while in 5.3.3 we have `cost-max 2.70`, `short 16`.
Besides this, we must disable spelling in such tests, since different versions handle it differently (different alternatives may be generated).
Also, the latest improvements concentrated on batch processing.
So I suggest running the benchmark with similar parameters:
```
link-parser DICT -spell=0 -short=10 -max-cost=2 -echo -batch -timeout=300
```
or:
```
link-parser DICT -spell=0 -short=16 -max-cost=2.7 -echo -batch -timeout=300
```
But even with this (`-short=10 -max-cost=2`) we get a large performance difference in favor of the old versions (using the 4.8.6 dict): 4.8.6: 1.31s; 5.3.3: 6.11s.
Using `-verbosity=5` reveals a possible problem in 5.3.3:

```
power prune cost: 9427707
++++power pruned 4.76 seconds
```

While in 4.8.6:

```
power prune cost: 2858193
++++power pruned (ruthless) 0.63 seconds
```
We need to do a bisection to find out when this apparent pruning problem was introduced.
Also note that in 5.3.3 sentences are often split into more tokens, and as we know, the parsing algorithm is very sensitive to the number of tokens.
This extra splitting happens because the unit-split tokenizing is very aggressive: if a word consists entirely of units, it gets broken up even without a root number. In our case it breaks (in 2 places) the word `KinD` into 3 tokens: `K in D`. Even words like `As` and `sin` (the latter not found in this sentence) get split into `A s` and `s in`, respectively. This may also cause more "valid" linkages, even though I don't know of any such examples. I also don't recall having examples in our corpus that need such an aggressive split. We may try to `#ifdef` it out.
I forgot to add: before doing the check above, I added `EMPTY-WORD.zzz: ZZZ-;` to the 4.8.6 dict. This is a must for library versions 5.3.x.
More info on the problem: This is a good version: 908507b54ba44879de18945d219836c66f1c8a53
Here an inefficiency of the pruning (for this test sentence) begins: 3bce1cf0bf67b1225ef7779eb984631424db7c6f. The reason is a more aggressive unit split (still with no EMPTY-WORD). On our test sentence, the said power-prune step time increased from ~0.7s to ~1.05s.
Here the pruning of this sentence becomes a total time hog (~5s): f73fa61cb720830d1da3de37c702e62f0888a9dd
For some reason the power pruning misbehaves when it encounters empty words.
Here's why power-pruning misbehaves with empty words: it's the length limit. We know that an empty word might connect to its neighbor, but no farther. The pruner, however, has to assume it may connect far away, i.e. hop over words. This suggests that we need a way to tell the pruner that certain links (the morphology links, the empty-word link, and the phonology links) can ONLY connect to nearest neighbors. They're like "short" links, but super-short.
Looks like it's easy to hack up a test: go into preparation.c and, if the link type is ZZZ, set `c->length_limit = 1;`.
Having a flag to enable/disable aggressive splitting is OK. If a word is already in the dict, it should not be split. What we need is an indicator: this word is a common word, don't split it; this word is an unusual word, do split it.
I also think that splitting words into 1-letter unit-name parts, especially when there is no root number, is not constructive and can be eliminated. This would prevent most of these splits.
Regarding the length limit: As a test I tried just now to set it to 1 for all ZZZ links. This didn't help even a little...
I validated with the following printout that this setting is used by the power prune:

```c
static bool possible_connection(prune_context *pc,
...
	dist = rword - lword;
	if (lc->length_limit == 1 || rc->length_limit == 1) printf("ZZZ %d\n", dist); /* <-- added */
	if (dist > lc->length_limit || dist > rc->length_limit) return false;
```
What can I try next?
Profiling shows that the 5 seconds of power pruning for our test sentence are spent inside the function right_table_search().
(Regarding the morphology links: several days ago I tested setting their distance to 1. It doesn't reduce the pruning time in any measurable way. I also validated that even without that the parser never actually tries to match morphology links with distance > 1.)
I did this and it didn't help:

```diff
--- a/link-grammar/preparation.c
+++ b/link-grammar/preparation.c
@@ -35,6 +35,7 @@ set_connector_list_length_limit(Connector *c,
 		{
 			c->length_limit = short_len;
 		}
+		if (0 == strncmp(c->string, "ZZZ", 3)) { c->length_limit = 1; }
 	}
 }
```
Maybe `c->length_limit = 1;` didn't make a difference because prune would mark these as "shallow" anyway.
Maybe I found the problem. It seems the pruning length_limit check is done at too deep a level (actually at the maximum depth) to be of any use in saving CPU time. I tried doing it one level higher, and this seems to eliminate most of the overhead. It looks like it can be done at an even higher level.
```c
static bool
right_table_search(prune_context *pc, int w, Connector *c,
                   bool shallow, int word_c)
{
	unsigned int size, h;
	C_list *cl;
	power_table *pt;

	if (word_c - w > c->length_limit) return false; /* <-- added */

	pt = pc->pt;
	size = pt->r_table_size[w];
	h = connector_hash(c) & (size-1);
	for (cl = pt->r_table[w][h]; cl != NULL; cl = cl->next)
	{
		if (possible_connection(pc, cl->c, c, cl->shallow, shallow, w, word_c))
		{
			return true;
		}
	}
	return false;
}

/**
 * This returns TRUE if the right table of word w contains
 * a connector that can match to c. shallow tells if c is shallow.
 */
static bool
left_table_search(prune_context *pc, int w, Connector *c,
                  bool shallow, int word_c)
{
	unsigned int size, h;
	C_list *cl;
	power_table *pt;

	if (w - word_c > c->length_limit) return false; /* <-- added */

	pt = pc->pt;
	size = pt->l_table_size[w];
	h = connector_hash(c) & (size-1);
	for (cl = pt->l_table[w][h]; cl != NULL; cl = cl->next)
	{
		if (possible_connection(pc, c, cl->c, shallow, cl->shallow, word_c, w))
			return true;
	}
	return false;
}
```
Another observation: all of this pruning stuff has vast optimization opportunities.
well, it has easy_match still in it... :-)
OK, I just pushed the length-limit test .. however, there still is a question -- why did the empty word cause the size of those tables to blow up?
Regarding the length-limit, it seems it will be even more effective to make this check in left_connector_update() / right_connector_update(). Then we can repeat the test of setting the LL* links to 1.
Regarding the big initial table size, my guess for now is that the expression prune is not effective at all in the presence of `ZZZ-`. I still need to check why.
BTW, the tables in the last pruning step, with and without `ZZZ-` in the sentence, are about the same size (to eliminate `ZZZ-` I changed both occurrences of KinD to KinxD).
[I will continue with these checks in about 2 hours.]
OK. I just moved the check to right_connector_update() etc
This is bizarre/unexpected: adding `if (0 == strncmp(c->string, "ZZZ", 3)) { c->length_limit = 1; }` back into preparation.c now causes: "No complete linkages found."
The check there should be different. More on that later.
OK, I'm done for now; I have other things I'm supposed to be doing. BTW, these changes make almost imperceptible changes to the run-time of corpus-fixes.batch.
Oh, BTW, here's one that used to run in 145 seconds (not sure which version, maybe 4.8.6???) but now takes 7+ minutes.
Cortes in his various letters again and again claims the Emperor's patronage of his bold defiance of the Emperor's officers on the ground that the latter in their action were moved solely by considerations of their personal gain, whereas he, Cortes, was striving to endow his sovereign with a rich new empire and boundless treasure whilst carrying into the dark pagan land, at the sword's point, the gentle creed of the Christian God.
also, this statement does not make sense:

> BTW, the tables in the last pruning step, with and without ZZZ- in the sentence, are about the same size

When I make the KinD -> KinxD substitution, the run-time gets 2x faster. That means these tables must be shorter, right?
`-v=3` prints this message (with ZZZ links):

```
Info: sane_morphism(): 828 of 1000 linkages had invalid morphology construction
```

which is very unexpected to me. This suggests the ZZZ links are failing to be length=1.
> Oh, BTW, here's one that used to run in 145 seconds (not sure which version, maybe 4.8.6???) but now takes 7+ minutes.
This is due to the last commit (moving the check to right_connector_update() etc.). It also takes much time for me with that commit. Without it, the sentence takes ~8 seconds for me (5.3.3 with the 4.8.6 dict), which is also not good enough. I will try to find the problem with this last commit and in general.
> also, this statement does not make sense:
> > BTW, the tables in the last pruning step, with and without ZZZ- in the sentence, are about the same size
> When I make the KinD -> KinxD substitution, the run-time gets 2x faster. That means these tables must be shorter, right?
I printed the table sizes, and they are about the same in their final form. However, with ZZZ the prune step takes much more CPU time.
I guess that fixing the prune slowness due to ZZZ will also improve the speed of Russian, which has always been relatively slow.
As a test, I set the length_limit of ZZZ and LL* to 1 before expression_prune(), and modified it to take this into account (a small modification).
As usual, I produced detailed linkage output (with `-links=10000`) of the "en" basic and fixes corpuses and the "ru" basic corpus. Here are the results:
Corpus | Diffs
---|---
en basic | No
en fixes | One sentence only [*]
ru basic | Only in some sentences with no complete linkages
test sentence [**] | Cannot compare - combinatorial explosion with randomly selected linkages
I then ran a benchmark (30 runs, as 6 sets of best-of-5) using the current dict. The results are astonishing.
Corpus | Before | After | % Faster
---|---|---|---
en basic | 4.61 | 4.5 | 3
en fixes | 17.2 | 16.7 | 3
ru basic | 13.9 | 7.52 | 46
test sentence [**] | 10.6 | 7.69 | 60
[*] Mr. Johnson, who was working in his field that morning, said, "The alien spaceship appeared right before my own two eyes."

[**] In vivo studies ... (using `-short=20`, as in fix-long, and of course `-spell=0`).
Questions:
```
word1: Mr.   alt0: Mr. ∅   alt1: Mr .
```
How can we proceed from here?
BTW, some efficiency improvements can be done in expression_prune(), but I have a more radical idea regarding it and power_prune() that may be worth investigating - for another issue.
With the current dicts, "alien spaceship" should not use ZZZ for the quotes. I get this:

```
Found 20996 linkages (137 of 493 random linkages had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 6.10 LEN=106)
+------------------------------------------>WV--------------------------------------
|              +-----------------------------------Ss------------------------------
|              |        +---------------------------Xc---------------------------+
|              |        |        +--------------------MVpn--------------------+  |
+-------Wd------+--MX*r--+        +-------MVp------+----Js---+                |  |
|       +---G---+  +Xd+-Ss*w+---Ost--+     |    +Ds**c+      +---DTn--+       |  |
|       |       |  |  |     |        |     |    |     |      |        |       |  |
LEFT-WALL Mr..x Johnson.m , who was.v-d working.g in.r his field.n that.j-r morning.r ,
--->+-----------------------------------------QUc----------------------------------------+
----+----------------------------------------Xi----------------------------------------+ |
    +----------------->WV----------------->+                                           | |
    +------------Wd-----------+            |          +----------Jp---------+          | |
+--QUd-+       +-----Ds**x-----+   +-------MVp-------+       +----DD---+    | | |
+-Xc-+ | | | +----AN---+-----Ss-----+---MVa---+ | +-La-+ +-Dmcn+ | |
| | | | | | | | | | | | | | |
said.q-d , " the alien.n spaceship.n appeared.v-d right.e before my.p own.a two eyes.n . "
```
Note the QUd and QUc links to the quotes.
re strange power-prune: back in the bad old days of fat links, these routines were much more complicated, as they had to prune portions of a sentence but not all of it, in strange ways. Now that the code is simpler, you are probably seeing left-overs from that more complicated design. For example, length pruning was not possible in expressions, due to fat links, because you didn't know how long a link was until after you hopped over the fat parts of the sentence. Now you can do this ... and should.
Due to the simplification of removing the fat links, there may be some fair amount of other simplification that could be done.
p.s. I plan to publish 4.3.3 in a few days or a week, for the OSX build bug
> With the current dicts, "alien spaceship" should not use ZZZ for the quotes: I get this:
But most of these links in this sentence are currently still ZZZ. I used `-links=30000` and got 10550 links (the rest have "invalid morphology"). However, due to `DISPLAY_MAX 1024`, only 1024 linkages are displayed. Of them, the quotes appear with ZZZ in 894, and with QUd in only 130. Even though there is a diff in the output of these 1024 linkages after my expression_prune() length_limit modification, for a reason I still need to investigate, the number of sentences with ZZZ and with QUd is exactly the same.
Regarding possible speed-ups in expression_prune(): there are two that are easy to implement.
> Now, you can do this ... and should.
I can clean up my addition of that for a pull request. However, I'm not sure how to cleanly set the length_limit. It is not possible to do it the way it is done before power_prune(), because at that point we still have no connectors in which to set the length_limits. So I implemented it in insert_connectors():

```c
if (string_set_cmp(ZZZ, e->u.string) || 0 == strncmp(e->u.string, "LL", 2))
	dummy->length_limit = 1;
```

However, adding the short lengths this way will have much more overhead. Please advise...
Just a note: it could be that in some languages LL links will not be able to have a length of 1, so we may need some way to indicate their maximum supported length_limit. A problem can happen even for languages in which the LL links are between a stem and its suffixes, if the word can also be broken into alternatives: empty words can be inserted between the stem and the suffix, and this is "uncontrolled" since it may be done by the flattening code. This needs to be investigated, maybe also by adding code to check for it. (If this happens, maybe the pruning code can be changed to ignore empty words when checking the length limit; this needs to be done only when a very short length_limit is violated, so it is not much overhead.)
A better way may be to do this just after reading the dictionary: create "connector descriptors" with all sorts of preparations, like the hash value, whether they are in the unlimited set, the connector length-limit type (all of which are now computed per sentence), and connector enumeration values. The expressions would then use a pointer to a connector descriptor instead of a connector string. Similarly, the connector structures would hold pointers to the connector descriptors. Is this a good idea?
BTW, it would be possible to reduce the connector structs to only 16 bytes if, instead of this connector-descriptor pointer, they included a uint16_t index into the connector descriptor table, and tableNext were eliminated (moved to the connector table, or eliminated altogether due to the use of arrays).
Alternatives are currently not considered while pruning (will open another issue).
Last thing for now (maybe need to open an issue for that): Why does the SAT parser use its own prune code?
> However, adding this way the short lengths will have much more overhead.

Sounds like a bad idea, then; we should not do it.

> just after reading the dictionary, and create "connector descriptors"

Ohh! That sounds like a good idea! I like it!

> uint16_t index of the connector descriptor table,

Sure, why not. How many types of connectors does Russian have? If connectors are auto-generated for Hebrew, might there ever be more than 65K of them?
> Why does the SAT parser use its own prune code?

Don't know. It was written before fat links were removed; the fat-link stuff just made everything excessively complex.
> > just after reading the dictionary, and create "connector descriptors"
> Ohh! that sounds like a good idea! I like it!

I will start a branch to implement it (it seems straightforward).

> > However, adding this way the short lengths will have much more overhead.
> Sounds like a bad idea, then, we should not do it.

I will then be able to implement the above in a good way.
> How many types of connectors does Russian have?

About 4200.

> If connectors are auto-genned for Hebrew, might there ever be more than 65K of them?

I haven't yet thought about auto-generating links for Hebrew, but in any case I don't expect any large quantity of connector types.
There might be more than 65K link types for the language-learning code. Not sure. Initially, each unique word-pair gets a unique link type. These are then supposed to be clustered together into much, much smaller sets, but I'm not sure at which point these get pushed back into LG. Whatever, it's in the future.
> > Why does the SAT parser use its own prune code?
> Don't know. It was written before fat links were removed; the fat-link stuff just made everything excessively complex.
If the prune code in the main library gets improved, it will be a good idea to remove the prune code from the SAT parser (maybe it would be a good idea to do so even before that...).
Twiddling SAT seems like a low priority. I glanced at the code; it wasn't a cut-n-paste, but it is doing some kind of SAT-specific stuff.
The variable holding the connector number can be defined via a typedef, so it can be `int` if needed.
yes, please!
FWIW, 30 Jan 2017: on "fanny" (I believe the same machine as in the above posts), version pre-5.3.15, where the EMPTY_WORD device has been removed, gives 9 seconds for the top-most parse. Version 5.3.14 gives about 10.5 seconds. Note that the above pre-5.3.15 also has the string-set hashing fix, which was measured to give a 10% or 15% improvement on the batch files, so removal of the empty-word seems to have had no effect on performance!(?)
> so removal of the empty-word seems to have had no effect on performance!(?)
Indeed, there is no performance gain in the classic parser from eliminating EMPTY_WORD. (The added code for that in the SAT parser even has a negative effect on its performance, but I haven't yet tried to investigate that.)
@linas, Here is a summary of this issue.
1. The apparent longer times for the demo sentence: I used `-short=10 -cost-max=2`, the defaults of 4.8.6 (in order to compare to its time). The result is that 5.4.3 actually parses this sentence in less time than 4.8.6 (less than 1 second for me, while in 4.8.6 it is slightly more than 1.1 seconds), with no null links. If you get much more than that (as indicated in your last post), I believe it is due to not disabling spelling, which in this case causes linkages with null links that take much time to parse.
Other things that are mentioned in this issue:

2. `null_count=0`.
3. Setting ZZZ as length_limit=1 (as is done now) causes bogus linkages when there are null links.
4. Setting morphology links as length_limit=1 makes linkages with null links much faster.

So I think this issue can be closed after the connector-enumeration version (which also fixes (3) above) is merged, supposing (1) is fixed by then (e.g. according to my proposal in issue #591) or deemed not to be worth fixing.
Yes, OK. "fanny" is an old machine. For the version 5.4.3 parser and dicts, after subtracting 0.3 seconds needed to load the dictionary itself, I get:
12.9 seconds, no flags
8.1 seconds for -spell=0
4.4 seconds for -short=10 -cost-max=2
3.4 seconds for -spell=0 -short=10 -cost-max=2
2.3 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts with 5.4.3 parser
2.5 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts+4.8.6 parser
So yes, the 5.4.3 parser is now faster than the 4.8.6 parser, when both use the same dict version
> 2.3 seconds for -spell=0 -short=10 -cost-max=2 for the 4.8.6 dicts with 5.4.3 parser

You will get even much less CPU time if you replace the `D` in `KinD` with e.g. `X` (in both of its occurrences), to avoid the aggressive unit splitting of this word (into `K in D`).
> KinD

Well, that's interesting. I replaced KinA by A, KinB by B, and so on, and it prints "No complete linkages found." very quickly, and then takes 3+ minutes; the null count gets up to 10 or something like that. We seem not to have a max null-count option.
You removed it some time ago. I guess its main usefulness is to limit the CPU waste on badly unparsable sentences. If we add it again, I suggest adding a message like "No linkages can be found with %d nulls or less." when the max null-count limit is exceeded.
> You have removed it some time ago.

Oops! See, I make mistakes! Just in case you weren't sure.
At least here it is my mistake: looking at the initial source code, this option was missing there too. (What you actually removed is the null-block option, which is fine and not related.)
I can add support in link-parser for a max null count as follows:

- `null`: an `int` option that will denote the maximum number of null words.
- When no parse is found within that limit, print: `No linkages found within %d null links.` (with an unlimited null count it will be as now: `No linkages found.`). Maybe the following can be added too: `Try to set the "null" variable to a higher value.`

However, I'm not sure about its usefulness, since for parsing a large number of sentences I guess people would use their own program, and the library already supports max_null.
closing
Amir: consider this:
how long does this take? With today's 5.3.3 parser and 5.3.3 dicts: 48 seconds. Some other times:
i.e. all of the above with today's parser, but other dicts.
But the 4.8.6 parser with the 4.8.6 dicts parsed this sentence in 8 seconds. So, despite performance improvements, things have gotten a lot worse.
Some of this I understand: changes to the dicts do add more possibilities. Some I don't: why is the 5.3.3 parser 4x slower than the 4.8.6 parser, when both use the same dict? According to my notes, I once parsed the above in under 4 seconds, using 5.1.0 ... so .. wtf