opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Fixing panic mode #785

Open ampli opened 6 years ago

ampli commented 6 years ago

This is the current setup of panic mode:

    parse_options_set_disjunct_cost(opts, 4.0f);
    parse_options_set_min_null_count(opts, 1);
    parse_options_set_max_null_count(opts, 100);
    parse_options_set_max_parse_time(opts, 60);
    parse_options_set_islands_ok(opts, false);
    parse_options_set_short_length(opts, 12);
    parse_options_set_all_short_connectors(opts, true);
    parse_options_set_linkage_limit(opts, 100);
    parse_options_set_spell_guess(opts, 0);

Proposed fixes (a sketch of the revised setup follows the list):

  1. If !null=0, I propose to set min_null_count=0 and max_null_count=0 since there is no reason to allow parsing with nulls in that case.
  2. Copy islands_ok from the regular setup instead of hard-coding it to false.
  3. Don't set short_length and linkage_limit to more than the current ones.
  4. Document the user-variable prefix panic_.
  5. Add DEFAULT-COST and PANIC-COST optional dict entries (this will address issue #390).
  6. Add panic timeout to the SAT parser.
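
Taken together, fixes 1-3 might look something like the sketch below (the helper name and the ropts parameter holding the regular options are illustrative only, not existing code; the !null=0 condition is expressed here through the regular options):

    #include <stdbool.h>
    #include <link-grammar/link-includes.h>

    /* Illustrative sketch of the proposed panic setup; "ropts" holds the
     * regular (non-panic) options that the panic values are derived from. */
    static void setup_panic_options(Parse_Options opts, Parse_Options ropts)
    {
        parse_options_set_disjunct_cost(opts, 4.0);

        /* Fix 1: if null links are disabled in the regular setup, keep
         * them disabled in panic mode as well. */
        if (parse_options_get_max_null_count(ropts) == 0)
        {
            parse_options_set_min_null_count(opts, 0);
            parse_options_set_max_null_count(opts, 0);
        }
        else
        {
            parse_options_set_min_null_count(opts, 1);
            parse_options_set_max_null_count(opts, 100);
        }

        /* Fix 2: copy islands_ok instead of hard-coding it to false. */
        parse_options_set_islands_ok(opts, parse_options_get_islands_ok(ropts));

        /* Fix 3: never raise short_length and linkage_limit above the
         * values of the regular setup. */
        int short_len = parse_options_get_short_length(ropts);
        parse_options_set_short_length(opts, (short_len < 12) ? short_len : 12);
        int limit = parse_options_get_linkage_limit(ropts);
        parse_options_set_linkage_limit(opts, (limit < 100) ? limit : 100);

        parse_options_set_max_parse_time(opts, 60);
        parse_options_set_all_short_connectors(opts, true);
        parse_options_set_spell_guess(opts, 0);
    }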
linas commented 6 years ago

This sounds very reasonable. Could even set linkage_limit to be maybe 1/2 or 1/4th of the current limit (but at least one).

Likewise, max_time could be set to the current max_time, or maybe even half of it. Doubling max_time is another option, but that seems unwise.

ampli commented 4 years ago

I'm now trying to produce a PR to fix this issue and found something that looks like a bug. This is the condition for doing a panic parse: https://github.com/opencog/link-grammar/blob/972ea245bde7365ee6942610154da88a46d12c1f/link-parser/link-parser.c#L1005-L1008 Note the condition at https://github.com/opencog/link-grammar/blob/972ea245bde7365ee6942610154da88a46d12c1f/link-parser/link-parser.c#L1005; this check is old (dating from the start of version 4).

Consider this: when the timer is exhausted, nested do_count() calls return a count of zero (originally only for connector endpoints that are not in the table, and unconditionally after I added the connector-word cache). In both cases the number of parses may be non-zero, but the parses are then most probably incomplete. However, link-parser just ignores this condition and may print an incomplete parse without any indication that it is problematic.
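
For illustration, a check of roughly this form could be added (a sketch only, assuming the existing parse_options_timer_expired() call; it is not the actual link-parser code):

    #include <stdio.h>
    #include <link-grammar/link-includes.h>

    /* Sketch: warn when the linkage count was produced after the timer
     * expired, since such linkages may be incomplete. */
    static int parse_and_warn(Sentence sent, Parse_Options opts)
    {
        int num_linkages = sentence_parse(sent, opts);
        if ((num_linkages > 0) && parse_options_timer_expired(opts))
        {
            fprintf(stderr, "Warning: parse timer expired; "
                            "the displayed linkage may be incomplete.\n");
        }
        return num_linkages;
    }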

ampli commented 3 years ago

Edited for minor fixes.

I'm in the last step of producing a PR that fixes most of these problems and some others. Among other things, I added a link-parser command !panic_variables that exposes the panic variables, along with help text on that command and on the settings and usage of the panic variables, and I added handling of SIGWINCH if available. I also fixed some unrelated bugs in the touched code.

Some things remained unresolved and need a discussion:

  1. Add DEFAULT-COST and PANIC-COST optional dict entries (this will address issue #390).

I think they would be better called <default-max-disjunct-cost> and <panic-max-disjunct-cost>, something like:

<default-max-disjunct-cost>: COST2d7+;
<panic-max-disjunct-cost>: COST4d0+;

I encountered an implementation problem: <default-max-disjunct-cost> is private to the library. However, <panic-max-disjunct-cost> is for library users, and there is no proper API to read it. Possibilities:

  1. Use dictionary_lookup_list().
  2. Use a new API get_panic_max_disjunct_cost().
  3. Use a new dict construct for configuration definitions:
    %define variable value

    and define a new API char *get_dictionary_definition(char *variable) to get it.

(1) seems bad to me (it requires using internal structures). (2) is maybe fine. (3) is my favorite.

Also, I guess it would be a good idea to modify the parse handling of jni-client.c to be compatible with link-parser, especially in panic handling.

Another thing that I didn't fix, even though it looks easy, is adding timeout handling to the SAT parser. I will add it if I find time to apply the changes and ideas I have accumulated by now, which may speed it up considerably (including using the power pruning of the classic parser, and tracons).

So I still need a solution for implementing the dict costs. If it is better to skip this part for now, I can send this PR right away.

linas commented 3 years ago

(3) is my favorite.

Yes, adding %define variable value would be very nice. I guess we could convert <dictionary-version-number> and <dictionary-locale> to that too? And maybe 1stCAP.zzz and nonCAP.zzz ...

jni-client.c compatible

Yes

linas commented 3 years ago

Maybe use #define instead of %define, so that if I comment out line 2760 of 4.0.dict.m4 there aren't any accidents?

https://github.com/opencog/link-grammar/blob/5015230c289d7fa2f2ddd2f23ca232897f6cc218/data/en/4.0.dict.m4#L2760-L2763

ampli commented 3 years ago

#define is indeed better than %define, and is more like #include.

ampli commented 3 years ago

I added #define in the dictionary (to be submitted as a separate PR). It doesn't currently support whitespace inside the value (because the existing dict reading code handles that as an error), but if needed the fix is trivial.

For backward compatibility (testing using old dicts) I propose to leave the old-style-definition code intact and use it if the new-style definitions are not found.

ampli commented 3 years ago

I encountered a problem in defining default-max-disjunct-cost in the dict: there is no nice way to set the Parse Option disjunct_cost from it. The only way I can think of is to set it from the dict on the first call to sentence_split() or sentence_parse(), unless the user has set it beforehand. But this is not so nice.

Possibilities:

  1. Don't use default-max-disjunct-cost (only use panic-max-disjunct-cost).
  2. Use the not-so-nice-way mentioned above (Parse Option is now per dictionary anyway).
ampli commented 3 years ago

Currently the API calls to get dictionary definitions are:

const char *linkgrammar_get_dict_version(Dictionary);
const char *linkgrammar_get_dict_locale(Dictionary);

They appear in link-includes.h under "System initialization".

There is also this to get the dictionary language (there, under Functions to manipulate Dictionaries):

const char *dictionary_get_lang(Dictionary);

The question is what to call the API that gets a dictionary definition. For consistency, it seems it should be:

const char *linkgrammar_get_dict_definition(Dictionary, const char *);

If consistency were not a consideration, I would prefer:

const char *dictionary_get_definition(Dictionary, const char *);

But instead, I will use the former.

linas commented 3 years ago

linkgrammar_get_dict_definition

Yes, consistency is important.

There is no nice way to set

Perhaps we should introduce linkgrammar_set_dict_max_disjunct_cost() and convert parse_options_set_disjunct_cost() into a no-op. A note in the code already suggests that this is the right thing to do:

https://github.com/opencog/link-grammar/blob/2fc92c070f3f7591218ae7cb138ce12b33499b81/link-grammar/api.c#L89-L95

ampli commented 3 years ago

linkgrammar_set_dict_max_disjunct_cost()

The problem is that there is no place to keep this setting other than Dictionary. However, you cannot set it there if more than one thread already uses it.

Possible solution:

  1. New API: linkgrammar_get_dict_max_disjunct_cost(Dictionary).
  2. The user has to call parse_options_set_disjunct_cost(opts, linkgrammar_get_dict_max_disjunct_cost(dict)); before the first use of opts. Otherwise a hard-coded value of 2.7 would be used. (The not-so-nice solution would just use the default-max-disjunct-cost from the dict, if available.)
  3. Your idea as a possible API addition: linkgrammar_set_dict_max_disjunct_cost(Dictionary, value) for setting the value for the given dictionary. To be used only when single-threaded, or before other threads that would use the dictionary start.

link-parser, as an example application, would then use (2).
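
For concreteness, possibility (2) might look like the following sketch in an application; linkgrammar_get_dict_max_disjunct_cost() is the API proposed above, not an existing call:

    #include <link-grammar/link-includes.h>

    int main(void)
    {
        Dictionary dict = dictionary_create_lang("en");
        Parse_Options opts = parse_options_create();

        /* Pick up the dictionary's default max cost before the first use
         * of opts; otherwise the hard-coded 2.7 stays in effect.
         * linkgrammar_get_dict_max_disjunct_cost() is the proposed API. */
        parse_options_set_disjunct_cost(opts,
            linkgrammar_get_dict_max_disjunct_cost(dict));

        /* ... create and parse sentences as usual ... */

        parse_options_delete(opts);
        dictionary_delete(dict);
        return 0;
    }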

linas commented 3 years ago

Why would someone want to keep changing the value? Normally, this (and all the options) are something that you set once, and never change again.

I also don't understand the multi-threaded complaint. Updating a single float in some struct is almost atomic; it seems very unlikely that other threads will be racing to read this value.

ampli commented 3 years ago

Say I have an application that monitors a chat system in multiple threads, with a shared dict. A sentence in some thread hits a combinatorial explosion, and the algorithm in use tries to reparse with a lower max disjunct cost. Or there is no full parse, and it tries to reparse with a slightly higher disjunct cost.

Why would someone want to keep changing the value?

The above seems to me a valid reason.

I also don't understand the multi-threaded complaint.

It is not related to a possible race condition, but to affecting other threads that shouldn't be affected.

linas commented 3 years ago

Once upon a time, long long ago, I attempted to keep a histogram of costs during counting, with an attempt to discover how the count varied as a function of cost. This code never worked very well; I don't recall why. One hope was to be able to dynamically adjust costs, as it became clear that a combinatoric explosion was about to happen.

You have now looked at, and thought about counting a lot more than I have. Is there some natural place to dynamically adjust max-cost? For example, as I recall, there is a line that looks like count += p->n_left * p->n_right and if either n_left or n_right is greater than 1000, that is clearly a problem area, and one could immediately set max-cost lower, if one knew how much lower to set it. If one knew, while counting, what the distribution was, where the edge was... is there a way to easily and cheaply know this?

ampli commented 3 years ago

If the reason for doing so is to get better parses, it may be better to do it after running mk_parse_set(), in a scan of extractor_t similarly to set_overflowed(). Selected disjuncts in extractor_t can then be disabled according to some criteria, like counts and costs.

One problem for me in implementing that is that I don't understand how extract_links() works. But I may try to do so after some more discussion to clarify things.

ampli commented 3 years ago

Regarding a dict default-max-disjunct-cost definition, I propose to add both linkgrammar_get_dict_max_disjunct_cost() and linkgrammar_set_dict_max_disjunct_cost(), and at the same time NOT to make parse_options_set_disjunct_cost() a no-op.

This way the addition would be totally backward compatible, would provide a way to use it without interfering with other threads that already use the same dict (the documentation of parse_options_set_disjunct_cost() may warn about that possibility), and would be easy to use in new code if desired.

linas commented 3 years ago

So I guess that Parse_options will store a default max cost of -10000. Then, during parsing, if Parse_options->cost > -1000 then that value will be used; otherwise the value from the dictionary will be used? OK, that is a reasonable solution.

I don't understand how extract_links() works

Me neither. It's really weird. There's a clever trick in it that I never fully understood. But this happens after counting; I was hoping to make max cost dynamically adjustable during counting, to avoid overflows. (Indirectly, I am also interested in the actual distribution, for "scientific reasons", just to understand how the number of parses increases as the cost-max is increased.)

Another "scientifically interesting" question is how the number of post-processing rules varies, depending on max cost. Originally, post-processing was designed to prohibit bad low-cost parses; but perhaps some of the high-cost parses are prohibited as well? Is this a flat distribution? Does it slope? In which direction?

Some more science: in (real, natural) neurons, we have excitatory and inhibitory synapses. In DNA expression/transcription, it can be promoted and suppressed. In link-grammar, the connectors are the "excitatory" subsystem, and post-processing is the inhibitory subsystem. I am trying to better understand how to correctly think about inhibitory systems whenever one has a connector-based network (such as link-grammar).

ampli commented 3 years ago

So I guess that Parse_options will store a default max cost of -10000. Then, during parsing, if Parse_options->cost > -1000 then that value will be used; otherwise the value from the dictionary will be used? OK, that is a reasonable solution.

In such an implementation, we need to decide what parse_options_get_disjunct_cost() would return before the first parse, the dummy initial value, or the hard-coded 2.7 one. I also guess that linkgrammar_get_dict_max_disjunct_cost() and linkgrammar_set_dict_max_disjunct_cost() may not be needed.

I was hoping to make max cost dynamically adjustable during counting, to avoid overflows.

Cannot we avoid overflows using my suggested method if we discard "offending disjuncts" in extractor_t and recalculate the multiplications? I don't see why doing it on the fly while parsing is better, as at any point we don't know the future parse results (i.e. we know one side of the multiplication and not the other one). I think that in extractor_t we have the full info that is needed to avoid overflows. The only problem I see is that there could be a combinatorial explosion in the number of ways to avoid overflows. But maybe some heuristics can be employed.

I am also interested in the actual distribution, for "scientific reasons"

It seems to me there is no great secret there, as, for each sentence, the number of disjuncts monotonically increases with max-cost. If you have more disjuncts then you have more opportunities for creating links, and as a result more opportunities for both sides of the multiplication to be non-zero, resulting in more parses.

Originally, post-processing was designed to prohibit bad low-cost parses

I thought its main purpose is to prohibit link combinations that cannot be prohibited (or are complex to prohibit) using the dict syntax (I encountered such a problem in my attempts to construct a Hebrew dict).

but perhaps some of the high-cost parses are prohibited as well? Is this a flat distribution? Does it slope? In which direction?

Cannot we just check the cost of the domains that are rejected by post-processing?

Some more science: in (real, natural) neurons, we have excitatory and inhibitory synapses. In DNA expression/transcription, it can be promoted and suppressed. In link-grammar, the connectors are the "excitatory" subsystem, and post-processing is the inhibitory subsystem. I am trying to better understand how to correctly think about inhibitory systems whenever one has a connector-based network (such as link-grammar).

Another thing is that post-processing is about the relation of links and connectors, while the dict defines relations of connectors. It also allows specifying a kind of non-adjacent relation.

BTW, cannot you look at connector definitions as both allowing and denying, as they deny the connection of "unmatched" connectors? Post-processing, on the other hand, can only deny.

On the same occasion, an idea on post-processing and combinatorial explosions: provide feedback between post-processing and extractor_t, in which, when a domain construction is rejected, stuff gets marked in extractor_t to invalidate extracting the offending domain. After encountering a certain number of linkages without P.P. violations, the multiplications in extractor_t can be recalculated in the hope of avoiding overflows. This may have a non-negligible success rate, because in long sentences there are only a few linkages w/o P.P. violations per 10K (or even 100K) linkages.

I made the following test to see how a reduction in the number of disjuncts reduces overflows: when eliminating duplicate disjuncts I didn't consider the disjunct string (the remaining disjunct just represents multiple strings, each with its own cost). The number of overflows in the failures batch was then reduced by about 3% (and the parse speed increased by about 3%).

linas commented 3 years ago

So I guess that Parse_options will store a default max cost of -10000. Then, during parsing, if Parse_options->cost > -1000 then that value will be used; otherwise the value from the dictionary will be used? OK, that is a reasonable solution.

In such an implementation, we need to decide what parse_options_get_disjunct_cost() would return before the first parse, the dummy initial value, or the hard-coded 2.7 one.

For Russian (and other languages) it might be the case that 2.7 would be wrong. So it should return DEFAULT_MAX_COST which would be either #define DEFAULT_MAX_COST -10000 or (better yet) static const int DEFAULT_MAX_COST = -10000;.
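
In code, the dispatch might read something like this sketch (the names follow the discussion above and are not the actual library internals):

    /* Illustrative only: DEFAULT_MAX_COST is the sentinel stored in
     * Parse_Options until the user explicitly sets a cost. */
    static const double DEFAULT_MAX_COST = -10000.0;

    static double effective_max_disjunct_cost(double opts_cost,
                                              int dict_has_define,
                                              double dict_cost)
    {
        if (opts_cost > -1000.0) return opts_cost; /* explicitly set by the user */
        if (dict_has_define) return dict_cost;     /* value from the dict define */
        return 2.7;                                /* hard-coded fallback */
    }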

I also guess that linkgrammar_get_dict_max_disjunct_cost() and linkgrammar_set_dict_max_disjunct_cost() may not be needed.

You'd still want to have linkgrammar_get_dict_max_disjunct_cost() because you don't know what's actually in the dict, otherwise. But the set would not be needed (or wanted; having it would be .. confusing.)

ampli commented 3 years ago

Implementation questions:

  1. I noted that dict definitions are placed, seemingly at random, in both dict-common.h and dict-defines.h. I think most, if not all, should be moved to dict-defines.h. In any case, I will put DEFAULT_MAX_COST there.
  2. What should be the exact names of the max-cost definitions? Is this OK?
    #define default_max_disjunct_cost 2.7
    #define panic_max_disjunct_cost 4.0
  3. We already have this unrelated function name defined in dict-impl.c and read-sql.c, which is misleading now. Should it be renamed, and if so, to which name?

BTW, isn't 4.0 too high for panic parsing, as it allows headlines? Also, panic is about parse time, not an inability to produce a full parse. I don't understand why increasing the max-cost is a good strategy in such a case, as it increases the parse time by much due to the added disjuncts.

ampli commented 3 years ago

either #define DEFAULT_MAX_COST -10000 or (better yet) static const int DEFAULT_MAX_COST = -10000;.

I propose to call it UNINITIALIZED_MAX_COST.

ampli commented 3 years ago

In case you would like me to change something (of course I can also make changes after I submit the PR), here is what I defined:

In dict-defines.h:

/* If the maximum disjunct cost is yet uninitialized, the value defined
 * in the dictionary (or if not defined then DEFAULT_MAX_COST) is used. */
static const double UNINITIALIZED_MAX_COST = -10000.0;
static const double DEFAULT_MAX_COST = 2.7;

Even though the max-cost default is 2.7, I added its definition to en/4.0.dict.m4 and to all the other dicts, as follows:

% The default largest disjunct cost to consider during parsing.
#define default_max_disjunct_cost 2.7

I added to link-includes.h (and also updated link-grammar.def):

link_public_api(const char *)
    linkgrammar_get_dict_definition(Dictionary, const char *);

The problem is that currently the terminology in the source files refers to LEFT-WALL, etc., as a dictionary definition, so this may be confusing. Maybe linkgrammar_get_dict_define() would be better?

ampli commented 3 years ago

I made some changes.

  1. In the dict I used - instead of _: #define default-max-disjunct-cost 2.7
  2. In .h I added: static char DEFAULT_MAX_COST_DEFNAME[] = "default-max-disjunct-cost";
ampli commented 3 years ago

Above, I said:

BTW, isn't 4.0 too high for panic parsing, as it allows headlines? Also, panic is about parse time, not an inability to produce a full parse. I don't understand why increasing the max-cost is a good strategy in such a case, as it increases the parse time by much due to the added disjunct.

Answering myself: it seems a higher max-cost may be needed because we also use a shorter connector length. For long sentences (in which panic mode is usually triggered), this means we may not get a parse without many null links unless the syntax is loosened somewhat.

(Not related: In my code quotes above I had several syntax problems. All got fixed and the code works fine now. I still have to complete the conversion of the version and domain definitions.)

ampli commented 3 years ago

@linas, I finished the dict #define library implementation and dict conversion. I will need your input regarding my naming concerns (mentioned above) to make fixes if needed.

ampli commented 3 years ago

Above, I said:

BTW, isn't 4.0 too high for panic parsing, as it allows headlines? Also, panic is about parse time, not an inability to produce a full parse. I don't understand why increasing the max-cost is a good strategy in such a case, as it increases the parse time by much due to the added disjunct.

Answering myself: it seems a higher max-cost may be needed because we also use a shorter connector length. For long sentences (in which panic mode is usually triggered), this means we may not get a parse without many null links unless the syntax is loosened somewhat.

I found another reason for parsing with a higher max-cost in panic mode: if we allow null links, this may allow the sentence to parse with fewer of them, resulting in faster parsing.

linas commented 3 years ago

Cannot we avoid overflows using my suggested method if we discard "offending disjuncts" in extractor_t and recalculate the multiplications?

I don't recall what that is... and I don't know how to search for that discussion

we know one side of the multiplication and not the other one

So you are saying that one side may be large, but the other side might be zero?

I am also interested in the actual distribution, for "scientific reasons"

It seems to me there is no great secret there, as, for each sentence, the number of disjuncts monotonically increases with max-cost.

Is it linear? quadratic? cubic? For the hand-built English corpus, maybe it is not that interesting; it's the kind of thing I would look at for the automatically learned languages.

Originally, post-processing was designed to prohibit bad low-cost parses

I thought its main purpose is to prohibit link combinations that cannot be prohibited (or are complex to prohibit) using the dict syntax (I encountered such a problem in my attempts to construct a Hebrew dict).

Complexity is in the eye of the beholder. I always found it much, much easier to modify the dictionary than to try to mess with post-processing. I suspect the original authors were trying to keep the dictionary as small as possible, and thought that they could do that with post-processing. So now we have a large, complex dictionary... is that good? bad? Is there a better way? I plan to spend more time thinking about higher-order mechanisms...

non-adjacent relations.

Well, with disjuncts, you can propagate constraints from one disjunct to the next, so it is not hard to create long-range, complex interlocking chains. I don't know if this is easier or harder than post-processing; I guess I am saying that I need to think about all this in a new way... thinking takes time.

failures batch.

OK, just realize that this is a pathological batch; it's a collection where the current English dict fails to work. It's not clear that performance-tuning for this batch is a good idea.

I mean, you can do any kind of tuning you want! Don't let me stop you! For me, I took a very long break, to do other things; I want to get back to language learning, and when I last worked on it, there was one

linas commented 3 years ago
  1. I noted that dict definitions are placed, seemingly at random, in both dict-common.h and dict-defines.h. I think most, if not all, should be moved to dict-defines.h. In any case, I will put DEFAULT_MAX_COST there.

One of these files is "public", and visible to API users. The other is private. If I remember correctly, dict-common is private. It is common because it has defines that are shared by both the file-backend and the sql backend.

  1. What should be the exact names of the max-cost definitions? Is this OK?
    #define default_max_disjunct_cost 2.7
    #define panic_max_disjunct_cost 4.0

Uh, but the whole point is to define these in the dictionary, and *not* in the source code. So you are confusing me. Oh, you mean in the dictionary. Yes, OK... err, don't call it "default", just call it max_disjunct_cost.

  1. We already have this unrelated function name defined in dict-impl.c and read-sql.c, which is misleading now. Should it be renamed, and if so, to which name?

?? What function? dictionary_setup_defines()? That is a fine place to initialize dictionary-related definitions...

BTW, isn't 4.0 too high for panic parsing, as it allows headlines?

Without detailed statistics, we'll never know.

Also, panic is about parse time, not an inability to produce a full parse. I don't understand why increasing the max-cost is a good strategy in such a case, as it increases the parse time by much due to the added disjuncts.

The 4.0 value is maybe a 20-year-old decision that has not been reviewed. If the original problem was that all parses were failing post-processing, then one answer is to examine more parses. But perhaps a better answer is to not do post-processing? Perhaps there are other answers.

It might be the case that only a handful of disjuncts, when used in a long sentence, are responsible for almost all PP failures. If we avoid those disjuncts, then maybe there won't even be an explosion! If so, then removing those disjuncts from the dict would be the easiest solution. But without detailed statistics, this is impossible to know, and I am left to guess blindly.

Gathering these statistics is what I mean by "science". It is hard work to gather them. And once you have them, you don't always learn anything; or maybe you learn something "obvious" and are left empty-handed. But without them, you are at zero, and can only guess.

linas commented 3 years ago

linkgrammar_get_dict_define() is good

linas commented 3 years ago

(during panic parsing) we also use a shorter connector length

This is also a 20-year-old decision that has not been reviewed. I think it makes sense, from a gut-feel perspective -- even "unlimited" connectors should probably be limited to 40 or 50 words, or less. However, without any idea of what the distribution is, it's hard to say. Ideally, we would have 2-5 corpora: late 20th-century newspapers; the Karl Marx Manuscripts of 1844 (famous for very long sentences); early 20th-century adventure books from Project Gutenberg (Tarzan, Huck Finn, etc.); Wikipedia (famous for having very few action verbs). Then measure:

I have measured some of this in the past. What have I learned? Uhh, "interesting stuff". Is it useful? Uhh, "it depends.". I think it is worth repeating this, and writing it up in a detailed, formal fashion.

Can we use it to tune panic-parsing? Yes, if we are clever.

ampli commented 3 years ago

Cannot we avoid overflows using my suggested method if we discard "offending disjuncts" in extractor_t and recalculate the multiplications

I don't recall what that is... and I don't know how to search for that discussion

I referred to what I said earlier in this discussion:

If the reason for doing so is to get better parses, it may be better to do it after running mk_parse_set(), in a scan of extractor_t similarly to set_overflowed(). Selected disjuncts in extractor_t can then be disabled according to some criteria, like counts and costs.

However, it seems to me now that just scanning the table for that purpose is not good. Instead, the extractor_t table can be scanned recursively, accumulating the costs when going deeper. Then by some heuristics, costly disjuncts can be discarded and the costs recalculated, repeating that until the count is "reasonable".

So you are saying that one side may be large, but the other side might be zero?

When a partial parse gets large, it may be too late to "correct" this locally, because the costly disjuncts (that could be omitted) may be deep inside the partial parse. If they are located and omitted, it seems that the said partial parse has to be redone (we don't have a hierarchy in the connector table that would allow just recalculating the cost). But maybe some other partial measures can be taken, like forbidding, in form_match_list(), matches between two costly disjuncts (over some cost), but I don't know how effective this may be.

It seems to me there is no great secret there, as, for each sentence, the number of disjuncts monotonically increases with max-cost.

Is it linear? quadratic? cubic? For the hand-built English corpus, maybe it is not that interesting; it's the kind of thing I would look at for the automatically learned languages.

I don't know, but this may now be checked by a Python program that uses the !!// command.

Complexity is in the eye of the beholder. I always found it much, much easier to modify the dictionary than to try to mess with post-processing. I suspect the original authors were trying to keep the dictionary as small as possible and thought that they could do that with post-processing. So now we have a large, complex dictionary... is that good? bad? Is there a better way? I plan to spend more time thinking about higher-order mechanisms...

Some things seem "impossible" to do with LG rules, like forcing a constraint according to the link string, and things like "must form a cycle" and "cannot include ...".

failures batch.

OK, just realize that this is a pathological batch; it's a collection where the current English dict fails to work. It's not clear that performance-tuning for this batch is a good idea.

I also use the fix-long batch for checks. Lately, I have been using pandp-union.txt. Regarding the failures in the failures batch, I noted that many (most?) of them are due to tokenization support, and some of the problems can easily be fixed. I suggest first doing that using the current tokenizer.

ampli commented 3 years ago

One of these files is "public", and visible to API users. The other is private. If I remember correctly, dict-common is private. It is common because it has defines that are shared by both the file-backend and the sql backend.

I will take this into account when I fix the define PR.

confusing

It might be less confusing if I hadn't forgotten the ; at the end (which is needed in the dict).

call it max_disjunct_cost

OK.

dictionary_setup_defines()? That is a fine place to initialize dictionary-related definitions...

OK.

BTW, isn't 4.0 too high for panic parsing, as it allows headlines?

Without detailed statistics, we'll never know.

BTW, it is now difficult to cause a panic (in 30 seconds) for long sentences that have a full parse, and increasing the connector length doesn't help much with that. Long parse times are still a possible problem when parsing with null links, when many nulls are present. This can be improved in some cases, and more work is needed on that.

It might be the case that only a handful of disjuncts, when used in a long sentence, are responsible for almost all PP failures.

This is very interesting to check. I will try to see how to do that.

then removing those disjuncts from the dict would be the easiest solution

We can restrict their usage with a dialect.

linkgrammar_get_dict_define() is good

OK, I will change it.

Regarding connector length, I tried this with the fix-long batch:

    -short-length  CPU time
    20             7.43
    50             14.35
    100            20.91
    200            22.74

Of course, it is not a "scientific" test, but it shows that the new code is less length-dependent than the old code. This may be improved even further by my WIP, but I didn't test that yet.

ampli commented 3 years ago

But this happens after counting; I was hoping to make max cost dynamically adjustable during counting, to avoid overflows.

Why may it be better to do it during counting?

ampli commented 3 years ago

But perhaps a better answer is to not do post-processing?

Regrettably, in my panic-redone code only a list of predefined panic-* variables can be used, and for now they don't include panic_bad. Of course, panic_bad can be added, but the general problem is that arbitrary variables still cannot be defined for panic mode. (I thought that most of them are useless for panic mode - this could be a bad design decision, and I guess it can be reversed if desired by adding back and adapting the old panic-* code.)

ampli commented 3 years ago

You'd still want to have linkgrammar_get_dict_max_disjunct_cost()

Should we also have linkgrammar_get_dict_panic_max_disjunct_cost(), or should the API user just use something like linkgrammar_get_dict_define(dict, PANIC_MAX_COST_DEFNAME)?

ampli commented 3 years ago
  1. I noted that dict definitions are placed, seemingly at random, in both dict-common.h and dict-defines.h. I think most, if not all, should be moved to dict-defines.h. In any case, I will put DEFAULT_MAX_COST there.

One of these files is "public", and visible to API users. The other is private. If I remember correctly, dict-common is private. It is common because it has defines that are shared by both the file-backend and the sql backend.

I cannot find any indication that dict-defines.h is (or is intended to be) visible to API users. It is also not listed in include_HEADERS.

linas commented 3 years ago

dict-defines.h -- OK, yes, this was a "temporary" file created during refactoring, and was left as-is because it was "good enough".


I think this would be best: linkgrammar_get_dict_define(dict, LG_PANIC_MAX_COST) because we already have enough (too many?) API calls.

linas commented 3 years ago

Regarding connector length

OK, so, a tiny amount of psycho-linguistic theory. Sentences that have short links are the easiest for human beings to understand. This is actually measurable, by appropriately designed psychology experiments... and has been measured in dozens of experiments for dozens of languages. They've also measured understandability as a function of the number of connectors per word. (One interesting result is that English has been getting simpler over the last 400 years. Now, there aren't any 400-year-old English speakers, so you can't measure them, but you can parse (by hand) sentences from Shakespeare, and count the lengths of the links (not using LG, but using generic dependency-grammar links). And 20th century English is simpler than 19th century English, and even 19th century English is simpler than 18th century English. I think that is .. interesting.)

Now, the goal of parsers in general, or LG in particular, is to create linkages that correspond to what is happening in the brain of a "typical" English speaker. And, from these psychology experiments, we know that almost all links are short.

There are a few exceptions, though. One is the extremely long link from the beginning of the sentence to the period/punctuation at the end. This violates the short-link rule, and it also violates psychology. Now, what do I mean by "violates psychology"? Well, language is processed by two different parts of your brain. One part is just assembling words, one by one, creating short links. It works subconsciously, automatically, without effort. Another part is performing rational, conscious reasoning. Now, the rational-conscious reasoning part knows that, eventually, there should be some end to a sentence. It has the power, the ability to scan the entire sentence, and look for that ending. The unconscious, automatic part does not know, and does not care. It's like a conveyor belt in a factory, assembling words as they come, and if someone is talking on the phone, and the phone hangs up before the end of the sentence, it doesn't really care, it doesn't get upset. (The rational, thinking part does care, does get upset, and blames the battery, the antenna, the cell tower, whatever)

The problem with LG is that it is a mix of both. That makes things .. problematic. The extremely long link from the beginning of sentence to the period at the end really does not belong in the dictionary. It is useful, but it really should be added in post-processing. Another example: LG does not add links from "he/she/it" to the person/thing being referred to. This must be done at some later stage. These are very long links that cross sentence boundaries.

The "best fix" might be to identify which LG links are long, or "too long", and remove them from the base dictionary, and only add them back in during some post-processing stage. So, obviously the end-of-sentence-punctuation link. Maybe some others? If we could add links during post-processing, for end-of-sentence, then maybe we can also add links for reference resolution (he/she/it). Removing long links in the "surface parsing" stage will make the parser run faster (as you've shown). It might also improve accuracy, if we can somehow verify during post-processing.

ampli commented 3 years ago

I think this would be best: linkgrammar_get_dict_define(dict, LG_PANIC_MAX_COST) because we already have enough (too many?) API calls.

We could also provide LG_DICTIONARY_VERSION and LG_DICTIONARY_LOCALE and deprecate the corresponding API calls.

ampli commented 3 years ago

LG does not add links from "he/she/it" to the person/thing being referred to. This must be done at some later stage. These are very long links that cross sentence boundaries.

This and anaphora resolution in general can be a very nice project. We may start with such intra-links when possible. I thought of an implementation using LG rules. However, the current syntax and semantics of LG rules seem to me too limited, even for what it does just now.

ampli commented 3 years ago

There is also a question of what to do with dict-defines.h and dict-common.h. Possibilities:

  1. I will just add all the new definitions of the define PR to dict-defines.h, and at a future cleanup we will decide what to do.
  2. Move to dict-defines.h all the constant definitions, including existing ones, and leave in dict-common.h only the data structure definitions.
  3. Move all dict-defines.h to dict-common.h.
ampli commented 3 years ago

... linkgrammar_get_dict_define(dict, LG_PANIC_MAX_COST)

I changed LG_PANIC_MAX_COST to LG_PANIC_DISJUNCT_COST, to be consistent with LG_DISJUNCT_COST.

Currently in dict-defines.h:

#define LG_DISJUNCT_COST                        "max-disjunct-cost"

In the WIP for the new panic mode, in link-includes.h:

#define LG_PANIC_DISJUNCT_COST "panic-max-disjunct-cost"

link_public_api(const char *)
    linkgrammar_get_dict_define(Dictionary, const char *);
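
A sketch of how an application might then read the panic cost, using the WIP declarations above (the helper name and the fallback value are illustrative):

    #include <stdlib.h>
    #include <link-grammar/link-includes.h>

    /* Sketch: fetch panic-max-disjunct-cost via the WIP dict-define API. */
    static double get_panic_max_disjunct_cost(Dictionary dict)
    {
        const char *value = linkgrammar_get_dict_define(dict, LG_PANIC_DISJUNCT_COST);
        if (value == NULL) return 4.0; /* fall back to the old hard-coded value */
        return atof(value);
    }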
linas commented 3 years ago

dict-defines.h

I have no opinion...

linas commented 3 years ago

This and anaphora resolution in general can be a very nice project. We may start with such intra-links when possible. I thought of an implementation using LG rules. However, the current syntax and semantics of LG rules seem to me too limited, even for what it does just now.

OK, so .. I keep saying ... the opencog AtomSpace is supposed to be a link-grammar-like generic infrastructure for connecting and linking things together. There are pros and cons: pros: it's well developed, debugged, has lots of tools. Cons: It has a number of design issues and problems. Is inventing something similar-but-different a good idea or a bad idea? Well, that depends on the day of the week.

I'd love to have this conversation with you.

ampli commented 3 years ago

I'd love to have this conversation with you.

One problem is that I know nothing about AtomSpace. I admit that I should learn it.

linas commented 3 years ago

I know nothing about AtomSpace.

You don't have to. I could explain the important parts as we go along. There are also several wiki pages and tutorials. But really, I'm interested in discussing the overall theory, of how things could work.

ampli commented 3 years ago

In the next few days, I will read some papers on anaphora resolution so that I will be able to say something meaningful about that issue.

ampli commented 3 years ago

I have read so far:

  1. ANAPHORA RESOLUTION: THE STATE OF THE ART
  2. Issues_in_Anaphora_Resolution

I took notes on how LG can help as a foundation. But the problem is, of course, much broader than what LG can do by itself.

I intend to continue to read some more papers today and in the next few days. Do you have particular ones that you recommend? We also need to open an issue for discussion. I can open it in the LG repository because we can first discuss using LG as a foundation.

linas commented 3 years ago

I have not read either (I'll try to, "real soon now"). Nor was I suggesting that we should use one particular algorithm vs. another. Instead, I was trying to raise a different issue: how might one, in principle, indicate references? Where is this data stored? What is the format? The markup might not be "anaphora", it might be other references, like TLA (three-letter acronym? or Texas Library Association?). The verb "saw" - is the meaning "to cut" or the past tense of "see"? All of this is markup. My provisional solution is that all markup goes into the AtomSpace.

I would encourage you to read Sylvie Kahane "Meaning Text Theory" -- this has many marvelous insights into language (that are link-grammar compatible, but require something like the atomspace to represent everything.)

Yes, having a distinct issue for this might be best.

ampli commented 3 years ago

how might one, in principle, indicate references? Where is this data stored? What is the format?

Some anaphora resolution articles suggest data structures and even code (e.g. Issues_in_Anaphora_Resolution mentioned above). There are also several anaphora resolution projects on GitHub and additional free ones elsewhere. I think we should investigate them for the best solutions and invent something else if there is no existing solution for something or the existing one is not good.

But in general, it seems to me we need to store word (or word range) locations, their type, and their attributes.

The markup might not be "anaphora", it might be other references, like TLA

I would treat such things exactly like anaphora (or cataphora), and if there is no appropriate antecedent in the given text (TLA may be defined in the text), then I would use auxiliary text, in this case an acronym table (represented in the AtomSpace, with a list of attributes per acronym).

In any case: I feel I need to know more about AtomSpace to say something meaningful that includes this term. I also must read much more on anaphora/cataphora resolution.

But first I will read Meaning Text Theory. I guess you mean https://www.aclweb.org/anthology/J87-3006.pdf.

Yes, having a distinct issue for this might be best.

What should the subject of that issue be?