opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1

Nondeterministic behavior of link-parser #798

Closed alexei-gl closed 6 years ago

alexei-gl commented 6 years ago

link-parser returns different parses when parsing the same corpus file multiple times. I carried out two tests, with 29 and 30 runs, at different points in time. The first test gave me two different versions of the parses: the first version repeated 2 times out of 29, while the second repeated 27 times out of 29. The second test gave me two variants as well: the first repeated 8 times out of 30, the second 22 times out of 30. The inconsistently parsed sentences differ between the two tests. I attached all test files. issue.tar.gz

alexei-gl commented 6 years ago

I forgot to mention that I used the latest GitHub version.

ampli commented 6 years ago

The problem you encounter is due to panic mode kicking in. You don't see this because you used !verbosity=0, which also disables the timeout notification.

The default timeout is 30 seconds, and the parsing of some sentences just takes longer than that. This happens because the sentences contain many occurrences of the "'" character, which is almost always null-linked, and many of them also create alternative tokenizations (which add parsing overhead). Since the parse is done with the "allow null links" option, it is repeated for every potential null count, with more possible parses each time.

E.g., the following sentence actually needs 40 seconds on my computer (I also used !spell=0, see below), and the result has null-count=6 with 1.5M parses: tis so,' said the duchess: 'and the moral of that is — oh, 'tis love, 'tis love, that makes the world go round! '

Since the exact place in the parse process at which the panic timeout kicks in is random (due to CPU-time fluctuations), the parses obtained at panic time may differ between runs.

Solution: You may need to disable panic mode, or increase the timeout by much. Alternatively, or in addition, if you don't need to process the actual parses with null links, you may disable parsing with null links. Also make sure your library is compiled with spelling disabled (I guess it is), or use !spell=0.
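For example, a sketch of such an invocation (corpus.txt and corpus.post are placeholder names; link-parser generally accepts its interactive !var=value variables as -var=value command-line options, but verify the exact flags against your build, and drop -null=0 if you do need null-linked parses):

$ # panic off, timeout effectively off, no spell-guessing, no null links
$ link-parser en -panic=0 -timeout=99999999 -spell=0 -null=0 \
      -verbosity=1 -postscript=1 -graphics=0 < corpus.txt > corpus.post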

While at it, some other observations (unrelated to this issue):

  1. Your corpus file starts with a BOM, which disrupts its first word (use e.g. less alice*.txt to see it; a removal sketch follows this list).
  2. It is hard enough to do unsupervised-learning tests with a well-formatted English corpus, and working with garbled files may make it much harder, so I think it would be better to first work with a well-formatted corpus and only later extend the work to various degrees of garbled corpora. Mainly these things garble this corpus:
    • "'" characters at start and end of sentences (those that are prepended to words in sentence middle sometimes belong to these words and are mostly OK) .
    • It seems you converted everything to lowercase. Among other things, this makes all the sentences with the word "I" (23% of them) not parse correctly at all (using the en dict; I guess that with the any dict this doesn't matter).
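For point 1, something like the following should strip the BOM (a sketch assuming GNU sed and a UTF-8 BOM):

$ # remove a leading UTF-8 BOM (bytes EF BB BF) from each file, in place
$ sed -i '1s/^\xEF\xBB\xBF//' alice*.txt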

BTW, here is a fast way to check the parse consistency:

$ md5sum 02/*.post | sort | uniq -w 32 -c
     22 ce61a41036492f0336c636f8810420f2  02/t1.post
      8 dab02f220204e4fd8447a86bf2ea8a8b  02/t15.post
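(uniq -w 32 compares only the 32-character md5 field, so each output line counts the runs that produced byte-identical parse files; a fully deterministic setup would show a single line.)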
alexei-gl commented 6 years ago

Thank you for the deep explanation! I am going to run another test taking into account the settings you pointed out, and will let you know.

alexei-gl commented 6 years ago

-panic=0 did not help much; the results are still nondeterministic. -null=0 -panic=0 makes it deterministic, but that is not an option in my case, because of the loss of the incomplete parses, which are important for estimating parsing quality when dealing with induced grammars. issue2.tar.gz

ampli commented 6 years ago

I said:

You may need to disable panic mode, or increase the timeout by much.

The "or" was misleading (sorry): A high timeout must be used, because the normal parse stops according to the given timeout even if panic mode is disabled. You may use something like !timeout=99999999 to effectively disable the timeout.(Maybe a fix should be introduced to disable the timeout on !timeout=0.)

You may still want to use a low-enough timeout to guard against very long sentences that may take much CPU time (even many hours) to produce a result, e.g. !panic=0 !timeout=500 (but if you use this with link-parser !verbosity=0, you may then get an inconsistent result without any indication of whether there was a timeout).

alexei-gl commented 6 years ago

I realize that a timeout is necessary in most cases, and setting !timeout=99999999 might not be a good idea. Is there a way to set some realistic timeout value and still get deterministic results on corpora with long sentences, even when the parsing of some of them is cut off by the timeout?

ampli commented 6 years ago

Here is a more detailed (and long) analysis of the (ir)repeatability.

So in the problematic case we have 4 factors we can address in order to increase the repeatability:

  1. Reduce the chance of a timeout.
  2. Reduce the number of possible parses.
  3. Increase linkage_limit by much.
  4. Produce more than one linkage (i.e. produce many linkages), sort the equal-metric ones according to some non-metric criterion (like the alphabetic order of their linkage representation), and only then take the first one.

Here are some improvement proposals for each of these factors:

  1. Reduce the chance of a timeout. This can be done by using less "garbled" input sentences. The current sentences are severely garbled by single quotes that should actually have been double quotes: the current English dict expects quoted text to use double quotes (of several kinds). If the character code of the quotes is wrong according to this expectation, the parse quality becomes very bad, and the parse also takes very much time (due to the resulting null links). It is as if you used an unexpected code for the letter "a" and still expected useful parses...

Solution: A. If the said bad quotes exist in the raw input (it is apparently preprocessed), then just add such a quote character to the dict, as follows:

-""": QUd- or <post-quote> or [[ZZZ-]];
+""" '.x: QUd- or <post-quote> or [[ZZZ-]];

B. Also, some preprocessing seems to separate contractions, e.g. "but it doesn 't matter a bit". This disrupts the working of the LG library in the case of some contractions (like that one). The above sentence is parsed correctly only without the separation, and takes much more time to be parsed (incorrectly) with the separation.

Solution: In case such separations are needed for the language-learning process and you would like to use the exact same sentences with the English dict, I propose to add to the dict the unrecognized separated contractions (they can be easily found in the dict, where they are currently defined as whole words), e.g.: doesn_'_t: doesn't;

C. Apparently, some preprocessing lowercases everything. For initial language learning this may be fine. But if you want to compare to real English parses, changing "I" to "i" disrupts the parsing and contributes to possible timeouts (for each "i" which has no links, the parse process restarts with a higher null_links value).

Solution: I propose to leave "I" as is, or to add "i" as its synonym in the dict.

D. Avoid the problem altogether... Use -verbosity=1 and filter out timed-out sentences (see the filtering sketch after this list).

  2. Reduce the number of possible parses (num_linkages in the problem description). Problems 1.A to 1.C may also increase the number of parses. For example, 1.B multiplies the number of parses by 4.

  3. Increase linkage_limit by much. For example, use -limit=100000. But too high a value may increase the chance of timeouts. (A change can be introduced in the LG library code so that linkages that fail postprocessing are not all counted against the limit; this would help in producing more valid linkages and is equivalent to enlarging linkage_limit for such sentences only. But I don't know by how much it would increase repeatability in the case of timeouts.)

  4. Produce more than one linkage. There is a testing option "-test=next-auto-linkage:N" that produces all the linkages (up to min(linkage_limit, N)). For each sentence, the first ones that have an equal metric can then be sorted (by a script that understands their metric and multi-line nature), and the first one after this sorting has a higher chance to be repeatable. (I use a similar script to compare parses between releases, which often otherwise differ due to subtle changes in the order of equal-metric linkages.) A rough sketch appears at the end of this comment.
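For 1.D, a rough filtering sketch. The names corpus.txt, corpus.post and corpus.log are placeholders, and the exact wording of the timeout notice is an assumption here; verify it against your build first:

$ link-parser en -verbosity=1 -postscript=1 -graphics=0 \
      < corpus.txt > corpus.post 2> corpus.log
$ # count the sentences that hit the timeout, then exclude them (or the
$ # whole run) from the consistency statistics
$ grep -c "Timer is expired" corpus.log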

The proposals in (1) and (2) may be the most effective ones. BTW, it may be interesting to see the original raw corpus file (the one you used for this issue).
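For proposal 4, here is a rough shell sketch (not the actual comparison script mentioned above). It assumes the captured output (parses.out, a placeholder name) contains only linkage records, each starting with its "Linkage N, cost vector = (...)" line; clean away any trailing non-linkage lines first:

$ # collapse each linkage record onto one line, keep only the records that
$ # share the first (best) cost vector, then break ties alphabetically
$ awk '/Linkage [0-9]+, cost vector/ {
          if (rec != "") print rec
          sub(/Linkage [0-9]+, /, "")   # drop the varying linkage index
          rec = $0; next
      }
      rec != "" { rec = rec " " $0 }
      END { if (rec != "") print rec }' parses.out |
  awk -F")" 'NR == 1 { first = $1 } $1 == first' |
  sort | head -n 1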

alexei-gl commented 6 years ago

Thanks again for the explanation. The corpus file is in the attachment. I set up a few other tests. Hopefully a solution will be found. test-corpus.tar.gz

bgoertzel commented 6 years ago

Alexei,

a few comments...

If the link grammar takes so long to parse a sentence that the timeout is hit, the parse is almost surely fairly incorrect anyway.

So I think you would do OK just to set a fairly high timeout, and then throw out all sentences/parses on which your timeout is hit, assuming these are too complex for the link parser to deal with right now... and just calculate your statistics based on the other sentences...

Also, I think that removing all caps is a bad idea when using the English dictionary in the link parser. It's maybe OK for ULL experiments (I still don't like it there, but at least it's LESS of an obviously bad idea there), but it's a really bad idea when using the link parser English dictionary, which is built to sometimes use capitalization as a cue to syntax...

ben



alexei-gl commented 6 years ago

Ben,

We've already discussed and agreed with Anton and the team on keeping caps for the sake of getting proper reference parses, while staying lowercased for the rest of the pipeline.

As for the timeout, it seems to be the only reasonable way of handling this issue, along with corpus cleaning as Amir suggested. There is still one thing bothering me: timeout/linkage_limit values appear to be corpus/dictionary dependent. Sentences that are inconsistent for, say, the Gutenberg/en pair might not be so for Gutenberg/some_induced_grammar, and a quality estimation comparing those two will not be correct. Don't you think so? We will discuss it with Anton tomorrow.

ampli commented 6 years ago

Hi @alexei-gl,

The corpus file is in the attachment. test-corpus.tar.gz

This is the corpus as attached to your tests. It looks like a processed version of an original corpus: it has separated contractions, it is all lowercase, and it uses the same character code for quotes and apostrophes. If indeed the original version was different, it would be interesting to see it in its raw, unmodified form.

alexei-gl commented 6 years ago

Hi @ampli, please see the attachment. alice-original.tar.gz

linas commented 6 years ago

Is this issue resolved, or is this still an issue?

alexei-gl commented 6 years ago

It works pretty much deterministically on our current corpora with relatively short sentences if panic is set to zero and an infinite (very long) timeout is specified. For the current corpora, processing time is satisfactory. For corpora with long sentences, the parse-quality estimation algorithm is going to be improved to exclude inconsistent parses, if any, from the estimation.

alexei-gl commented 6 years ago

One more question. When I am piping sentences to link-parser, I usually get one linkage for each sentence in return. Is there any way of getting some limited number of linkages for each input sentence without switching to interactive mode?

ampli commented 6 years ago

Yes.

  4. Produce more than one linkage. There is a testing option "-test=next-auto-linkage:N" that produces all the linkages (up to min(linkage_limit, N)). For each sentence, the first ones that have an equal metric can then be sorted (by a script that understands their metric and multi-line nature), and the first one after this sorting has a higher chance to be repeatable. (I use a similar script to compare parses between releases, which often otherwise differ due to subtle changes in the order of equal-metric linkages.)

Here are some more details and clarifications:

ampli commented 6 years ago

One more point:

alexei-gl commented 6 years ago

I might be missing something, but it doesn't seem to work. Even when I set the -test option as you recommend, I still get only the first linkage:

echo "My horse moved on hoof after hoof he raised and never stopped when down behind the cottage roof at once the bright moon dropped." | link-parser en -postscript=1 -graphics=0 -test=next-auto-linkage:3
postscript set to 1
graphics set to 0
test set to next-auto-linkage:3
link-grammar: Info: Dictionary found at /home/alex/miniconda3/envs/ull4/share/link-grammar/en/4.0.dict
link-grammar: Info: Dictionary version 5.5.1, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.5.0. Enter "!help" for help.
link-grammar: Warning: Tests enabled: ,next-auto-linkage:3,
No complete linkages found.
Found 410 linkages (332 had no P.P. violations) at null count 3
    Linkage 1, cost vector = (UNUSED=3 DIS=-0.51 LEN=67)
[(LEFT-WALL)(my.p)(horse.n-m)(moved.v-d)(on)([hoof])(after)([hoof])(he)(raised.v-d)
(and.j-v)(never.e)(stopped.v-d)(when)([down])(behind.p)(the)(cottage.n)(roof.n)(at)
(once)(the)(bright.a)(moon.n)(dropped.v-d)(.)]
[[0 25 6 (Xp)][0 3 2 (WV)][0 2 1 (Wd)][2 3 0 (Ss*s)][1 2 0 (Ds**c)][3 6 1 (MVs)][3 4 0 (K)]
[6 10 2 (CV)][6 8 0 (Cs)][8 10 1 (Ss)][9 10 0 (VJlst)][10 23 5 (Os)][10 20 4 (MVa)][10 15 3 (MVp)]
[10 13 2 (MVs)][13 15 0 (Mp)][10 12 1 (VJrst)][11 12 0 (E)][15 18 2 (Js)][16 18 1 (Ds**x)][17 18 0 (AN)]
[19 20 0 (IDBXA)][21 23 1 (Ds**x)][22 23 0 (A)][23 24 0 (Mv)]]
[0]

link-grammar: Info: Freeing dictionary en/4.0.dict
link-grammar: Info: Freeing dictionary en/4.0.affix
Bye.

If I get you right, three linkages should be returned instead of one.

ampli commented 6 years ago

Sorry... This is because I had a typo in my original post from 10 days ago, and then copied/pasted this typo into my last post... It works fine if you specify -test=auto-next-linkage:3. BTW, the doc is in debug/README.md.
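With the corrected option name, the same pipeline returns up to three linkages:

$ echo "My horse moved on hoof after hoof he raised and never stopped when down behind the cottage roof at once the bright moon dropped." | link-parser en -postscript=1 -graphics=0 -test=auto-next-linkage:3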

alexei-gl commented 6 years ago

It works. Thanks!

ampli commented 6 years ago

Please reopen this issue if something more can be done regarding it.