I forgot to mention that I used the latest GitHub version.
The problem you encounter is due to panic mode kicking in.
You don't see this because you used !verbosity=0, so the timeout notification is also disabled.
The default timeout is 30 seconds, and the parsing of some sentences just takes more than that.
This happens since the sentences have many occurrences of the "'" character, which are almost always null-linked, and also many of them create alternative tokenizations (which add parsing overhead).
Since the parse is done with the "allow null links" option, it repeats for every potential null link, with more possible parses every time.
E.g. the following sentence actually needs 40 seconds on my computer (I also used !spell=0, see below), and the result has null-count=6 with 1.5M parses:
tis so,' said the duchess: 'and the moral of that is — oh, 'tis love, 'tis love, that makes the world go round! '
Since the exact place in the parse process at which the panic timeout happens is random (due to CPU-time processing fluctuations), the parses at panic time may be different on different runs.
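For reference, a quick way to time the parse of a single sentence from the command line (a sketch only; slow-sentence.txt is a placeholder file containing just the sentence above, and the measured time will of course vary per machine):
$ time link-parser en -spell=0 -verbosity=1 < slow-sentence.txt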
Solution:
So you may need to disable panic mode, or increase the timeout by much.
Alternatively, or in addition, if you don't need to process the actual parsing with null links, you may disable parsing with null links. Also make sure your library is compiled with spelling disabled (I guess it is so), or use !spell=0.
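For example, a batch invocation along these lines (a sketch only; corpus.txt is a placeholder and the timeout value is just an illustration - add -null=0 only if you can do without the incomplete parses):
$ link-parser en -verbosity=1 -spell=0 -panic=0 -timeout=600 -postscript=1 -graphics=0 < corpus.txt > corpus.post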
On the same occasion, some other (non-related to this issue) observations:
The sentences contain stray "'" characters at their start and end (use less alice*.txt to see that); those that are prepended to words in the middle of a sentence sometimes belong to these words and are mostly OK. I found that they cause many sentences (23% of them) not to parse correctly at all (using the en dict; I guess that with the any dict this doesn't matter). BTW, here is a fast way to check the parse consistency:
$ md5sum 02/*.post | sort | uniq -w 32 -c
22 ce61a41036492f0336c636f8810420f2 02/t1.post
8 dab02f220204e4fd8447a86bf2ea8a8b 02/t15.post
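To see which sentences actually differ between the two variants, you can then diff one file from each group, e.g.:
$ diff 02/t1.post 02/t15.post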
Thank you for the deep explanation! I am going to run another test taking into account the settings you pointed out and let you know.
-panic=0 did not help much. The results are still nondeterministic. -null=0 -panic=0 makes it deterministic, but it is not an option in my case, because of the loss of incomplete parses, which are important for estimating parsing quality when dealing with induced grammars.
issue2.tar.gz
I said:
So you may need to disable panic mode, or increase the timeout by much.
The "or" was misleading (sorry):
A high timeout must be used, because the normal parse stops according to the given timeout even if panic mode is disabled. You may use something like !timeout=99999999 to effectively disable the timeout. (Maybe a fix should be introduced to disable the timeout on !timeout=0.)
You may still want to use a low-enough timeout to guard against very long sentences that may take much CPU time (even many hours) to produce a result. E.g.: !panic=0 !timeout=500 (but if you use this with link-parser !verbosity=0, you may then get an inconsistent result without an indication of whether there was a timeout).
I realize that a timeout is necessary in most cases and that setting !timeout=99999999 might not be a good idea. Is there a way to set some realistic timeout value and still get deterministic results on corpora with long sentences, even though parsing of some of them is ended by the timeout?
Here is a more detailed (and long) analysis of the (ir)repeatability:
- If the number of linkages is not more than the linkage_limit parse option (-limit=N in link-parser, which has a default of 1000), the linkages should always be repeatable (a bug in that has recently been fixed in the repository).
- If the number of linkages is greater than -limit=N, a random sampling happens (it is actually repeatable pseudo-random by default, and a bug in it was fixed in the previous version). In case of no timeout, the result should also always be repeatable.
So in the problematic case we have 4 factors that we can improve in order to increase the repeatability:
Here are some improvement proposals for each of these factors:
1. Fix the corpus problems that cause null links (you cannot feed the parser mangled tokens and still expect useful parses...).
Solution:
A. If the said bad quotes exist in the raw input (it is apparently preprocessed), then just add such a quote character to the dict, as follows:
-""": QUd- or <post-quote> or [[ZZZ-]];
+""" '.x: QUd- or <post-quote> or [[ZZZ-]];
B. Also, some preprocessing seems to separate contractions, e.g.
but it doesn 't matter a bit
This disrupts the working of the LG library in cases of some contractions (like that).
The above sentence is parsed correctly only without the separation, and takes much more time to be parsed - incorrectly - with the separation.
Solution:
In case such separations are needed for the language-learning process and you would like to use the exact same sentences with the English dict, I propose to add to the dict the unrecognized separated contractions (they can be easily found in the dict - they are currently defined as whole words), e.g.:
doesn_'_t: doesn't;
C. Apparently, some preprocessing lowercases everything.
For initial language learning this may be fine. But if you want to compare to real English parses, changing "I" to "i" disrupts the parsing and contributes to possible timeouts (for each "i" which has no links, the parse process restarts with a higher null_links value).
Solution:
I propose to leave "I" as is, or to add "i" as its synonym in the dict.
D. Avoid the problem altogether... Use -verbosity=1 and filter out timed-out sentences (see the sketch below).
2. Reduce the number of possible parses (num_linkages in the problem description). Problems 1.A to 1.C may also increase the number of parses. For example, 1.B multiplies the number of parses by 4.
3. Increase linkage_limit by much. For example, use -limit=100000. But too high a value may contribute to the chance of timeouts.
(A change can be introduced in the LG library code to not count part of the failed linkages due to postprocessing - this will help in producing more valid linkages and is equivalent to enlarging the linkage_limit for such sentences only. But I don't know by how much it may increase repeatability in the case of timeouts.)
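Regarding 1.D, a possible way to do the filtering (a sketch only; I assume here that the timeout notification contains the text "Timer is expired" - check the exact wording your version prints and on which stream it appears):
$ link-parser en -verbosity=1 -panic=0 -timeout=500 -limit=10000 -postscript=1 -graphics=0 < corpus.txt > corpus.post 2> corpus.err
$ grep -c "Timer is expired" corpus.post corpus.err   # count timed-out parses in each output stream
Sentences flagged this way can then be excluded from the quality statistics.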
The proposals in (1) and (2) may be the most effective ones. BTW, it may be interesting to see the original raw corpus file (that you used for this issue).
Thanks again for the explanation. The corpus file is in the attachment. I set up a few other tests. Hopefully a solution will be found. test-corpus.tar.gz
Alexei,
a few comments...
If the link grammar takes sooo long to parse a sentence that the timeout is hit, the parse is almost surely fairly incorrect anyway.
So I think you would do OK just to set a fairly high timeout, and then throw out all sentences/parses on which your timeout is hit, assuming these are too complex for the link parser to deal with right now... and just calculate your statistics based on the other sentences...
Also, I think that removing all caps is a bad idea for using the English dictionary in the link parser. It's maybe OK for ULL experiments (I still don't like it there, but at least it's LESS of an obviously bad idea there), but it's a really bad idea if using the link parser English dictionary, which is created to use capitalization as a cue to syntax sometimes...
ben
Ben,
We've already discussed and agreed with Anton and the team on keeping caps for the sake of getting proper reference parses, while staying lowercased for the rest of the pipeline.
As for the timeout, it seems to be the only reasonable way of handling this issue, along with corpus cleaning as Amir suggested. There is still one thing bothering me: the timeout/linkage_limit values appear to be corpus/dictionary dependent. Sentences that are inconsistent for, let's say, the Gutenberg/en pair might not be so for Gutenberg/some_induced_grammar, and quality estimation comparing those two will not be correct. Don't you think so? We will discuss it with Anton tomorrow.
Hi @alexei-gl,
The corpus file is in the attachment. test-corpus.tar.gz
This is the corpus as attached to your tests. It looks like a processed version of an original corpus: it has separated contractions, it is all lowercase, and it uses the same character code for quotes and apostrophes. If the original version was indeed different, it would be interesting to see it in its raw, unmodified form.
Hi @ampli, please see the attachment. alice-original.tar.gz
Is this issue resolved, or is this still an issue?
It works pretty much deterministically on our current corpora with relatively short sentences if panic is set to zero and an infinite (very long) timeout is specified. For the current corpora, processing time is satisfactory. For corpora with long sentences, the parse-quality estimation algorithm is going to be improved to exclude inconsistent parses, if any, from the estimation.
One more question. When I pipe sentences to link-parser I usually get one linkage for each sentence in return. Is there any way to get some limited number of linkages for each input sentence without switching to interactive mode?
Yes.
- Produce more than one linkage. There is a testing option "-test=next-auto-linkage:N" that produces all the linkages (up to min(linkage_limit, N)). For each sentence, the first linkages that have an equal metric can then be sorted (by a script that understands their metric and multi-line nature), and the first one after this sorting then has a higher chance to be repeatable. (I use a similar script to compare parses between releases, which often are otherwise not the same due to subtle changes that change the order of equal-metric linkages.)
Here are some more details and clarifications:
- The design of link-parser in batch mode is to produce one linkage at most (this behaviour got inherited from its 4.x version). In order to facilitate mass linkage tests, I added the said testing option.
- Use a high enough -limit=L, to prevent a linkage random sampling or to have enough random samples that pass postprocessing. If it is not high enough and you have random sampling, the best linkage may not get extracted at all and/or you may get a linkage with a higher number of unlinked words (because the random sampling of parsing with fewer null links didn't happen to find any postprocessing-valid linkage).
- The default linkage limit for link-parser is 1000 (for the C library and the Python bindings it is only 100!). You may want to use maybe 10000 or even more (on the classic parser it doesn't have too big a performance impact, but note that it has some impact and hence may contribute to timeouts).
- link-parser displays up to 1024 linkages. You can change that by using an optional :N after -test=next-auto-linkage. Up to min(L, N) linkages are then displayed.
One more point:
- The order of equal-metric linkages depends on the C library (e.g. its rand() and qsort() functions). Changes in the C library may cause a different linkage order even for the same LG library version (e.g. when you use it on different machines that have different C libraries).
I might be missing something, but it doesn't seem to work. Even when I set the -test option as you recommend, I still get only the first linkage:
echo "My horse moved on hoof after hoof he raised and never stopped when down behind the cottage roof at once the bright moon dropped." | link-parser en -postscript=1 -graphics=0 -test=next-auto-linkage:3
postscript set to 1
graphics set to 0
test set to next-auto-linkage:3
link-grammar: Info: Dictionary found at /home/alex/miniconda3/envs/ull4/share/link-grammar/en/4.0.dict
link-grammar: Info: Dictionary version 5.5.1, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.5.0. Enter "!help" for help.
link-grammar: Warning: Tests enabled: ,next-auto-linkage:3,
No complete linkages found.
Found 410 linkages (332 had no P.P. violations) at null count 3
Linkage 1, cost vector = (UNUSED=3 DIS=-0.51 LEN=67)
[(LEFT-WALL)(my.p)(horse.n-m)(moved.v-d)(on)([hoof])(after)([hoof])(he)(raised.v-d)
(and.j-v)(never.e)(stopped.v-d)(when)([down])(behind.p)(the)(cottage.n)(roof.n)(at)
(once)(the)(bright.a)(moon.n)(dropped.v-d)(.)]
[[0 25 6 (Xp)][0 3 2 (WV)][0 2 1 (Wd)][2 3 0 (Ss*s)][1 2 0 (Ds**c)][3 6 1 (MVs)][3 4 0 (K)]
[6 10 2 (CV)][6 8 0 (Cs)][8 10 1 (Ss)][9 10 0 (VJlst)][10 23 5 (Os)][10 20 4 (MVa)][10 15 3 (MVp)]
[10 13 2 (MVs)][13 15 0 (Mp)][10 12 1 (VJrst)][11 12 0 (E)][15 18 2 (Js)][16 18 1 (Ds**x)][17 18 0 (AN)]
[19 20 0 (IDBXA)][21 23 1 (Ds**x)][22 23 0 (A)][23 24 0 (Mv)]]
[0]
link-grammar: Info: Freeing dictionary en/4.0.dict
link-grammar: Info: Freeing dictionary en/4.0.affix
Bye.
If I get you right, three linkages should be returned instead of one.
Sorry... This is because I had a typo in my original post from 10 days ago and then copied/pasted this typo to my last post....
It works fine if you specify -test=auto-next-linkage:3.
BTW, the doc is in debug/README.md.
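For reference, your earlier command with the corrected option name is:
$ echo "My horse moved on hoof after hoof he raised and never stopped when down behind the cottage roof at once the bright moon dropped." | link-parser en -postscript=1 -graphics=0 -test=auto-next-linkage:3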
It works. Thanks!
Please reopen this issue if something more can be done regarding it.
link-parser returns different parses when parsing the same corpus file multiple times. I carried out two tests with 29 and 30 runs at different points in time. The first test gave me two different versions of parses: the first version repeats 2 times out of 29, while the second one repeats 27 times out of 29. The second test gave me two different variants of parses, where the first one repeats 8 times out of 30 and the second one repeats 22 times out of 30. In the two tests, the inconsistent sentence parses are different. I attached all the test files. issue.tar.gz