idumiY / lucene-gosen

Automatically exported from code.google.com/p/lucene-gosen

A string that repeats "くよ" about 20 or more times consumes excessive execution time and memory. #19

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
https://gist.github.com/1441337

A string that repeats "くよ" about 20 or more times consumes a large amount of
execution time and memory.

- Java: 1.6.0_24 (Oracle) and 1.7.0-147-icedtea (OpenJDK)
- Solr/Lucene: 3.5 and 4.0
- lucene-gosen: 1.2 (ipadic and naist-chasen), 1.3 (ipadic only; naist-chasen was not tested)

Original issue reported on code.google.com by haruyama...@gmail.com on 7 Dec 2011 at 5:20

GoogleCodeExporter commented 8 years ago
Hi,

The same phenomenon occurred in my environment:
 - lucene-gosen: 1.2(ipadic), Java: 1.7.0-1.
 - Intel Core i7-2640M @ 2.80GHz, RAM:8.0GB.

I investigated and found that:
 1) the problem is associated with Viterbi.java;
 2) it can be reproduced only when the value of lNode.rcAttr2 > 0;
 3) the first for-loop of the method "calculateConnectionCosts" runs
over a million times just before the program runs out of memory.

My guess is that the problem arises when applying the "trigram" rules of
morphological analysis written in "connection.csv",
especially when the input string has a form that forces the program
to check trigram rules consecutively, term after term.

For "くよくよ...", Gosen checks whether the input string matches the
pattern "よく/形容詞,連用テ接続 + term2/pos2 + term3/pos3"
(e.g. "よく/は/無い").
"よく/形容詞,連用テ接続" (adjective, continuative te-connection form) is the first term of the rule.

regards,
Mitsuharu Makita

Original comment by makita.m...@gmail.com on 8 Dec 2011 at 10:37
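
The observation in the comment above, that the input forces rule checks "term by term", can be illustrated with a rough sketch. It does not reproduce the lucene-gosen bug and does not use its dictionaries or Viterbi.java. It only counts how many ways a repeated "くよ" string can be segmented under a hypothetical four-word toy dictionary (く, よ, くよ, よく), to show how fast the number of candidate paths grows when every position allows more than one word. The class and method names are made up for this illustration.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SegmentationCount {
    // Hypothetical toy dictionary (an assumption); NOT ipadic or naist-chasen.
    static final Set<String> WORDS =
            new HashSet<>(Arrays.asList("く", "よ", "くよ", "よく"));

    // Builds the test input, e.g. repeat("くよ", 3) -> "くよくよくよ".
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    // Counts segmentations by dynamic programming:
    // ways[i] = number of ways to split the first i characters into words.
    static BigInteger countSegmentations(String text) {
        BigInteger[] ways = new BigInteger[text.length() + 1];
        Arrays.fill(ways, BigInteger.ZERO);
        ways[0] = BigInteger.ONE;
        for (int end = 1; end <= text.length(); end++) {
            // Toy words are at most 2 characters long.
            for (int start = Math.max(0, end - 2); start < end; start++) {
                if (WORDS.contains(text.substring(start, end))) {
                    ways[end] = ways[end].add(ways[start]);
                }
            }
        }
        return ways[text.length()];
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5, 10, 20}) {
            System.out.println(n + " repetitions: "
                    + countSegmentations(repeat("くよ", n)) + " segmentations");
        }
    }
}
```

With this toy dictionary the counts follow the Fibonacci sequence, so 20 repetitions already allow more than 100 million segmentations. A lattice search normally avoids enumerating them one by one; the sketch only suggests why any step that expands candidates per term can become expensive on this kind of input.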

GoogleCodeExporter commented 8 years ago
Sorry for the slow reply.
Thank you for the report and the investigation.

I think the following are possible provisional workarounds:

- Check the input string on the client side and split it with spaces.
- Provide some kind of internal limiter in lucene-gosen.

Original comment by johtani on 12 Dec 2011 at 2:55
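
The first workaround suggested in the comment above (splitting the input on the client side) could look roughly like the sketch below. It is only an illustration under assumptions: the class name InputGuard, the method splitLongRuns, and the maxRun threshold are all hypothetical and not part of lucene-gosen, and no particular threshold is known to be safe. Splitting at a fixed width can also cut a word in half, so the tokens near each inserted space may differ from those of the unsplit text.

```java
public class InputGuard {
    // Hypothetical helper: inserts a space after every maxRun consecutive
    // non-whitespace characters so that no unbroken run is longer than maxRun.
    static String splitLongRuns(String input, int maxRun) {
        if (maxRun <= 0) {
            throw new IllegalArgumentException("maxRun must be positive");
        }
        StringBuilder out = new StringBuilder(input.length() + input.length() / maxRun);
        int run = 0; // length of the current non-whitespace run, in code points
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                run = 0; // existing whitespace already breaks the run
            } else {
                if (run == maxRun) {
                    out.append(' '); // break the run before it grows further
                    run = 0;
                }
                run++;
            }
            out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "くよくよ くよ"
        System.out.println(splitLongRuns("くよくよくよ", 4));
    }
}
```

The helper would be applied to the text before it is handed to the analyzer. Existing whitespace resets the counter, so input that is already split into short runs passes through unchanged.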

GoogleCodeExporter commented 8 years ago
This patch is still incomplete.
Although the problem in question no longer occurs, some of the existing tests do not
pass.

The tests fail because the scores differ from the previous analysis results.

The patch also needs to be tested against much more data.
Other bugs may be lurking.

Original comment by johtani on 12 Dec 2011 at 7:20

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by johtani on 14 Dec 2011 at 6:51

GoogleCodeExporter commented 8 years ago
Sorry, the patch in comment #3 contains a bug.
On line 15, the first and second conditions are reversed.

I am analyzing this bug now.
It seems to have been introduced when a loop was moved.

Original comment by johtani on 16 Dec 2011 at 8:05

GoogleCodeExporter commented 8 years ago
Committed as r158 to trunk and the 1.2 branch.

All test cases pass.

What remains is testing against extensive data; if the results are satisfactory, it will be released.

Original comment by johtani on 19 Dec 2011 at 7:37

GoogleCodeExporter commented 8 years ago

Original comment by johtani on 20 Dec 2011 at 9:14

GoogleCodeExporter commented 8 years ago
Released in 1.2.1.

Original comment by johtani on 20 Dec 2011 at 9:15