idumiY / lucene-gosen

Automatically exported from code.google.com/p/lucene-gosen

Solr gets stuck under high load with Japanese queries #25

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create an index of Japanese texts with solr.JapaneseTokenizer.
2. Start a stress test with SolrMeter (http://code.google.com/p/solrmeter/) at a
high queries-per-minute setting (say 300) with Japanese queries.
3. About 6-8 minutes later, Solr's responses periodically get stuck on the client
side (see attachment). On the server side, the load average becomes very high
and the JVM runs Full GC unexpectedly often.

What is the expected output? What do you see instead?
Solr returns search results smoothly, even under relatively high load.

What version of the product are you using? On what operating system?
lucene-gosen-1.2.1-ipadic.jar, Solr 3.5, OpenJDK 1.6 and CentOS 5.

Please provide any additional information below.
I confirmed that this issue reproduces only when I pass Japanese queries to Solr
through SolrMeter. There was no problem when I passed alpha-numeric only
queries with the same settings.
While the issue was occurring, I couldn't find any relevant errors or
warnings in Solr's log. But summing up the situation above, I guess a deadlock
occurs somewhere in lucene-gosen under high load. Code related to multibyte
characters may be relevant.
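
To check the deadlock guess against the frequent Full GC, one option is to ask
the JVM itself. The following is only a minimal sketch using the standard
java.lang.management API. The class name StallProbe is hypothetical, and the
numbers are only meaningful if the code runs inside the same JVM as Solr, which
is an assumption about the setup. From outside the process, `jstack <pid>` and
`jstat -gcutil <pid> <interval>` give similar information.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Solr or lucene-gosen.
final class StallProbe {

    /** Number of threads the JVM currently reports as deadlocked (0 if none). */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    /** Total number of collections over all collectors since JVM start. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
        System.out.println("GC count: " + totalGcCount());
        System.out.println("GC time (ms): " + totalGcTimeMillis());
    }
}
```

If the deadlocked thread count stays at 0 while the GC count and time grow
quickly during the stalls, memory pressure would be a more likely cause than a
deadlock.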

Original issue reported on code.google.com by ya...@hatena.ne.jp on 24 Feb 2012 at 12:27

Attachments:

GoogleCodeExporter commented 8 years ago
Hi,
Thanks for using lucene-gosen.

If you do not mind, could you tell me the settings of each
FieldType (lucene-gosen and alpha-numeric)?
How many query patterns did you put in queries.txt?
What size did you set for queryResultCache?

-Jun

Original comment by johtani on 27 Feb 2012 at 9:03

GoogleCodeExporter commented 8 years ago
Hi, thanks for your response.

Throughout the stress test, I used the following FieldType setting.
   <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
   </fieldType>

Just to clarify, I tried 2 types of queries.txt (Japanese or alpha-numeric)
against a field of this FieldType only.
After I changed the type of queries from Japanese to alpha-numeric, Solr no
longer got stuck and started working smoothly.

> How many query patterns did you put in queries.txt?
I set 5,000 patterns each for the Japanese queries and the alpha-numeric queries
(i.e. 5000 lines in japanese-queries.txt and another 5000 lines in
alpha-numeric-queries.txt).
I also tried 1000 patterns, but I got the same results.

> What size did you set for queryResultCache?
I tried the default (512), a larger value (5000) and no cache (commenting out the
relevant part in solrconfig.xml), but I got the same results.

- Yanbe

Original comment by ya...@hatena.ne.jp on 27 Feb 2012 at 10:37

GoogleCodeExporter commented 8 years ago
Sorry for the late reply.

Are the "Result count" values in the "Query Statistics" tab close to each other
in the two stress tests (Japanese and alpha-numeric)?

> Just to clarify, I tried 2 types of queries.txt (japanese or alpha-numeric)
> for a field of this FieldType only.
> After I changed type of queries from japanese to alpha-numeric, Solr didn't
> stuck and started working smoothly.

Does queries.txt contain phrase queries?
Alpha-numeric terms are not tokenized by the lucene-gosen tokenizer.
This means that the lucene-gosen tokenizer does not create a lattice for
alpha-numeric input.

The lucene-gosen tokenizer processes text as follows:

 1. Create a lattice.
 2. Find the best path.
 3. Return the tokens.

Many candidate paths are found when tokenizing Japanese text.
For this reason, many objects are generated.
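
As a rough illustration of the three steps above, here is a toy sketch. It is
not lucene-gosen's actual code: the class LatticeSketch, its methods, the tiny
dictionary and the word costs are all made up for this example, and a real
morphological analyzer typically also scores the transition between
neighbouring words, which this sketch ignores. It only shows why a short
Japanese input already produces several candidate node objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Toy example only; names, dictionary and costs are hypothetical.
final class LatticeSketch {

    /** One candidate word covering text[start, end) with a made-up cost. */
    static final class Node {
        final int start, end, cost;
        final String surface;
        Node(int start, int end, String surface, int cost) {
            this.start = start; this.end = end; this.surface = surface; this.cost = cost;
        }
    }

    /** Tiny dictionary: Tokyo, Kyoto and the three single characters. */
    static Map<String, Integer> sampleDict() {
        Map<String, Integer> dict = new HashMap<String, Integer>();
        dict.put("\u6771\u4eac", 1); // "Tokyo"
        dict.put("\u4eac\u90fd", 1); // "Kyoto"
        dict.put("\u90fd", 2);       // "metropolis"
        dict.put("\u6771", 3);       // "east"
        dict.put("\u4eac", 3);       // "capital"
        return dict;
    }

    /** Step 1: create the lattice, one node per dictionary word at each offset. */
    static List<Node> buildLattice(String text, Map<String, Integer> dict) {
        List<Node> nodes = new ArrayList<Node>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                String surface = text.substring(start, end);
                Integer cost = dict.get(surface);
                if (cost != null) {
                    nodes.add(new Node(start, end, surface, cost));
                }
            }
        }
        return nodes; // ordered by start offset
    }

    /** Steps 2 and 3: find the cheapest path and return its tokens. */
    static List<String> bestPath(int length, List<Node> lattice) {
        int[] best = new int[length + 1];
        Node[] back = new Node[length + 1];
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        // Nodes are ordered by start offset, so best[node.start] is final here.
        for (Node node : lattice) {
            if (best[node.start] == Integer.MAX_VALUE) continue; // unreachable offset
            int cost = best[node.start] + node.cost;
            if (cost < best[node.end]) {
                best[node.end] = cost;
                back[node.end] = node;
            }
        }
        LinkedList<String> tokens = new LinkedList<String>();
        if (best[length] == Integer.MAX_VALUE) return tokens; // no complete path
        for (int pos = length; pos > 0; pos = back[pos].start) {
            tokens.addFirst(back[pos].surface);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u6771\u4eac\u90fd"; // three characters
        List<Node> lattice = buildLattice(text, sampleDict());
        System.out.println("lattice nodes: " + lattice.size());
        System.out.println("tokens: " + bestPath(text.length(), lattice));
    }
}
```

With this toy dictionary the three-character input yields five lattice nodes,
and the cheapest path is the two tokens "Tokyo" + "metropolis". Longer Japanese
queries produce many more nodes per request.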

Reference sites (lucene-gosen performance tests):

http://wiki.livedoor.jp/haruyama_seigo/d/Solr/Tokenizer%c9%be%b2%c1201105
http://www.rondhuit.com/solr%E3%81%AE%E6%97%A5%E6%9C%AC%E8%AA%9E%E5%AF%BE%E5%BF%9C.html

Original comment by johtani on 2 Mar 2012 at 8:38