Closed: carmel closed this issue 2 years ago.
Index on a single thread?
You can choose the number of threads when you create the index writer.
Also you forgot to link your project. It is very hard to answer without that.
My repo: https://github.com/carmel/tantivy-server/blob/main/src/index/mod.rs
I've configured the number of threads via parameters, but the result is the same.
```rust
fn get_index_writer(index: &Index) -> Result<IndexWriter> {
    index
        // .writer_with_num_threads(
        //     CONF.index.thread_num,
        //     CONF.index.total_heap_size * 1024 * 1024,
        // )
        .writer(CONF.index.total_heap_size * 1024 * 1024)
        .map_err(|e| {
            Error::new(
                ErrorKind::Other,
                format!("Index writer_with_num_threads: {}", e),
            )
        })
}
```
Also, other errors are reported here:
2021-11-13T07:01:20.863959+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8205.
2021-11-13T07:01:20.944466+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8259.
2021-11-13T07:01:21.058333+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8213.
2021-11-13T07:01:21.186539+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8220.
2021-11-13T07:01:21.202291+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8191.
2021-11-13T07:01:21.455724+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8309.
2021-11-13T07:01:34.879204+08:00 INFO tantivy::indexer::index_writer - Preparing commit
2021-11-13T07:01:41.382270+08:00 INFO tantivy::indexer::index_writer - Prepared commit 3362920
2021-11-13T07:01:41.382391+08:00 INFO tantivy::indexer::prepared_commit - committing 3362920
2021-11-13T07:01:41.386730+08:00 INFO tantivy::indexer::segment_updater - save metas
2021-11-13T07:01:41.418678+08:00 INFO tantivy::indexer::segment_updater - Running garbage collection
2021-11-13T07:01:41.418829+08:00 INFO tantivy::directory::managed_directory - Garbage collect
2021-11-13T07:01:41.458176+08:00 INFO tantivy_server::index::add - Commit succeed, docstamp at 3362920
These are not errors, they are INFO logs.
The message `Buffer limit reached` tells you that a segment (filled with the indicated number of documents) has taken more memory than the budget you give to your indexer. In the code, the condition is written like this: `if mem_usage >= memory_budget - MARGIN_IN_BYTES`.
This message is totally fine.
But the number of documents per segment is quite low, so you must have a low memory budget or huge documents.
Can you also print the number of threads and the memory budget in your logs? Thanks!
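The flush condition described above can be sketched as a tiny helper. Note that the value of `MARGIN_IN_BYTES` below (1 MB) is only an illustrative assumption; the real constant is internal to tantivy's index writer:

```rust
// Sketch of the per-thread flush check described above.
// ASSUMPTION: MARGIN_IN_BYTES = 1 MB is illustrative only; the real
// constant is an internal detail of tantivy's index writer.
const MARGIN_IN_BYTES: usize = 1_000_000;

/// Returns true when an indexing thread should flush its in-memory
/// segment: memory usage has reached the budget minus a safety margin.
fn should_flush(mem_usage: usize, memory_budget: usize) -> bool {
    mem_usage >= memory_budget - MARGIN_IN_BYTES
}

fn main() {
    let budget = 8 * 1024 * 1024; // e.g. ~8 MB per indexing thread
    assert!(!should_flush(1_000_000, budget));
    assert!(should_flush(budget, budget));
    println!("flush triggers at {} bytes", budget - MARGIN_IN_BYTES);
}
```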
OK, the variable `total_heap_size` in the code below is 50 MB. As you can see, I did not set the number of threads in the current example:
```rust
index
    // .writer_with_num_threads(
    //     CONF.index.thread_num,
    //     CONF.index.total_heap_size * 1024 * 1024,
    // )
    .writer(CONF.index.total_heap_size * 1024 * 1024)
```
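For context, a back-of-the-envelope calculation connects the 50 MB budget to the small ~8,200-doc segments in the logs, assuming the overall heap is split evenly across the indexing threads (the even split and the thread count of 6 are assumptions here, not facts from the thread):

```rust
// Back-of-the-envelope: per-thread memory budget when the overall heap
// is split evenly across indexing threads (ASSUMED behaviour; the
// thread count of 6 below is hypothetical).
fn per_thread_budget(total_mb: usize, num_threads: usize) -> usize {
    total_mb * 1024 * 1024 / num_threads
}

fn main() {
    let budget = per_thread_budget(50, 6);
    // ~8.3 MB per thread; each thread flushes a segment as soon as it
    // fills that small budget, hence many small segments.
    println!(
        "per-thread budget: {} bytes (~{} MB)",
        budget,
        budget / (1024 * 1024)
    );
}
```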
I'm not sure how you end up with > 100% CPU here.
Do you have more than one indexer? Or maybe you commit after adding every single doc and there are a lot of merge threads working? Can you share your meta.json file?
I only created two indexes.
My add_index function:
```rust
pub fn add_index(index_json: &str) -> Result<()> {
    let json_index = serde_json::from_str::<IndexData>(index_json)?;
    let index = get_index(json_index.index)?;
    index
        .tokenizers()
        .register("jieba", jieba_tokenizer::JiebaTokenizer {});
    let schema = index.schema();
    let schema_clone = schema.clone();
    let mut index_writer = get_index_writer(&index)?;
    if CONF.index.is_merge {
        index_writer.set_merge_policy(Box::new(NoMergePolicy));
    }
    for m in json_index.data {
        let data = serde_json::to_string(&m)?;
        match schema_clone.parse_document(&data) {
            Ok(doc) => {
                index_writer.add_document(doc);
            }
            Err(e) => {
                // index_writer.rollback();
                return Err(Error::new(
                    ErrorKind::Other,
                    format!("DocParsingError: {}", e),
                ));
            }
        }
    }
    let index_result = index_writer.commit();
    match index_result {
        Ok(docstamp) => {
            info!("Commit succeed, docstamp at {}", docstamp);
            // info!("Waiting for merging threads");
            index_writer.wait_merging_threads().map_err(|e| {
                Error::new(ErrorKind::Other, format!("wait_merging_threads: {}", e))
            })?;
        }
        Err(e) => {
            index_writer.rollback().unwrap();
            return Err(Error::new(
                ErrorKind::Other,
                format!("add_index index_writer rollback: {}", e),
            ));
        }
    }
    Ok(())
}
```
my meta.json:
```json
{
  "index_settings": {
    "docstore_compression": "lz4"
  },
  "segments": [
    {
      "segment_id": "0c310ae1-2e62-46da-9c7f-423e7e4e1472",
      "max_doc": 8437,
      "deletes": null
    },
    ....
    {
      "segment_id": "e95114e1-9971-4ed9-ac63-dd8d62f80d5e",
      "max_doc": 1,
      "deletes": null
    }
  ],
  "schema": [
    {
      "name": "id",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "book_id",
      "type": "text",
      "options": {
        "indexing": {
          "record": "basic",
          "tokenizer": "raw"
        },
        "stored": true
      }
    },
    {
      "name": "chapter",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "section",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "h",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "text",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "jieba"
        },
        "stored": true
      }
    }
  ],
  "opstamp": 18129698
}
```
I'm actually mostly interested in the elided `....` part. How many segment ids do you have in there?
The total number of segment ids is 3005. I cannot show the detail because of:
You can't comment at this time — your comment is too long (maximum is 65536 characters).
I have uploaded it to Gitter.
Also, adding to the index is a bit slow: from the time I opened this issue until now, only 20672938 documents have been committed.
2021-11-13T09:43:34.712509+08:00 INFO tantivy_server::index::add - Commit succeed, docstamp at 20672938
It took 3 hours and 8 minutes to commit 19,999,297 records.
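For reference, the figures above work out to roughly 1,770 documents per second (simple arithmetic on the reported numbers):

```rust
// Throughput implied by the figures above:
// 19,999,297 docs committed in 3 h 8 min.
fn docs_per_second(docs: u64, secs: u64) -> u64 {
    docs / secs
}

fn main() {
    let secs = 3 * 3600 + 8 * 60; // 11,280 seconds
    let rate = docs_per_second(19_999_297, secs);
    println!("~{} docs/sec", rate); // roughly 1,772 docs/sec
}
```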
Hi @fulmicoton, I have another serious problem here.
I indexed the data in the `book_content` table of my MySQL database one row at a time through tantivy. When the job was just a little more than half done, the program suddenly terminated due to lack of disk space; the index files had reached 85 GB, while the `book_content.ibd` file in the MySQL directory is only 272.6 MB.
Please, is there a recommended best-practice solution?
Same problem. You have too many segments because you don't let tantivy do any merging.
The index is too large for this reason.
Hi @fulmicoton, I performed the merge operation additionally at your suggestion, but it did not reduce the space much. I'm sorry, I'm not familiar with the underlying principles of the search engine; am I doing something wrong? I would like to ask for your help again; sorry to take up your valuable time.
My merge operation is done with reference to tantivy-cli's merge.rs.
You should have < 20 segments in your meta.json. Similarly, you should have < 20 `.idx` files in your directory.
Can you check both?
To make it clear from the previous discussion, the following code is your problem: https://github.com/carmel/tantivy-server/blob/22b5c0acb02cb70c0cdc91700d69981a64fda881/src/index/add.rs#L17-L68
You should not create (and then drop) an index writer every time you add a doc.
I have 18,199 segment_ids and 18,206 `.idx` files. (One `.idx` file is actually over 1 GB.)
Do you mean that for every index writer created, a segment is created after commit?
Yes, exactly. One segment is created after each commit. The index writer has a mechanism to merge segments in the background, but it needs to stay alive to do that.
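The effect described here can be sketched with a toy model: every commit seals one segment, so a fresh writer per batch accumulates segments forever, while a single long-lived writer gives the background merger a chance to fold them back down. (This is a simulation of the behaviour, not tantivy's actual code; the merge factor of 8 below is hypothetical.)

```rust
// Toy model of the behaviour described above.

// One new writer (and one commit) per batch: one segment each, and the
// writer is dropped before any background merge can run.
fn segments_writer_per_batch(batches: usize) -> usize {
    batches
}

// One long-lived writer: commits still create segments, but the
// background merger repeatedly folds `merge_factor` segments into one.
fn segments_single_writer(batches: usize, merge_factor: usize) -> usize {
    let mut segments = batches;
    while segments > merge_factor {
        // each pass folds groups of `merge_factor` segments into one
        segments = segments / merge_factor + segments % merge_factor;
    }
    segments
}

fn main() {
    // 18,199 batches committed one by one, as in the issue:
    assert_eq!(segments_writer_per_batch(18_199), 18_199);
    // with a hypothetical merge factor of 8, the count stays tiny:
    let merged = segments_single_writer(18_199, 8);
    assert!(merged < 20);
    println!("single long-lived writer ends with {} segments", merged);
}
```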
@carmel can we close this issue?
Okay, I haven't had a chance to try it yet, so let's close it for now.
Great, after modifying the code as you suggested, the problem has been solved. The space usage has gone from 85 GB down to 254 MB. Thanks again!
@carmel great! Thank you for reporting the conclusion!