quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Why is the CPU usage so high, any suggestions for optimization? #1203

Closed. carmel closed this issue 2 years ago.

carmel commented 2 years ago

2021-11-13T06:35:32.923850+08:00 INFO tantivy_server::index::add - Commit succeed, docstamp at 673641

fulmicoton commented 2 years ago

Index on a single thread?

You can choose the number of threads when you create the index writer.
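
For example, a minimal sketch (the thread count and heap size are illustrative; the heap budget is shared across the indexing threads):

use tantivy::{Index, IndexWriter};

// Sketch: pick the number of indexing threads explicitly.
fn make_writer(index: &Index) -> tantivy::Result<IndexWriter> {
    // 4 threads sharing a 200 MB heap budget.
    index.writer_with_num_threads(4, 200 * 1024 * 1024)
}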

fulmicoton commented 2 years ago

Also you forgot to link your project. It is very hard to answer without that.

carmel commented 2 years ago

Here is my repo. I've tried configuring the number of threads via parameters, but the result is the same:

https://github.com/carmel/tantivy-server/blob/main/src/index/mod.rs

fn get_index_writer(index: &Index) -> Result<IndexWriter> {
    index
        // .writer_with_num_threads(
        //     CONF.index.thread_num,
        //     CONF.index.total_heap_size * 1024 * 1024,
        // )
        // NB: .writer() lets tantivy choose the thread count itself
        // (min(num_cpus, 8) at the time).
        .writer(CONF.index.total_heap_size * 1024 * 1024)
        .map_err(|e| {
            Error::new(
                ErrorKind::Other,
                format!("Index writer_with_num_threads: {}", e),
            )
        })
}
carmel commented 2 years ago

Also, other errors are reported here:

2021-11-13T07:01:20.863959+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8205.
2021-11-13T07:01:20.944466+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8259.
2021-11-13T07:01:21.058333+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8213.
2021-11-13T07:01:21.186539+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8220.
2021-11-13T07:01:21.202291+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8191.
2021-11-13T07:01:21.455724+08:00 INFO tantivy::indexer::index_writer - Buffer limit reached, flushing segment with maxdoc=8309.
2021-11-13T07:01:34.879204+08:00 INFO tantivy::indexer::index_writer - Preparing commit
2021-11-13T07:01:41.382270+08:00 INFO tantivy::indexer::index_writer - Prepared commit 3362920
2021-11-13T07:01:41.382391+08:00 INFO tantivy::indexer::prepared_commit - committing 3362920
2021-11-13T07:01:41.386730+08:00 INFO tantivy::indexer::segment_updater - save metas
2021-11-13T07:01:41.418678+08:00 INFO tantivy::indexer::segment_updater - Running garbage collection
2021-11-13T07:01:41.418829+08:00 INFO tantivy::directory::managed_directory - Garbage collect
2021-11-13T07:01:41.458176+08:00 INFO tantivy_server::index::add - Commit succeed, docstamp at 3362920
fmassot commented 2 years ago

These are not errors; they are INFO logs.

The Buffer limit reached message tells you that an in-memory segment (holding the indicated number of documents) has grown up to the memory budget you gave your indexer; in the code the condition is written like this: if mem_usage >= memory_budget - MARGIN_IN_BYTES. This message is totally fine.
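
Paraphrased as a standalone sketch (the margin value is assumed from that era's index_writer.rs, not quoted from it):

// Not tantivy's exact code; a paraphrase of the flush condition.
const MARGIN_IN_BYTES: usize = 1_000_000; // assumed ~1 MB safety margin

fn should_flush(mem_usage: usize, memory_budget: usize) -> bool {
    // Flush the in-memory segment shortly before the budget is exhausted.
    mem_usage >= memory_budget - MARGIN_IN_BYTES
}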

But the number of documents per segment is quite low, so you must have either a low memory budget or huge documents.

Can you also print the number of threads and the memory budget in your logs? Thanks!

carmel commented 2 years ago

Ok, the variable total_heap_size in the code below is 50 MB. As you can see, I did not set the number of threads in the current example:

index
        // .writer_with_num_threads(
        //     CONF.index.thread_num,
        //     CONF.index.total_heap_size * 1024 * 1024,
        // )
        .writer(CONF.index.total_heap_size * 1024 * 1024)
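
For scale, a back-of-the-envelope check, assuming .writer() defaults to min(num_cpus, 8) indexing threads and splits the heap budget evenly among them:

// Rough arithmetic under the stated assumptions, not tantivy's exact accounting.
fn main() {
    let total_heap_mb = 50; // CONF.index.total_heap_size above
    let threads = 8;        // assumed default: min(num_cpus, 8)
    let per_thread_mb = total_heap_mb / threads; // ~6 MB per indexing thread
    // Each thread flushes once its segment nears this small budget,
    // which matches the ~8,200-document segments in the logs above.
    println!("per-thread budget ≈ {} MB", per_thread_mb);
}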
fulmicoton commented 2 years ago

I'm not sure how you end up with > 100% CPU here.

Do you have more than one indexer? Or maybe you commit after adding every single doc and there are a lot of merge threads working? Can you share your meta.json file?

carmel commented 2 years ago

I only created two indexes.

My add_index function:

pub fn add_index(index_json: &str) -> Result<()> {
    let json_index = serde_json::from_str::<IndexData>(index_json)?;

    let index = get_index(json_index.index)?;

    index
        .tokenizers()
        .register("jieba", jieba_tokenizer::JiebaTokenizer {});

    let schema = index.schema();

    let schema_clone = schema.clone();

    let mut index_writer = get_index_writer(&index)?;
    if CONF.index.is_merge {
        // Note: NoMergePolicy disables tantivy's background merging entirely.
        index_writer.set_merge_policy(Box::new(NoMergePolicy));
    }
    for m in json_index.data {
        let data = serde_json::to_string(&m)?;
        match schema_clone.parse_document(&data) {
            Ok(doc) => {
                index_writer.add_document(doc);
            }
            Err(e) => {
                // index_writer.rollback();
                return Err(Error::new(
                    ErrorKind::Other,
                    format!("DocParsingError: {}", e),
                ));
            }
        }
    }
    let index_result = index_writer.commit();

    match index_result {
        Ok(docstamp) => {
            info!("Commit succeed, docstamp at {}", docstamp);
            // info!("Waiting for merging threads");
            index_writer.wait_merging_threads().map_err(|e| {
                Error::new(ErrorKind::Other, format!("wait_merging_threads: {}", e))
            })?;
        }
        Err(e) => {
            index_writer.rollback().unwrap();
            return Err(Error::new(
                ErrorKind::Other,
                format!("add_index index_writer rollback: {}", e),
            ));
        }
    }
    Ok(())
}

My meta.json:

{
  "index_settings": {
    "docstore_compression": "lz4"
  },
  "segments": [
    {
      "segment_id": "0c310ae1-2e62-46da-9c7f-423e7e4e1472",
      "max_doc": 8437,
      "deletes": null
    },
    ....
    {
      "segment_id": "e95114e1-9971-4ed9-ac63-dd8d62f80d5e",
      "max_doc": 1,
      "deletes": null
    }
  ],
  "schema": [
    {
      "name": "id",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "book_id",
      "type": "text",
      "options": {
        "indexing": {
          "record": "basic",
          "tokenizer": "raw"
        },
        "stored": true
      }
    },
    {
      "name": "chapter",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "section",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "h",
      "type": "u64",
      "options": {
        "indexed": true,
        "stored": true
      }
    },
    {
      "name": "text",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "jieba"
        },
        "stored": true
      }
    }
  ],
  "opstamp": 18129698
}
fulmicoton commented 2 years ago

I'm actually mostly interested in the .... part of the meta.json. How many segment_ids do you have in there?

carmel commented 2 years ago

The total segment_id count is 3005. I cannot show the full list because of:

You can't comment at this time — your comment is too long (maximum is 65536 characters).

I have uploaded it to Gitter.

carmel commented 2 years ago

Also, the indexing throughput is a bit slow: from the time I opened this issue until now, the docstamp has only reached 20672938.

2021-11-13T09:43:34.712509+08:00 INFO tantivy_server::index::add - Commit succeed, docstamp at 20672938

It took 3 hours and 8 minutes to commit 19,999,297 records.

carmel commented 2 years ago

Hi @fulmicoton, I have another serious problem.

I indexed the data from the book_content table in my MySQL database row by row through tantivy. When a little more than half of the data was done, the program suddenly terminated due to lack of disk space: the index files had reached 85 GB, while the book_content.ibd file in the MySQL data directory is only 272.6 MB.

Is there a recommended best-practice solution?

fulmicoton commented 2 years ago

Same problem: you have too many segments because you don't let tantivy do any merging.

The index is too large for this reason.
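
One concrete thing to check, judging from the add_index snippet above: set_merge_policy(Box::new(NoMergePolicy)) turns background merging off entirely. A minimal sketch of a writer that keeps the default merge policy (LogMergePolicy), with an illustrative heap size:

use tantivy::{Index, IndexWriter};

// Sketch: do NOT override the merge policy; the default (LogMergePolicy)
// merges small segments in the background while the writer is alive.
fn writer_with_default_merging(index: &Index) -> tantivy::Result<IndexWriter> {
    let writer = index.writer(50 * 1024 * 1024)?; // 50 MB, illustrative
    // writer.set_merge_policy(Box::new(NoMergePolicy)); // <- would disable merging
    Ok(writer)
}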

carmel commented 2 years ago

Hi @fulmicoton, I performed the merge operation separately, as you suggested, but the space usage is still not reduced by much. I'm sorry, I'm not familiar with the underlying principles of search engines; am I doing something wrong? I would appreciate your help again, and I'm sorry to take up your valuable time.

My merge operation is written with reference to tantivy-cli's merge.rs.

fulmicoton commented 2 years ago

You should have < 20 segments in your meta.json, and similarly < 20 ".idx" files in your directory.

Can you check both?

fulmicoton commented 2 years ago

To make the previous discussion clear: the following code is your problem... https://github.com/carmel/tantivy-server/blob/22b5c0acb02cb70c0cdc91700d69981a64fda881/src/index/add.rs#L17-L68

You should not create (and then drop) an index writer every time you add a doc.
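
A minimal sketch of the intended lifecycle (0.16-era API assumed; names and numbers are illustrative, not the exact fix): create one long-lived writer, feed all documents through it, commit per batch, and only wait for merging threads at shutdown.

use tantivy::{Document, Index, IndexWriter};

// Sketch: one writer reused across all batches.
fn index_all(index: &Index, batches: Vec<Vec<Document>>) -> tantivy::Result<()> {
    // Create the writer ONCE; its merge threads live as long as it does.
    let mut writer: IndexWriter = index.writer(200 * 1024 * 1024)?;
    for batch in batches {
        for doc in batch {
            writer.add_document(doc); // returns an opstamp in this era
        }
        writer.commit()?; // each commit produces one new segment
    }
    // Only at shutdown: block until background merges are finished.
    writer.wait_merging_threads()?;
    Ok(())
}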

carmel commented 2 years ago

I have 18,199 segment_ids and 18,206 .idx files. (One .idx file is actually over 1 GB.)

Do you mean that for every index writer created, a segment is created after commit?

fulmicoton commented 2 years ago

Yes, exactly. One segment is created after each commit. The index writer has a mechanism to merge segments in the background, but it needs to stay alive to do that.

fulmicoton commented 2 years ago

@carmel can we close this issue?

carmel commented 2 years ago

Okay, I haven't had a chance to try it yet, so let's close it for now.

carmel commented 2 years ago

Great, after modifying the code as you suggested, the problem has been solved. The index size went from 85 GB to 254 MB. Thanks again!

fulmicoton commented 2 years ago

@carmel great! Thank you for reporting the conclusion!