quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Indexing an array of small pieces of text is much slower than indexing a big string #1654

Open fmassot opened 1 year ago

fmassot commented 1 year ago

I don't know yet whether there is room for optimization here, but the performance drop is significant: on a dataset whose documents contain a big array of small text values, I observed a 25% drop in indexing throughput.

I pushed a file hdfs-with-array.json (the hdfs JSON with the body turned into an array of strings) to a branch to reproduce the performance drop in the bench. Here are the results:

Benchmarking index-hdfs/index-hdfs-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 11.4s, or reduce sample count to 10
index-hdfs/index-hdfs-no-commit
                        time:   [538.24 ms 541.50 ms 545.32 ms]
                        change: [-3.6931% -2.4507% -1.2364%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 17.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
                        time:   [879.80 ms 883.73 ms 888.34 ms]
                        change: [-2.6182% -1.1315% +0.0697%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)

Here is the branch: https://github.com/quickwit-oss/tantivy/tree/fmassot--bench-hdfs-with-array
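For intuition, the shape difference can be mocked without tantivy (the code below is illustrative plain Rust, not the tantivy API): the same raw text is stored either as one big string or as many small values. A tokenizer sees the same tokens either way, but the multi-valued shape pays per-value setup costs: one heap-allocated String per piece, plus the Vec itself.

```rust
// Mock of the two document shapes from the benchmark: identical raw text,
// but the array variant stores it as many small heap-allocated strings.
fn tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    let big = "block received exception while serving".to_string();
    // Same content split into one value per word: one extra String
    // (plus the Vec) per piece, each of which must be allocated and freed.
    let pieces: Vec<String> = big.split_whitespace().map(str::to_string).collect();

    let tokens_big: Vec<&str> = tokens(&big);
    let tokens_pieces: Vec<&str> = pieces.iter().flat_map(|p| tokens(p)).collect();

    // The index sees the same tokens either way...
    assert_eq!(tokens_big, tokens_pieces);
    // ...but the array shape carries 5 field values instead of 1.
    println!("values: 1 vs {}", pieces.len());
}
```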

PSeitz commented 1 year ago

here's a diff on the profile.

There's some noise, but we can see the alloc overhead for the array (array itself + multiple strings), slower deserialization, slower value_from_json.

9.61% +4.74% libc.so.6 [.] 0x000000000005a0d9 could be another alloc call, e.g. calloc.

Check IO/CPU overload!

# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object                 Symbol                                                                                                                                                                                                                          >
# ........  .........  ............................  ................................................................................................................................................................................................................................>
#
    24.16%     -9.51%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::flavors::array::Channel<T>::send
     9.61%     +4.74%  libc.so.6                     [.] 0x000000000005a0d9
     0.87%     +3.53%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::context::Context::wait_until
    13.41%     -2.98%  index_bench-075aa16d259e3fd0  [.] <tantivy::postings::postings_writer::SpecializedPostingsWriter<Rec> as tantivy::postings::postings_writer::PostingsWriter>::subscribe
     9.31%     -2.32%  index_bench-075aa16d259e3fd0  [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
     0.14%     +2.01%  index_bench-075aa16d259e3fd0  [.] <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
     3.24%     +1.57%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.96%     +1.27%  libc.so.6                     [.] malloc
     4.85%     -1.20%  libm.so.6                     [.] 0x0000000000067970
     0.38%     +1.10%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::field_type::FieldType::value_from_json
     0.36%     +1.06%  libc.so.6                     [.] cfree
     3.48%     -0.98%  index_bench-075aa16d259e3fd0  [.] tantivy::postings::stacker::expull::ExpUnrolledLinkedListWriter::write_u32_vint
     1.14%     +0.95%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::schema::Schema::json_object_to_doc
     3.15%     -0.88%  index_bench-075aa16d259e3fd0  [.] murmurhash32::murmurhash2::murmurhash2
     0.62%     +0.63%  index_bench-075aa16d259e3fd0  [.] alloc::slice::merge_sort
     1.83%     -0.46%  libm.so.6                     [.] exp
     1.17%     -0.43%  index_bench-075aa16d259e3fd0  [.] serde_json::de::from_str
     1.59%     -0.34%  index_bench-075aa16d259e3fd0  [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str
     2.31%     -0.31%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::lower_caser::LowerCaserTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.14%     +0.31%  index_bench-075aa16d259e3fd0  [.] core::hash::BuildHasher::hash_one
     0.09%     +0.30%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenizer as tantivy::tokenizer::tokenizer::Tokenizer>::token_stream
     2.49%     -0.29%  index_bench-075aa16d259e3fd0  [.] core::str::iter::SplitInternal<P>::next
     0.92%     -0.25%  index_bench-075aa16d259e3fd0  [.] tantivy_common::vint::serialize_vint_u32
     0.05%     +0.24%  index_bench-075aa16d259e3fd0  [.] itertools::groupbylazy::GroupBy<K,I,F>::step
fmassot commented 1 year ago

Thanks for the report.

I moved the JSON parsing and document building out of the benchmark, and I still see almost a factor of two between array vs. no array (I updated the branch).

index-hdfs/index-hdfs-no-commit
                        time:   [467.03 ms 468.46 ms 470.16 ms]
                        change: [-53.083% -52.861% -52.630%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 16.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
                        time:   [810.60 ms 814.71 ms 819.59 ms]
                        change: [-57.361% -57.112% -56.796%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
fulmicoton commented 1 year ago

@trinity-1686a can you have a look at this?

trinity-1686a commented 1 year ago

I've slightly modified the bench so that it does not include cloning the Vec<Document> or getting an index writer.

index-hdfs/index-hdfs-no-commit                        
                        time:   [5.8195 s 5.9066 s 5.9877 s]
index-hdfs/index-hdfs-with-array-no-commit                        
                        time:   [8.3026 s 8.3140 s 8.3286 s]

ratio = x1.40

perf samples:

index-hdfs/index-hdfs-no-commit                        
                        4981
index-hdfs/index-hdfs-with-array-no-commit                        
                        6904

ratio = 1.39 delta=1923

Most of the delta is in a handful of symbols. They collectively account for 1589 of the 1923 extra samples, or 82% of the difference, 66% of it being allocation/free.

So the TL;DR is that allocation/free-related operations become ~10x slower when we multiply the number of field values by 9.2 while keeping the same amount of raw content. That takes allocation from negligible to ~20% of the time spent. I guess the next step will be to figure out a way to reuse things instead of dropping and creating new ones.
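The reuse idea can be sketched in plain Rust (a generic pattern, not tantivy code): `String::clear` resets the length but keeps the backing allocation, so one buffer reused across field values allocates once instead of once per value.

```rust
// Generic buffer-reuse pattern: clear() drops the content but keeps the
// capacity, so repeated use of one String avoids a malloc/free per value.
fn process_values(values: &[&str]) -> usize {
    let mut buf = String::new(); // single buffer, reused for every value
    let mut total_len = 0;
    for v in values {
        buf.clear();             // resets length, retains the allocation
        buf.push_str(v);
        buf.make_ascii_lowercase();
        total_len += buf.len();
    }
    total_len
}

fn main() {
    let n = process_values(&["Foo", "BAR", "baz"]);
    assert_eq!(n, 9);
    println!("processed {n} bytes with one reusable buffer");
}
```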

PSeitz commented 1 year ago
        for token_filter in &self.token_filters {
            token_stream = token_filter.transform(token_stream);
        }

SimpleTokenizer uses a default Token, whose default text capacity may be too large:

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
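The cost of that Default impl is one 200-byte String allocation per text value. Here is a minimal mock (using simplified types, not tantivy's) of the idea behind keeping one shared Token on the tokenizer: clear it between values instead of constructing a fresh one each time.

```rust
// Mock Token with the same eagerly allocated 200-byte text buffer.
struct Token {
    text: String,
    position: usize,
}

impl Default for Token {
    fn default() -> Token {
        Token { text: String::with_capacity(200), position: usize::MAX }
    }
}

impl Token {
    // Reset for the next text value without giving back the allocation.
    fn reset(&mut self) {
        self.text.clear();
        self.position = usize::MAX;
    }
}

fn main() {
    // Shared token owned by the "tokenizer": allocated once...
    let mut token = Token::default();
    for text in ["small", "pieces", "of", "text"] {
        token.reset(); // ...and reused per value, instead of Token::default()
        token.text.push_str(text);
    }
    assert!(token.text.capacity() >= 200); // original allocation retained
    println!("last token: {}", token.text);
}
```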
trinity-1686a commented 1 year ago

Interestingly, halving the default text capacity makes both benchmarks slightly slower on my system (though possibly within the margin of error). This might be very allocator-dependent; I guess mine is roughly constant-time with respect to allocation size.

PSeitz commented 1 year ago

https://github.com/quickwit-oss/tantivy/pull/2062 removes the Token::default creation for each text value encountered, by keeping a shared Token on the tokenizer.

index-hdfs/index-hdfs-no-commit
                        time:   [644.80 ms 648.41 ms 652.51 ms]
                        change: [-4.1369% -2.7259% -1.3721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)

index-hdfs/index-hdfs-with-array-no-commit
                        time:   [800.76 ms 803.53 ms 808.09 ms]
                        change: [-5.1855% -4.7265% -4.0817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)

Ideally the allocation for the BoxTokenStream per text would also be removed, but this may require some lifetime tricks.
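One general way to drop a per-call Box allocation in Rust (a sketch of the pattern, not necessarily how tantivy resolved it): return a concrete or `impl Trait` stream so it lives on the stack, with its lifetime tied to the borrowed input text.

```rust
// Returning `impl Iterator` instead of a boxed trait object keeps the
// token stream on the stack; the stream borrows `text` for lifetime 'a.
fn token_stream<'a>(text: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    text.split_whitespace()
}

fn main() {
    // No heap allocation for the stream itself, unlike Box<dyn Iterator>.
    let tokens: Vec<&str> = token_stream("a b c").collect();
    assert_eq!(tokens, ["a", "b", "c"]);
}
```

The trade-off is that every caller is now tied to one concrete stream type; boxing is what lets a tokenizer chain heterogeneous filters behind a single type, which is why removing it "may require some lifetime tricks".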

trinity-1686a commented 1 year ago

after #2062, the main differences I see left in flamegraphs are:

trinity-1686a commented 1 year ago

There has been substantial progress on this issue. It's still somewhat slower, but a lot less so than before. I'm unlikely to work more on this right now, so I'm unassigning myself.