quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Indexing an array of small pieces of text is much slower than indexing a big string #1654

Open fmassot opened 1 year ago

fmassot commented 1 year ago

I don't know yet whether there is room for optimization here, but the performance drop is significant: on a dataset whose documents contain a big array of small text values, I observed a 25% drop in indexing throughput.

I pushed a file hdfs-with-array.json (the hdfs JSON with the body turned into an array of strings) to a branch to reproduce the performance drop in the bench. Here are the results:

Benchmarking index-hdfs/index-hdfs-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 11.4s, or reduce sample count to 10
index-hdfs/index-hdfs-no-commit
                        time:   [538.24 ms 541.50 ms 545.32 ms]
                        change: [-3.6931% -2.4507% -1.2364%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 17.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
                        time:   [879.80 ms 883.73 ms 888.34 ms]
                        change: [-2.6182% -1.1315% +0.0697%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)

Here is the branch: https://github.com/quickwit-oss/tantivy/tree/fmassot--bench-hdfs-with-array
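For intuition, the shape difference can be mocked without tantivy (the code below is illustrative plain Rust, not the tantivy API): the same raw text is stored either as one big string or as many small values. A tokenizer sees the same tokens either way, but the multi-valued shape pays per-value setup costs: one heap-allocated String per piece, plus the Vec itself.

```rust
// Mock of the two document shapes from the benchmark: identical raw text,
// but the array variant stores it as many small heap-allocated strings.
fn tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    let big = "block received exception while serving".to_string();
    // Same content split into one value per word: one extra String
    // (plus the Vec) per piece, each of which must be allocated and freed.
    let pieces: Vec<String> = big.split_whitespace().map(str::to_string).collect();

    let tokens_big: Vec<&str> = tokens(&big);
    let tokens_pieces: Vec<&str> = pieces.iter().flat_map(|p| tokens(p)).collect();

    // The index sees the same tokens either way...
    assert_eq!(tokens_big, tokens_pieces);
    // ...but the array shape carries 5 field values instead of 1.
    println!("values: 1 vs {}", pieces.len());
}
```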

PSeitz commented 1 year ago

here's a diff on the profile.

There's some noise, but we can see the alloc overhead for the array (array itself + multiple strings), slower deserialization, slower value_from_json.

9.61% +4.74% libc.so.6 [.] 0x000000000005a0d9 could be another alloc call, e.g. calloc.

Check IO/CPU overload!

# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object                 Symbol                                                                                                                                                                                                                          >
# ........  .........  ............................  ................................................................................................................................................................................................................................>
#
    24.16%     -9.51%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::flavors::array::Channel<T>::send
     9.61%     +4.74%  libc.so.6                     [.] 0x000000000005a0d9
     0.87%     +3.53%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::context::Context::wait_until
    13.41%     -2.98%  index_bench-075aa16d259e3fd0  [.] <tantivy::postings::postings_writer::SpecializedPostingsWriter<Rec> as tantivy::postings::postings_writer::PostingsWriter>::subscribe
     9.31%     -2.32%  index_bench-075aa16d259e3fd0  [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
     0.14%     +2.01%  index_bench-075aa16d259e3fd0  [.] <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
     3.24%     +1.57%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.96%     +1.27%  libc.so.6                     [.] malloc
     4.85%     -1.20%  libm.so.6                     [.] 0x0000000000067970
     0.38%     +1.10%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::field_type::FieldType::value_from_json
     0.36%     +1.06%  libc.so.6                     [.] cfree
     3.48%     -0.98%  index_bench-075aa16d259e3fd0  [.] tantivy::postings::stacker::expull::ExpUnrolledLinkedListWriter::write_u32_vint
     1.14%     +0.95%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::schema::Schema::json_object_to_doc
     3.15%     -0.88%  index_bench-075aa16d259e3fd0  [.] murmurhash32::murmurhash2::murmurhash2
     0.62%     +0.63%  index_bench-075aa16d259e3fd0  [.] alloc::slice::merge_sort
     1.83%     -0.46%  libm.so.6                     [.] exp
     1.17%     -0.43%  index_bench-075aa16d259e3fd0  [.] serde_json::de::from_str
     1.59%     -0.34%  index_bench-075aa16d259e3fd0  [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str
     2.31%     -0.31%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::lower_caser::LowerCaserTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.14%     +0.31%  index_bench-075aa16d259e3fd0  [.] core::hash::BuildHasher::hash_one
     0.09%     +0.30%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenizer as tantivy::tokenizer::tokenizer::Tokenizer>::token_stream
     2.49%     -0.29%  index_bench-075aa16d259e3fd0  [.] core::str::iter::SplitInternal<P>::next
     0.92%     -0.25%  index_bench-075aa16d259e3fd0  [.] tantivy_common::vint::serialize_vint_u32
     0.05%     +0.24%  index_bench-075aa16d259e3fd0  [.] itertools::groupbylazy::GroupBy<K,I,F>::step
fmassot commented 1 year ago

Thanks for the report.

I moved the JSON parsing and document building out of the benchmark, and I still see almost a factor of two between array vs. no array (I updated the branch).

index-hdfs/index-hdfs-no-commit
                        time:   [467.03 ms 468.46 ms 470.16 ms]
                        change: [-53.083% -52.861% -52.630%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 16.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
                        time:   [810.60 ms 814.71 ms 819.59 ms]
                        change: [-57.361% -57.112% -56.796%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
fulmicoton commented 1 year ago

@trinity-1686a can you have a look at this?

trinity-1686a commented 1 year ago

I've slightly modified the bench so that it does not include cloning the Vec<Document> or getting an index writer.

index-hdfs/index-hdfs-no-commit                        
                        time:   [5.8195 s 5.9066 s 5.9877 s]
index-hdfs/index-hdfs-with-array-no-commit                        
                        time:   [8.3026 s 8.3140 s 8.3286 s]

ratio = x1.40

perf samples:

index-hdfs/index-hdfs-no-commit                        
                        4981
index-hdfs/index-hdfs-with-array-no-commit                        
                        6904

ratio = 1.39 delta=1923

Most of the delta is in a handful of symbols. They collectively account for 1589 of the 1923 extra samples, or 82% of the difference, 66% of it being allocation/free.

So the TL;DR is that allocation/free-related operations become ~10x slower when we multiply the number of field values by 9.2 while keeping the same amount of raw content. That takes allocation from negligible to ~20% of the time spent. I guess the next step will be to figure out a way to reuse things instead of dropping and creating new ones.
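The reuse idea can be sketched in plain Rust (a generic pattern, not tantivy code): `String::clear` resets the length but keeps the backing allocation, so one buffer reused across field values allocates once instead of once per value.

```rust
// Generic buffer-reuse pattern: clear() drops the content but keeps the
// capacity, so repeated use of one String avoids a malloc/free per value.
fn process_values(values: &[&str]) -> usize {
    let mut buf = String::new(); // single buffer, reused for every value
    let mut total_len = 0;
    for v in values {
        buf.clear();             // resets length, retains the allocation
        buf.push_str(v);
        buf.make_ascii_lowercase();
        total_len += buf.len();
    }
    total_len
}

fn main() {
    let n = process_values(&["Foo", "BAR", "baz"]);
    assert_eq!(n, 9);
    println!("processed {n} bytes with one reusable buffer");
}
```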

PSeitz commented 1 year ago
        for token_filter in &self.token_filters {
            token_stream = token_filter.transform(token_stream);
        }

SimpleTokenizer uses a default Token, whose default text capacity may be too large:

impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
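The cost of that Default impl is one 200-byte String allocation per text value. Here is a minimal mock (using simplified types, not tantivy's) of the idea behind keeping one shared Token on the tokenizer: clear it between values instead of constructing a fresh one each time.

```rust
// Mock Token with the same eagerly allocated 200-byte text buffer.
struct Token {
    text: String,
    position: usize,
}

impl Default for Token {
    fn default() -> Token {
        Token { text: String::with_capacity(200), position: usize::MAX }
    }
}

impl Token {
    // Reset for the next text value without giving back the allocation.
    fn reset(&mut self) {
        self.text.clear();
        self.position = usize::MAX;
    }
}

fn main() {
    // Shared token owned by the "tokenizer": allocated once...
    let mut token = Token::default();
    for text in ["small", "pieces", "of", "text"] {
        token.reset(); // ...and reused per value, instead of Token::default()
        token.text.push_str(text);
    }
    assert!(token.text.capacity() >= 200); // original allocation retained
    println!("last token: {}", token.text);
}
```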
trinity-1686a commented 1 year ago

Interestingly, halving the default text capacity makes both benchmarks slightly slower on my system (though possibly within the margin of error). This might be very allocator-dependent; I guess mine is roughly constant-time with respect to allocation size.

PSeitz commented 1 year ago

https://github.com/quickwit-oss/tantivy/pull/2062 removes the Token::default creation for each text value encountered, by keeping a shared Token on the tokenizer.

index-hdfs/index-hdfs-no-commit
                        time:   [644.80 ms 648.41 ms 652.51 ms]
                        change: [-4.1369% -2.7259% -1.3721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)

index-hdfs/index-hdfs-with-array-no-commit
                        time:   [800.76 ms 803.53 ms 808.09 ms]
                        change: [-5.1855% -4.7265% -4.0817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)

Ideally the allocation for the BoxTokenStream per text would also be removed, but this may require some lifetime tricks.
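One general way to drop a per-call Box allocation in Rust (a sketch of the pattern, not necessarily how tantivy resolved it): return a concrete or `impl Trait` stream so it lives on the stack, with its lifetime tied to the borrowed input text.

```rust
// Returning `impl Iterator` instead of a boxed trait object keeps the
// token stream on the stack; the stream borrows `text` for lifetime 'a.
fn token_stream<'a>(text: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    text.split_whitespace()
}

fn main() {
    // No heap allocation for the stream itself, unlike Box<dyn Iterator>.
    let tokens: Vec<&str> = token_stream("a b c").collect();
    assert_eq!(tokens, ["a", "b", "c"]);
}
```

The trade-off is that every caller is now tied to one concrete stream type; boxing is what lets a tokenizer chain heterogeneous filters behind a single type, which is why removing it "may require some lifetime tricks".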

trinity-1686a commented 1 year ago

after #2062, the main differences I see left in flamegraphs are:

trinity-1686a commented 1 year ago

There has been substantial progress on this issue. It's still somewhat slower, but a lot less so than before. I'm unlikely to work more on this right now, so I'm unassigning myself.