fmassot opened 1 year ago
Here's a diff of the profiles. There's some noise, but we can see the allocation overhead for the array (the array itself plus multiple strings), slower deserialization, and slower `value_from_json`.

```
9.61% +4.74% libc.so.6 [.] 0x000000000005a0d9
```

This could be another alloc call, e.g. `calloc`.
Check IO/CPU overload!
```
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object                 Symbol
# ........  .........  ............................  ..................................................
#
    24.16%     -9.51%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::flavors::array::Channel<T>::send
     9.61%     +4.74%  libc.so.6                     [.] 0x000000000005a0d9
     0.87%     +3.53%  index_bench-075aa16d259e3fd0  [.] crossbeam_channel::context::Context::wait_until
    13.41%     -2.98%  index_bench-075aa16d259e3fd0  [.] <tantivy::postings::postings_writer::SpecializedPostingsWriter<Rec> as tantivy::postings::postings_writer::PostingsWriter>::subscribe
     9.31%     -2.32%  index_bench-075aa16d259e3fd0  [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
     0.14%     +2.01%  index_bench-075aa16d259e3fd0  [.] <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
     3.24%     +1.57%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.96%     +1.27%  libc.so.6                     [.] malloc
     4.85%     -1.20%  libm.so.6                     [.] 0x0000000000067970
     0.38%     +1.10%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::field_type::FieldType::value_from_json
     0.36%     +1.06%  libc.so.6                     [.] cfree
     3.48%     -0.98%  index_bench-075aa16d259e3fd0  [.] tantivy::postings::stacker::expull::ExpUnrolledLinkedListWriter::write_u32_vint
     1.14%     +0.95%  index_bench-075aa16d259e3fd0  [.] tantivy::schema::schema::Schema::json_object_to_doc
     3.15%     -0.88%  index_bench-075aa16d259e3fd0  [.] murmurhash32::murmurhash2::murmurhash2
     0.62%     +0.63%  index_bench-075aa16d259e3fd0  [.] alloc::slice::merge_sort
     1.83%     -0.46%  libm.so.6                     [.] exp
     1.17%     -0.43%  index_bench-075aa16d259e3fd0  [.] serde_json::de::from_str
     1.59%     -0.34%  index_bench-075aa16d259e3fd0  [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str
     2.31%     -0.31%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::lower_caser::LowerCaserTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
     0.14%     +0.31%  index_bench-075aa16d259e3fd0  [.] core::hash::BuildHasher::hash_one
     0.09%     +0.30%  index_bench-075aa16d259e3fd0  [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenizer as tantivy::tokenizer::tokenizer::Tokenizer>::token_stream
     2.49%     -0.29%  index_bench-075aa16d259e3fd0  [.] core::str::iter::SplitInternal<P>::next
     0.92%     -0.25%  index_bench-075aa16d259e3fd0  [.] tantivy_common::vint::serialize_vint_u32
     0.05%     +0.24%  index_bench-075aa16d259e3fd0  [.] itertools::groupbylazy::GroupBy<K,I,F>::step
```
Thanks for the report.
I moved the JSON parsing and document building out of the benchmark and I still see almost a factor of two between array vs. no array (I updated the branch).
```
index-hdfs/index-hdfs-no-commit
                        time:   [467.03 ms 468.46 ms 470.16 ms]
                        change: [-53.083% -52.861% -52.630%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild

Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 16.7s, or reduce sample count to 10.

index-hdfs/index-hdfs-with-array-no-commit
                        time:   [810.60 ms 814.71 ms 819.59 ms]
                        change: [-57.361% -57.112% -56.796%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
```
@trinity-1686a can you have a look at this?
I've slightly modified the bench so that it does not include cloning the `Vec<Document>` or getting an index writer.
```
index-hdfs/index-hdfs-no-commit
                        time:   [5.8195 s 5.9066 s 5.9877 s]
index-hdfs/index-hdfs-with-array-no-commit
                        time:   [8.3026 s 8.3140 s 8.3286 s]

ratio = x1.40
```

perf samples:

```
index-hdfs/index-hdfs-no-commit             4981
index-hdfs/index-hdfs-with-array-no-commit  6904

ratio = 1.39, delta = 1923
```
Most of the delta is in allocating and freeing `Document`s (not much can be done about that), `BoxTokenStream`s and their inner fields (`BoxToken{Stream,Filter}`), or `String::with_capacity`. They collectively account for 1589 of the 1923 extra samples, or 82% of the difference, 66 being allocation/free.
So the TL;DR is that allocation/free related operations are 10x slower when we multiply the number of field values by 9.2 while keeping the same amount of raw content, and that makes allocation go from negligible to ~20% of time spent. I guess the next step will be to figure out a way to reuse things instead of dropping and creating new ones.
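A common way to cut that churn is to reuse a single scratch buffer across values instead of dropping and reallocating one per value. A minimal, self-contained sketch (hypothetical helper, not tantivy code): `clear()` resets the string's length but keeps its backing allocation alive.

```rust
// Reuse one scratch String across many values: only the first (largest)
// value triggers an allocation; later values reuse the buffer.
fn total_len_with_scratch(values: &[&str]) -> usize {
    let mut scratch = String::new(); // allocated lazily, reused below
    let mut total = 0;
    for v in values {
        scratch.clear();         // keeps capacity: no free + re-malloc per value
        scratch.push_str(v);     // may grow once, then reuses the buffer
        total += scratch.len();
    }
    total
}
```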
We might be able to remove the `Document` alloc/`cfree` overhead by just referencing the unserialized JSON text blocks (similar to serde_json_borrow). It would require lifetimes on tantivy's `Value`, though.

```rust
for token_filter in &self.token_filters {
    token_stream = token_filter.transform(token_stream);
}
```
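As a rough illustration of the borrowing idea mentioned above (a hypothetical `BorrowedValue` type, not tantivy's actual `Value`): string values point back into the original JSON buffer, so no owned `String` is allocated per field value.

```rust
// A value whose string variant borrows from the original input buffer
// (the serde_json_borrow idea), at the cost of a lifetime parameter.
#[derive(Debug, PartialEq)]
enum BorrowedValue<'a> {
    Str(&'a str),
    U64(u64),
}

// Slicing the input directly: zero String allocations per value.
fn first_word(input: &str) -> BorrowedValue<'_> {
    let end = input.find(' ').unwrap_or(input.len());
    BorrowedValue::Str(&input[..end])
}
```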
`SimpleTokenizer` uses a default `Token`, whose preallocated text capacity may be too large:
```rust
impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
```
Interestingly, halving the default text capacity makes both benchmarks slightly slower on my system (though possibly within the margin of error). This might be very allocator dependent; I guess mine is roughly constant time with respect to allocation size.
https://github.com/quickwit-oss/tantivy/pull/2062 removes the `Token::default()` creation for each text encountered by keeping a shared one on the tokenizer.
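The shared-token idea can be sketched roughly as follows (hypothetical `SimpleTokenizerSketch` type and fields, not the actual PR code): clearing the shared `Token` keeps its 200-byte text buffer alive across texts, so the allocation happens once per tokenizer rather than once per text.

```rust
// Simplified Token: only the fields needed to show the reuse pattern.
struct Token {
    text: String,
    position: usize,
}

// The tokenizer owns one scratch token, reused for every text.
struct SimpleTokenizerSketch {
    token: Token,
}

impl SimpleTokenizerSketch {
    fn new() -> Self {
        SimpleTokenizerSketch {
            token: Token {
                text: String::with_capacity(200), // allocated once
                position: usize::MAX,
            },
        }
    }

    // Instead of Token::default() per text, clear and hand out the shared one.
    fn reset(&mut self) -> &mut Token {
        self.token.text.clear(); // keeps the 200-byte capacity
        self.token.position = usize::MAX;
        &mut self.token
    }
}
```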
```
index-hdfs/index-hdfs-no-commit
                        time:   [644.80 ms 648.41 ms 652.51 ms]
                        change: [-4.1369% -2.7259% -1.3721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)

index-hdfs/index-hdfs-with-array-no-commit
                        time:   [800.76 ms 803.53 ms 808.09 ms]
                        change: [-5.1855% -4.7265% -4.0817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
```
Ideally the allocation of the `BoxTokenStream` per text would also be removed, but this may require some lifetime tricks.
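One such lifetime trick is returning a concrete stream type that borrows the input text, so no `Box<dyn TokenStream>` is allocated per text. A hypothetical sketch (not tantivy's API):

```rust
// A token stream that borrows the text it tokenizes: no heap allocation,
// but the struct now carries a lifetime tied to the input.
struct SplitStream<'a> {
    rest: &'a str,
}

impl<'a> SplitStream<'a> {
    // Yields whitespace-separated tokens as borrowed slices.
    fn advance(&mut self) -> Option<&'a str> {
        let trimmed = self.rest.trim_start();
        if trimmed.is_empty() {
            self.rest = "";
            return None;
        }
        let end = trimmed.find(' ').unwrap_or(trimmed.len());
        let (tok, rest) = trimmed.split_at(end);
        self.rest = rest;
        Some(tok)
    }
}

// The caller holds the concrete borrowed type: no Box needed.
fn token_stream(text: &str) -> SplitStream<'_> {
    SplitStream { rest: text }
}
```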
After #2062, the main differences I still see in flamegraphs are:
There has been substantial progress on this issue. It's still somewhat slower, but much less than before. I'm unlikely to work on this more right now, so I'm unassigning myself.
I still don't know if there is room for optimization there, but the performance drop is significant: on a dataset with documents containing a big array of small texts, I observed a 25% drop in indexing throughput.
I pushed to a branch a file, hdfs-with-array.json (the hdfs JSON with a body made of an array of texts), to reproduce the performance drop in the bench. Here is the branch: https://github.com/quickwit-oss/tantivy/tree/fmassot--bench-hdfs-with-array