quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Optimize single document posting lists. #1041

Open fulmicoton opened 3 years ago

fulmicoton commented 3 years ago

It is fairly common to use a term as a primary id.

When only indexing docids, redirecting to the posting list seems overkill. We could simply store the docid for terms that have doc_freq=1 right after the TermInfoBlock.

The index should end up being a tad smaller, and we would remove one seek.
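
To make the proposal concrete, here is a minimal sketch of a dictionary entry that carries the doc id inline when a term matches a single document. `TermEntry`, `resolve`, and the flat little-endian postings buffer are assumptions made for illustration; they are not tantivy's actual `TermInfo` or postings format.

```rust
/// Hypothetical term dictionary entry (not tantivy's real TermInfo).
#[derive(Debug, PartialEq)]
enum TermEntry {
    /// doc_freq == 1: the doc id is stored in the dictionary itself.
    SingleDoc(u32),
    /// General case: a byte range inside a separate postings buffer.
    PostingsRange { start: usize, num_bytes: usize },
}

/// Resolves an entry to its doc ids. `postings` stands in for the postings
/// file and is assumed to hold plain little-endian u32 doc ids.
fn resolve(entry: &TermEntry, postings: &[u8]) -> Vec<u32> {
    match entry {
        // No access to `postings` at all: this is the seek that is saved.
        TermEntry::SingleDoc(doc) => vec![*doc],
        TermEntry::PostingsRange { start, num_bytes } => postings
            [*start..*start + *num_bytes]
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    }
}
```

In the `SingleDoc` branch the postings buffer is never touched, which is where both the saved seek and the smaller index would come from.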

PSeitz commented 2 years ago

We can encode up to 3 docs in the postings positions directly, instead of jumping into the posting list.

let postings_start_offset = u64::deserialize(reader)? as usize;
let postings_num_bytes = u32::deserialize(reader)? as usize;
let postings_end_offset = postings_start_offset + postings_num_bytes;

postings_start_offset would hold [docid1_u32, docid2_u32] and postings_end_offset would hold docid3_u32.
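
As a rough sketch of this idea, the 64-bit offset field and the 32-bit length field together have room for three u32 doc ids. The helper names `pack_doc_ids` and `unpack_doc_ids` are hypothetical, and the sketch assumes doc_freq is available separately to tell how many slots are in use.

```rust
/// Packs up to three doc ids into the (u64, u32) pair that would otherwise
/// hold the postings start offset and byte length. Returns None for an
/// empty list or for more than three docs (those need a real posting list).
fn pack_doc_ids(doc_ids: &[u32]) -> Option<(u64, u32)> {
    match doc_ids {
        [d1] => Some((u64::from(*d1), 0)),
        [d1, d2] => Some((u64::from(*d1) | (u64::from(*d2) << 32), 0)),
        [d1, d2, d3] => Some((u64::from(*d1) | (u64::from(*d2) << 32), *d3)),
        _ => None,
    }
}

/// Reverses `pack_doc_ids`. `doc_freq` is assumed to be stored elsewhere
/// and says how many of the three slots are meaningful.
fn unpack_doc_ids(start_field: u64, len_field: u32, doc_freq: u32) -> Vec<u32> {
    let slots = [start_field as u32, (start_field >> 32) as u32, len_field];
    slots[..doc_freq.min(3) as usize].to_vec()
}
```

A reader would branch on doc_freq: for 1 to 3 it would call `unpack_doc_ids`, otherwise it would interpret the same fields as an offset and length.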

fulmicoton commented 2 years ago

@PSeitz It does not actually work like that.

Terms are stored in blocks. The first term is serialized using the scheme you copy-pasted, which is quite wasteful. The other terms are expressed as deltas against this first term and bitpacked.
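
The block scheme described here can be illustrated with a small sketch: the first value of a block is kept in full, and every other value is stored as a delta against it using a shared bit width. The `Block` struct, `encode`, and `get` are hypothetical names, and the layout is a simplification for illustration, not tantivy's actual on-disk format. It assumes every value in a block is at least as large as the first one, as ascending offsets would be.

```rust
/// Hypothetical block of values (for example postings offsets).
struct Block {
    first: u64,      // stored in full
    num_bits: u32,   // bit width shared by all deltas
    len: usize,      // number of values, including `first`
    packed: Vec<u8>, // deltas, `num_bits` bits each, least significant bit first
}

fn encode(values: &[u64]) -> Option<Block> {
    let (&first, rest) = values.split_first()?;
    // Deltas are taken against the first value, not the previous one.
    // Returns None if a value is smaller than `first`.
    let deltas = rest
        .iter()
        .map(|v| v.checked_sub(first))
        .collect::<Option<Vec<u64>>>()?;
    let max_delta = deltas.iter().copied().max().unwrap_or(0);
    let num_bits = 64 - max_delta.leading_zeros();
    let mut packed = vec![0u8; (deltas.len() * num_bits as usize + 7) / 8];
    let mut bit_pos = 0usize;
    for delta in deltas {
        for i in 0..num_bits {
            if (delta >> i) & 1 == 1 {
                packed[bit_pos / 8] |= 1 << (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    Some(Block { first, num_bits, len: values.len(), packed })
}

fn get(block: &Block, idx: usize) -> Option<u64> {
    if idx >= block.len {
        return None;
    }
    if idx == 0 {
        return Some(block.first);
    }
    let start = (idx - 1) * block.num_bits as usize;
    let mut delta = 0u64;
    for i in 0..block.num_bits as usize {
        let bit_pos = start + i;
        let bit = (block.packed[bit_pos / 8] >> (bit_pos % 8)) & 1;
        delta |= u64::from(bit) << i;
    }
    Some(block.first + delta)
}
```

With the values 1000, 1003, 1010, 1015 the deltas are 3, 10, and 15, so each one fits in 4 bits and the three of them take 2 bytes instead of the 24 bytes three full u64 values would need. Because the fields of non-first terms are not stored at full width, there is no spare u64 and u32 per term to reuse for doc ids.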