pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0
932 stars 64 forks source link

Dynamic dispatch for block codecs #579

Closed elshize closed 3 months ago

elshize commented 7 months ago

Use dynamic dispatch for block codecs. The old class template block_freq_index is replaced by the BlockInvertedIndex class. This class contains a shared pointer to a BlockCodec object, which is a superclass of all block codec implementations.

Changelog-changed: BlockInvertedIndex replaces block_freq_index Changelog-performance: Expect a slight performance drop of all block codecs

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.86%. Comparing base (277cbe9) to head (6178ca5).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #579 +/- ## ========================================== + Coverage 93.27% 93.86% +0.58% ========================================== Files 89 80 -9 Lines 4448 3881 -567 ========================================== - Hits 4149 3643 -506 + Misses 299 238 -61 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

JMMackenzie commented 7 months ago

Benchmarking on MS-Marco v2 passages with BM25:

main

k=10
{"type": "block_simdbp", "query": "wand", "avg": 12999.1, "q50": 5843, "q90": 32519, "q95": 47480, "q99": 95925}
{"type": "block_simdbp", "query": "maxscore", "avg": 9178.09, "q50": 3963, "q90": 23918, "q95": 33600, "q99": 66648}
{"type": "block_simdbp", "query": "ranked_or", "avg": 458066, "q50": 387151, "q90": 968641, "q95": 1.2303e+06, "q99": 1.8538e+06}

k=1000
{"type": "block_simdbp", "query": "wand", "avg": 48258.5, "q50": 24904, "q90": 110636, "q95": 175410, "q99": 379647}
{"type": "block_simdbp", "query": "maxscore", "avg": 31727.1, "q50": 17407, "q90": 69608, "q95": 112551, "q99": 243058}
{"type": "block_simdbp", "query": "ranked_or", "avg": 458130, "q50": 390417, "q90": 964131, "q95": 1.22383e+06, "q99": 1.84154e+06}

this

k=10
{"type": "block_simdbp", "query": "wand", "avg": 12784.1, "q50": 5622, "q90": 32250, "q95": 47708, "q99": 95753}
{"type": "block_simdbp", "query": "maxscore", "avg": 9132.16, "q50": 3899, "q90": 23733, "q95": 35401, "q99": 66252}
{"type": "block_simdbp", "query": "ranked_or", "avg": 445143, "q50": 372223, "q90": 941900, "q95": 1.20444e+06, "q99": 1.8235e+06}

k=1000
{"type": "block_simdbp", "query": "wand", "avg": 48205.2, "q50": 24919, "q90": 111024, "q95": 175868, "q99": 374888}
{"type": "block_simdbp", "query": "maxscore", "avg": 31675.2, "q50": 17268, "q90": 69635, "q95": 112292, "q99": 247485}
{"type": "block_simdbp", "query": "ranked_or", "avg": 450663, "q50": 374939, "q90": 955607, "q95": 1.21881e+06, "q99": 1.87744e+06}
JMMackenzie commented 7 months ago

In other words, it looks to have no performance penalty? Seems kind of weird but I'll take it. We should double check that we're doing what we think we're doing I guess...

JMMackenzie commented 7 months ago

I tested stream_vbyte as well with MaxScore k=10, found no regressions.

main
{"type": "block_streamvbyte", "query": "maxscore", "avg": 9534.42, "q50": 4137, "q90": 24831, "q95": 34820, "q99": 68842}

this
{"type": "block_streamvbyte", "query": "maxscore", "avg": 9465.44, "q50": 4100, "q90": 24682, "q95": 34673, "q99": 68516}

Edit: "main" is not really main, it's just queries vs queries_dynamic in this branch. I think this can be merged once clang-format is happy?

elshize commented 7 months ago

@amallia FYI we did some testing here regarding dynamic dispatch for block codecs. Seems no performance regression.

I think we'll have to go for it. This will simplify the code, make the compilation faster, and so on. I'm planning to work on porting all the code to the new type-erased block codec as part of this PR. Let me know if you have any comment regarding the general idea; I'll let you both know when it's ready for review.

elshize commented 3 months ago

@JMMackenzie Thanks! Before merging, I'll clean up the git history here. I might just squash all of it, but at least I want it clean and have some important info in it, including some performance disclaimers.