mikegoatly / lifti

A lightweight full text indexer for .NET
MIT License

v6 documentation changes #96

Closed mikegoatly closed 10 months ago

mikegoatly commented 11 months ago

Tracking documentation changes for v6

Release notes:

New features

Performance increases

There was a significant amount of work done to improve performance and memory usage of building an index, index (de)serialization and searching.

All tests were run with BenchmarkDotNet v0.13.5 on Windows 11 (10.0.22631.3007), Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores. The results below compare the previous v5 version of LIFTI against the code in the v6.0.0 branch, running on .NET 8.

Index construction

Populating an index with 200 Wikipedia entries in a single batch

| v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
| --- | --- | --- | --- |
| 1,134.2 | 567,623.8 | 952.6 | 286,617.6 |

Populating each of the 200 Wikipedia entries one at a time (i.e. a new snapshot created after each document)

| v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
| --- | --- | --- | --- |
| 4,284.4 | 1,370,649.9 | 1,212.4 | 613,540.2 |
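The difference between the two scenarios above can be sketched roughly as follows. This is an illustration only, not the benchmark code: the `BatchingSketch` class and the sample texts are made up for the example, and it assumes the `BeginBatchChange` and `CommitBatchChangeAsync` members of the index as exposed by the Lifti.Core package.

```csharp
using System.Threading.Tasks;
using Lifti;

public static class BatchingSketch
{
    // Hypothetical sample data standing in for the Wikipedia entries.
    private static readonly string[] Texts =
    {
        "The quick brown fox",
        "jumps over the lazy dog",
        "and they also ran"
    };

    // Each AddAsync call outside of a batch publishes a new index snapshot.
    public static async Task<FullTextIndex<int>> AddOneAtATimeAsync()
    {
        var index = new FullTextIndexBuilder<int>().Build();
        for (var i = 0; i < Texts.Length; i++)
        {
            await index.AddAsync(i, Texts[i]);
        }

        return index;
    }

    // Inside a batch, a single snapshot is published when the batch is committed.
    public static async Task<FullTextIndex<int>> AddInBatchAsync()
    {
        var index = new FullTextIndexBuilder<int>().Build();
        index.BeginBatchChange();
        for (var i = 0; i < Texts.Length; i++)
        {
            await index.AddAsync(i, Texts[i]);
        }

        await index.CommitBatchChangeAsync();
        return index;
    }
}
```

Both methods end up with the same searchable content; only the number of intermediate snapshots differs.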

Searching

Lots of individual optimisations make for some nice gains across various query types.

| Query | v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
| --- | --- | --- | --- | --- |
| `"also has a"` | 169.74 | 379.19 | 52.71 | 122.97 |
| `(confiscation & th*) \| "and they"` | 1,203.69 | 1,557.29 | 105.23 | 185.02 |
| `*` | 193,333.07 | 103,612.99 | 62,298.80 | 13,152.30 |
| `?and ?they ?also` | 1,725.66 | 1,658.12 | 439.60 | 243.45 |
| `and they` | 417.70 | 819.98 | 104.23 | 218.21 |
| `and ~ they` | 132.89 | 294.22 | 42.20 | 95.61 |
| `and ~10> they` | 132.64 | 297.67 | 43.34 | 97.04 |
| `and > they` | 214.03 | 455.75 | 106.16 | 169.17 |
| `and they also` | 283.82 | 565.34 | 56.02 | 109.51 |
| `co*on` | 445.27 | 798.77 | 180.04 | 263.47 |
| `con??*` | 2.21 | 2.30 | 1.96 | 1.97 |
| `confiscation` | 4.03 | 2.70 | 3.66 | 2.29 |
| `th*` | 2,277.00 | 2,914.76 | 569.76 | 412.60 |
| `Title=?great` | 416.08 | 399.17 | 108.86 | 34.50 |

Deprecated:

- ItemMetadata.Item/DocumentMetadata.Item -> use Key property
- IFullTextIndex.Items -> use Metadata property
- FullTextIndexBuilder.WithDuplicateItemBehavior -> use WithDuplicateKeyBehavior method
- IndexOptions.DuplicateItemBehavior -> use DuplicateKeyBehavior property
- ScoredToken.ItemId -> use DocumentId property
- QueryTokenMatch.ItemId -> use DocumentId property
- ItemMetadata.Count -> IndexMetadata.DocumentCount
- ItemMetadata.GetMetadata -> IndexMetadata.GetDocumentMetadata
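As a rough before/after for one of the deprecations listed above, the sketch below reads the document count through the `Metadata` property rather than the deprecated `Items` property. The `DeprecationSketch` class and the sample documents are hypothetical, and the exact shape of the v5 call shown in the comment is an assumption based on the old member names listed here.

```csharp
using System.Threading.Tasks;
using Lifti;

public static class DeprecationSketch
{
    public static async Task<int> CountDocumentsAsync()
    {
        var index = new FullTextIndexBuilder<string>().Build();
        await index.AddAsync("doc-a", "confiscation of goods");
        await index.AddAsync("doc-b", "and they also left");

        // v5 (deprecated, assumed shape): index.Items.Count
        // v6: the same information is read via the Metadata property.
        return index.Metadata.DocumentCount;
    }
}
```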

Technically breaking

- IdPool and IIdPool are now internal - these weren't really exposed before anyway
- Removed interface IItemMetadata - just using DocumentMetadata going forwards
- QueryContext no longer has ApplyTo method
- IIndexNavigator: added Snapshot property
- IIndexNavigator: added overloads for GetExactMatches and GetExactAndChildMatches that allow for the current QueryContext to be passed in so unnecessary results are not collected.
- IIndexNavigator: new additional methods AddExactMatches and AddExactAndChildMatches that allow you to efficiently collect matches using a DocumentMatchCollector before converting it to an IntermediateQueryResult.
- IQueryPart now has double CalculateWeighting(Func<IIndexNavigator> navigatorCreator) method to help the query processing logic evaluate the most efficient order of execution.
- TItem generic type parameter name has been renamed to TObject.
- All query part types are now sealed
- New method IIndexNavigator.ExactMatchCount()
- IntermediateQueryResult constructors are no longer public
- Index serialization interfaces have been reworked. This shouldn't affect anyone because it was technically impossible to write your own serializers based upon them due to a lack of publicly accessible methods for rehydrating an index.
- IIndexNavigatorBookmark now implements IDisposable - you don't technically have to dispose it, but doing so will return it to a pool and allow it to be reused.
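The bookmark change above could be used along these lines. This is a sketch under assumptions rather than a reference: it assumes `CreateNavigator`, `Process`, `CreateBookmark` and `Apply` behave as in the public navigator API, and that the default tokenizer stores text upper-cased, which is why the navigated text is upper case. The `BookmarkSketch` class is made up for the example.

```csharp
using System;
using System.Threading.Tasks;
using Lifti;

public static class BookmarkSketch
{
    public static async Task<bool> NavigateAsync()
    {
        var index = new FullTextIndexBuilder<int>().Build();
        await index.AddAsync(1, "confiscation");

        using var navigator = index.CreateNavigator();

        // Assumption: the default tokenizer normalizes indexed text to upper case.
        navigator.Process("CON".AsSpan());

        // Disposing the bookmark is optional, but returns it to the pool for reuse.
        using var bookmark = navigator.CreateBookmark();

        // Walk down a path that does not exist in the index...
        navigator.Process("XYZ".AsSpan());

        // ...then rewind to the bookmarked position and try another path.
        bookmark.Apply();
        return navigator.Process("FISCATION".AsSpan());
    }
}
```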

Querying changes

ScoredFieldMatch is now quite different and no longer publicly constructable. The only place you would have encountered this is in a custom scorer, and that's no longer necessary.

Several types that are only likely to have been used internally are gone:

Breaking

- DuplicateItemBehavior enum -> renamed to DuplicateKeyBehavior
- DuplicateItemBehavior.ReplaceItem -> use DuplicateKeyBehavior.Replace instead
- IQueryContext -> just use the concrete QueryContext; this affects IQueryPart.Evaluate, which now takes QueryContext
- IIndexNodeFactory.CreateNode now takes concrete types ChildNodeMap and DocumentTokenMatchMap instead of ImmutableDictionary and ImmutableList respectively.
- A maximum of 31 different object types can now be configured against a single FullTextIndexBuilder (i.e. 31 distinct calls to WithObjectTokenization) - if anyone is actually indexing more than 31 object types, I'd be very interested to understand your scenario!

The rest of these will only affect you if you are explicitly referencing the type names in your code:

- ItemPhrases -> renamed to DocumentPhrases
- ItemMetadata -> renamed to DocumentMetadata
- IItemStore -> renamed to IIndexMetadata
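For the DuplicateItemBehavior rename in the breaking changes, a migrated builder call might look like the sketch below. The `DuplicateKeySketch` class and the sample texts are hypothetical, and the v5 line in the comment is an assumed shape based on the old names listed above.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Lifti;

public static class DuplicateKeySketch
{
    public static async Task<int> ReplaceExistingAsync()
    {
        var index = new FullTextIndexBuilder<int>()
            // v5 (assumed shape): .WithDuplicateItemBehavior(DuplicateItemBehavior.ReplaceItem)
            .WithDuplicateKeyBehavior(DuplicateKeyBehavior.Replace)
            .Build();

        await index.AddAsync(1, "original text");

        // Adding the same key again replaces the previously indexed text.
        await index.AddAsync(1, "replacement text");

        // Number of documents still matching the replaced text.
        return index.Search("original").ToList().Count;
    }
}
```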