[x] Update binary serialization format with document score boost metadata and object id #72
[x] Query syntax support for bracketed field names #76
[x] Query syntax for escaping characters #85
[x] Custom stemmers #82
[x] Query processing order and query part weightings #105
Release notes:
New features
Removed dependency on System.Collections.Immutable - only the netstandard2 version of the library now pulls in any dependencies. For net6 to net8, only built-in types are used.
Score boosting!
Score boosting as part of a query - grand^3 will boost the score of words matching "grand".
Boosting of object fields - .WithField("Name", c => c.Name, scoreBoost: 1.5D).
Boosting object scores based on a freshness date, e.g. the date it was last updated.
Boosting object scores based on a magnitude value, e.g. a star rating.
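To make the freshness and magnitude ideas concrete, here is a minimal sketch of one way such a boost could be calculated. It assumes a simple linear scale between the smallest and largest values seen; `ScoreBoostSketch`, `Multiplier` and `FreshnessMultiplier` are hypothetical names for illustration and are not part of LIFTI's API, and LIFTI's actual calculation may well differ.

```csharp
using System;

// Hypothetical helper for illustration only; not LIFTI's implementation.
public static class ScoreBoostSketch
{
    // Maps a value in [min, max] linearly onto a multiplier in [1, maxBoost].
    // The smallest value gets no boost (1.0); the largest gets the full boost.
    public static double Multiplier(double value, double min, double max, double maxBoost)
    {
        if (max <= min)
        {
            // No spread between values, so there is nothing to distinguish.
            return 1.0;
        }

        double position = Math.Clamp((value - min) / (max - min), 0.0, 1.0);
        return 1.0 + position * (maxBoost - 1.0);
    }

    // Freshness is the same idea, with dates converted to ticks first.
    public static double FreshnessMultiplier(DateTime value, DateTime oldest, DateTime newest, double maxBoost)
        => Multiplier(value.Ticks, oldest.Ticks, newest.Ticks, maxBoost);
}
```

Under these assumptions, a 3 star rating on a 1 to 5 scale with a maximum boost of 2 gives a multiplier of 1.5, which would then be applied to the document's score.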
Performance increases
There was a significant amount of work done to improve the performance and memory usage of building an index, index (de)serialization and searching.
All tests were run with BenchmarkDotNet:
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22631.3007)
Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores
The results below are a comparison of the previous v5 version of LIFTI against the code in the v6.0.0 branch, running on .NET 8.
Index construction
Populating an index with 200 Wikipedia entries in a single batch

| v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
|---|---|---|---|
| 1,134.2 | 567,623.8 | 952.6 | 286,617.6 |

Populating each of the 200 Wikipedia entries one at a time (i.e. a new snapshot created after each document)

| v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
|---|---|---|---|
| 4,284.4 | 1,370,649.9 | 1,212.4 | 613,540.2 |

Searching
Lots of individual optimisations including:
Merge sorting results during unions and intersections for queries containing more than one part
Optimised collection of affected results during wildcard and fuzzy match query parts
Early application of field filters when matching results
Weighting of query parts to work out the optimal execution order, so that documents can be eliminated from collection in other parts of the query
Together, these make for some nice gains across various query types.

| Query | v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
|---|---|---|---|---|
| "also has a" | 169.74 | 379.19 | 52.71 | 122.97 |
| (confiscation & th*) \| "and they" | 1,203.69 | 1,557.29 | 105.23 | 185.02 |
| * | 193,333.07 | 103,612.99 | 62,298.80 | 13,152.30 |
| ?and ?they ?also | 1,725.66 | 1,658.12 | 439.60 | 243.45 |
| and<br>they | 417.70 | 819.98 | 104.23 | 218.21 |
| and ~ they | 132.89 | 294.22 | 42.20 | 95.61 |
| and ~10> they | 132.64 | 297.67 | 43.34 | 97.04 |
| and > they | 214.03 | 455.75 | 106.16 | 169.17 |
| and they also | 283.82 | 565.34 | 56.02 | 109.51 |
| co*on | 445.27 | 798.77 | 180.04 | 263.47 |
| con??* | 2.21 | 2.30 | 1.96 | 1.97 |
| confiscation | 4.03 | 2.70 | 3.66 | 2.29 |
| th* | 2,277.00 | 2,914.76 | 569.76 | 412.60 |
| Title=?great | 416.08 | 399.17 | 108.86 | 34.50 |

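Two of the search optimisations listed above, merge sorting during unions and intersections and weighting query parts to pick an execution order, are easier to picture with a small example. The sketch below is not LIFTI's code: `SortedMergeSketch` is a hypothetical class, results are reduced to ascending lists of document ids, and list length stands in for a query part's weighting.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; LIFTI's internal types are richer than plain id lists.
public static class SortedMergeSketch
{
    // Intersects two ascending id lists in a single pass, without hashing or re-sorting.
    public static List<int> Intersect(IReadOnlyList<int> left, IReadOnlyList<int> right)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            if (left[i] == right[j]) { result.Add(left[i]); i++; j++; }
            else if (left[i] < right[j]) { i++; }
            else { j++; }
        }
        return result;
    }

    // Unions two ascending id lists in a single pass, keeping the output sorted and distinct.
    public static List<int> Union(IReadOnlyList<int> left, IReadOnlyList<int> right)
    {
        var result = new List<int>();
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            if (left[i] == right[j]) { result.Add(left[i]); i++; j++; }
            else if (left[i] < right[j]) { result.Add(left[i]); i++; }
            else { result.Add(right[j]); j++; }
        }
        while (i < left.Count) { result.Add(left[i]); i++; }
        while (j < right.Count) { result.Add(right[j]); j++; }
        return result;
    }

    // Intersects the smallest lists first (list length is the assumed weighting),
    // so later parts only need to be compared against a short candidate list.
    public static List<int> IntersectAll(IEnumerable<IReadOnlyList<int>> parts)
    {
        var ordered = parts.OrderBy(p => p.Count).ToList();
        if (ordered.Count == 0)
        {
            return new List<int>();
        }

        var current = new List<int>(ordered[0]);
        for (int k = 1; k < ordered.Count && current.Count > 0; k++)
        {
            current = Intersect(current, ordered[k]);
        }
        return current;
    }
}
```

Running the cheapest part first means the loop can stop as soon as the candidate list is empty, which is the same intuition behind eliminating documents early in other parts of a query.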
Deprecated:
ItemMetadata.Item/DocumentMetadata.Item -> use Key property
IFullTextIndex.Items -> use Metadata property
FullTextIndexBuilder.WithDuplicateItemBehavior -> use WithDuplicateKeyBehavior method
IndexOptions.DuplicateItemBehavior -> use DuplicateKeyBehavior property
ScoredToken.ItemId -> use DocumentId property
QueryTokenMatch.ItemId -> use DocumentId property
ItemMetadata.Count -> IndexMetadata.DocumentCount
ItemMetadata.GetMetadata -> IndexMetadata.GetDocumentMetadata
Technically breaking
IdPool and IIdPool are now internal - these weren't really exposed before anyway
Removed interface IItemMetadata - just using DocumentMetadata going forwards
QueryContext no longer has ApplyTo method
IIndexNavigator: added Snapshot property
IIndexNavigator: added overloads for GetExactMatches and GetExactAndChildMatches that allow for the current QueryContext to be passed in so unnecessary results are not collected.
IIndexNavigator: new additional methods AddExactMatches and AddExactAndChildMatches that allow you to efficiently collect matches using a DocumentMatchCollector before converting it to an IntermediateQueryResult.
IQueryPart now has a double CalculateWeighting(Func<IIndexNavigator> navigatorCreator) method to help the query processing logic evaluate the most efficient order of execution.
TItem generic type parameter name has been renamed to TObject.
All query part types are now sealed
New method IIndexNavigator.ExactMatchCount()
IntermediateQueryResult constructors are no longer public
Index serialization interfaces have been reworked. This shouldn't affect anyone because it was technically impossible to write your own serializers based upon them due to a lack of publicly accessible methods for rehydrating an index.
IIndexNavigatorBookmark now implements IDisposable - you don't technically have to dispose it, but doing so will return it to a pool and allow it to be reused.
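For anyone unfamiliar with the dispose-to-return-to-pool pattern mentioned for IIndexNavigatorBookmark, here is a minimal sketch of the general idea. `BookmarkPool` and `PooledBookmark` are hypothetical types made up for this example; they are not LIFTI's classes, and the real bookmark tracks navigator state rather than a single position.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pool for illustration only; not LIFTI's implementation.
public sealed class BookmarkPool
{
    private readonly Stack<PooledBookmark> available = new Stack<PooledBookmark>();

    // Number of bookmarks currently waiting to be reused.
    public int AvailableCount => available.Count;

    // Hands out a previously returned bookmark if there is one, otherwise allocates.
    public PooledBookmark Rent(int position)
    {
        var bookmark = available.Count > 0 ? available.Pop() : new PooledBookmark(this);
        bookmark.Activate(position);
        return bookmark;
    }

    internal void Return(PooledBookmark bookmark) => available.Push(bookmark);
}

public sealed class PooledBookmark : IDisposable
{
    private readonly BookmarkPool owner;
    private bool rented;

    internal PooledBookmark(BookmarkPool owner) => this.owner = owner;

    public int Position { get; private set; }

    internal void Activate(int position)
    {
        Position = position;
        rented = true;
    }

    // Disposing is optional: an undisposed bookmark is simply left to the garbage
    // collector. Disposing hands it back to the pool, and a second call is a no-op.
    public void Dispose()
    {
        if (!rented)
        {
            return;
        }

        rented = false;
        owner.Return(this);
    }
}
```

With this shape, wrapping a bookmark in a using statement is enough to recycle it, and forgetting to do so costs an allocation rather than causing a leak.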
Querying changes
ScoredFieldMatch is now quite different and no longer publicly constructable. The only place you would have encountered this is in a custom scorer, and that's no longer necessary.
Several types that are only likely to have been used internally are gone:
FieldMatch
QueryTokenMatch
CompositeTokenMatchLocation
SingleTokenMatchLocation
ITokenLocationMatch
TokenLocationMatch
Breaking
DuplicateItemBehavior enum -> renamed to DuplicateKeyBehavior
DuplicateItemBehavior.ReplaceItem -> use DuplicateKeyBehavior.Replace instead
IQueryContext -> just use the concrete QueryContext. This affects IQueryPart.Evaluate, as it now takes QueryContext
IIndexNodeFactory.CreateNode now takes concrete types ChildNodeMap and DocumentTokenMatchMap instead of ImmutableDictionary and ImmutableList respectively.
A maximum of 31 different object types can now be configured against a single FullTextIndexBuilder (i.e. 31 distinct calls to WithObjectTokenization) - if anyone is actually indexing more than 31 object types, I'd be very interested to understand your scenario!
The rest of these will only affect you if you are explicitly referencing the type names in your code:
ItemPhrases -> renamed to DocumentPhrases
ItemMetadata -> renamed to DocumentMetadata
IItemStore -> renamed to IIndexMetadata
Tracking documentation changes for v6