Closed by kampilan 1 year ago
Hi @kampilan,
I'm starting by assuming your class definition looks something like this:
class Foo
{
    public string KeyPart1 { get; set; }
    public string KeyPart2 { get; set; }
    public IDictionary<string, string> Fields { get; set; }
}
Where both KeyPart1 and KeyPart2 need to be combined to make the composite key.
If the keys of your dictionary are well known and not subject to change, then you could get away with something like this:
var index = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<Foo>(o => o
        .WithKey(c => $"{c.KeyPart1}|{c.KeyPart2}") // Use a concatenated string for the composite key
        .WithField("Name", c => c.Fields["Name"])
        .WithField("Info", c => c.Fields["Info"])
    )
    .Build();
However, I have a strong suspicion that this isn't going to be what you need! 😊
I think this issue relates to the concept of "dynamic fields", which is kind of where #66 was heading. How many unique keys are you expecting there to be in your dictionary?
Hi Mike
You are spot on in your understanding of what I am trying to do. I don't want to do what you have suggested, because the point of my strategy is to not have to know anything about the document I am trying to index and then subsequently search. As for the number of fields, in my case the number would be small: 5 or 6 at most, but one might expect the same number you have on an object. Nirvana would be:
var index = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<Foo>(o => o
        .WithKey(c => $"{c.KeyPart1}|{c.KeyPart2}") // Use a concatenated string for the composite key
        .WithFields(c => c.Fields)
    )
    .Build();
Thanks for your time.
Jim
@kampilan I'm going to work on this for v5 because there are going to be some breaking changes to some interfaces.
I have some initial work done in the v5.0.0-dynamic-fields branch.
There's a quick sample use in the test console project, which creates this index:
var objects = new Dictionary<int, TestObject>
{
    {
        1,
        new TestObject(
            1,
            "Some details",
            new Dictionary<string, string> { { "Name", "Joe Bloggs" }, { "Profile", "Just placeholder text here" } })
    },
    {
        2,
        new TestObject(
            2,
            "Chillin with orange juice",
            new Dictionary<string, string> { { "Name", "Just Bob" }, { "FavouriteExercise", "Jumping jacks" } })
    }
};
var index = new FullTextIndexBuilder<int>()
    .WithObjectTokenization<TestObject>(o => o
        .WithKey(c => c.Id)
        .WithField("Details", x => x.Details)
        .WithDynamicFields(c => c.Data)
    )
    .Build();
When run, this demonstrates that the dynamic fields are registered, are searchable, and can have matching search phrases extracted for them.
That looks fantastic! Can't wait to try it. This will allow me to build a more generic implementation.
I have to tell you I have been enjoying working with Lifti. I have built a prototype that supports persistence and clustering.
Currently, and perhaps erroneously, I am deserializing the index from a cached byte array on each use. It seems to work very well. I did this as a means of guaranteeing thread-safety until I read your FAQ and saw that the index is already thread-safe. My question is: how expensive is the deserialization? Is my strategy a foolish waste? It has some simplicity benefits, but I don't want to do that at the expense of performance or resource utilization.
Finally, do you have any sense of what the limit of Lifti is? Is it a certain number of documents or keywords? Is it a memory limit?
Thanks so much.
Jim
If you want to try it out, you can use the CI preview build from https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json - the last built version is 5.0.0-CI-20230612-170058.
With regards to deserializing the index on each request, that will cause the in-memory structures to be rebuilt on each operation. How "bad" that would be will depend on your usage and how frequently you're expecting it to happen, but keeping the same instances around will definitely be more efficient.
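To make the "keep the same instance around" point concrete, here is a minimal sketch of the build-once, reuse-everywhere pattern. It deliberately does not call LIFTI: `LoadIndex` is a hypothetical stand-in for whatever deserializes the cached byte array, and the `string[]` it returns is only a placeholder for the real index type.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the expensive step (deserializing the cached
// byte array into an index). It is not a LIFTI call; it only counts its runs.
int loadCount = 0;
string[] LoadIndex()
{
    Interlocked.Increment(ref loadCount);
    return new[] { "doc1", "doc2" };
}

// Build the instance once, on first use, and share it afterwards.
// ExecutionAndPublication ensures a single load even under concurrent access.
var cachedIndex = new Lazy<string[]>(LoadIndex, LazyThreadSafetyMode.ExecutionAndPublication);

for (var request = 0; request < 1000; request++)
{
    var index = cachedIndex.Value; // every simulated request reuses the same instance
    _ = index.Length;
}

Console.WriteLine($"Loads performed: {loadCount}");
```

In a real application the cached value would be replaced or rebuilt whenever the underlying data changes, which this sketch leaves out.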
I've not really explored the limits of LIFTI other than the stress test of loading in and serializing 200 Wikipedia articles. The number of documents is probably going to be less of a bottleneck than the size of the documents and the number of unique words in them, because the index structure is retained entirely in memory as things stand. Something to watch as the index grows will be the performance of some of the more "expensive" queries such as wildcard and fuzzy matching.
I will try the new version and let you know.
I have changed my implementation to cache the index itself, not the serialized byte array. It's much faster ;)
Thanks for the performance tips. I'm able to search 20,000 documents with 20 keywords each in 10-20 ms. The serialized size is 5.8 MB. I'll increase the keyword count in my next round of testing.
Thanks.
Jim
Hi Mike
How does one interpret the scores (both at the result level and at the individual match level)? What are the boundaries of these numbers? Can they be normalized into a percentage, perhaps, so the end user can get a sense of the strength of the result?
I will be trying the latest version this weekend. Thanks so much.
Jim
There is a bit of information about LIFTI scoring here, but the bottom line is that for each search term, LIFTI uses Okapi BM25 to score results. This isn't a bounded algorithm, so the best you can do is use the resulting score to order the results.
What's sometimes more useful for users is to see where in the document the search terms were actually found - you can get these by extracting matched phrases from search results.
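On the scoring point: because BM25 scores have no fixed upper bound, about the only percentage you can derive is one relative to the best hit in the same result set. The sketch below shows that idea on made-up `(Key, Score)` pairs; it does not use LIFTI's result types, and it assumes at least one result. A value of 40% here only means "40% of the top score for this query", not an absolute match strength, and it is not comparable between queries.

```csharp
using System;
using System.Linq;

// Made-up (key, score) pairs standing in for search results; the numbers are illustrative only.
var results = new (string Key, double Score)[]
{
    ("doc-b", 3.0),
    ("doc-a", 7.5),
    ("doc-c", 1.5),
};

// The best score in this result set becomes the 100% reference point.
var topScore = results.Max(r => r.Score);

var ranked = results
    .OrderByDescending(r => r.Score)
    .Select(r => (r.Key, Relative: topScore > 0 ? r.Score / topScore * 100 : 0))
    .ToList();

foreach (var (key, relative) in ranked)
{
    Console.WriteLine($"{key}: {relative:F0}% of top score");
}
```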
A million apologies for not consulting the docs first. I will investigate what you suggest above. I was just trying to rationalize the score into something that would suggest to the user the strength of the match. Thanks so much.
No problem at all! Let me know how you get on.
If you want to try it out, you can use the CI preview build from https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json - the last built version is 5.0.0-CI-20230612-170058
FYI this does the trick using the dotnet CLI:
dotnet add package Lifti.Core --source https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json --version 5.0.0-CI-20230612-170058
Massive thanks to @h0lg for spotting the issue with index serialization. I've updated the index serialization logic:
The API has changed a little, and you now need to register a dynamic field provider with a unique name:
.WithDynamicFields("ExtraData", c => c.Data)
This name is used to marry up serialized dynamic fields with the relevant tokenizers, etc., when they are deserialized.
The latest CI version is 5.0.0-CI-20230620-190859
@mikegoatly 5.0.0-CI-20230620-190859 works like a charm!
The new API feels way more natural for indexing nested objects. I was able to skip a whole lot of hoop-jumping in my updated implementation - love it!
I've just tested it by indexing 3000 YouTube videos with a combined JSON size of ~80 MB, and it crunches the index down to ~43 MB. A simple search on the built index (including deserializing it first) takes 3 seconds on my dev rig.
There's only one minor issue concerning deserialization: Old indexes seem to deserialize without error - but searching them will throw with the updated tokenizer config (including dynamic fields).
I'd rather deal with indexes in an outdated format at deserialization time - because it's easier than guessing from the search error whether the problem might be an outdated index. Have you thought about how to deal with potentially outdated indexes - and if so, do you have an opinion or recommendation?
@h0lg Thanks - I thought I'd handled deserialization of older indexes. The general approach I'm taking is:
What's the error you're seeing?
When I search with 5.0.0-CI-20230620-190859 an index built with 4.0.1, I get the following error:
Field id 4 has no associated field name
at Lifti.IndexedFieldLookup.GetFieldForId(Byte id) in D:\a\1\s\src\Lifti.Core\IndexedFieldLookup.cs:line 41
at Lifti.Querying.Query.<>c__DisplayClass7_0`1.<Execute>b__1(ScoredFieldMatch m) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 69
at System.Linq.Enumerable.SelectListIterator`2.ToList()
at Lifti.Querying.Query.Execute[TKey](IIndexSnapshot`1 index) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 66
at Lifti.FullTextIndex`1.Search(IQuery query) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 264
at Lifti.FullTextIndex`1.Search(String searchText) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 253
Removing the old index, the new one gets built, is less than half the size and searchable without an issue.
I'm having trouble reproducing your exact issue @h0lg - would you mind sharing the index builder code before and after upgrading, and the v4 serialized index?
@mikegoatly Sure - please have a look at the SubTubular branch lifti-index-object-graphs. Its last two commits contain one that updates Lifti from 4.0.1 to the 5.0 pre-release and the builder to use dynamic fields and one with instructions to build an index and reproduce the error.
Let me know if you have trouble reproducing the error or understanding anything over there.
Got it sorted. I've added handling for this scenario to the older deserializers such that they verify that all the field ids in the index are present in the index they're being deserialized into. In your scenario you'll now get a LiftiException with the message:
Serialized index contains unknown field ids. Fields have most likely been removed from the FullTextIndexBuilder configuration.
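The kind of check described here can be pictured with a small stand-alone sketch. This is not LIFTI's deserializer code: `FindUnknownFieldIds` and the sample ids are hypothetical, chosen to mirror the "Field id 4" case from the stack trace above, and the sketch prints a message where the library would throw a LiftiException.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the serialized field ids that the target
// index configuration does not know about, in ascending order.
static List<byte> FindUnknownFieldIds(IEnumerable<byte> serializedIds, HashSet<byte> knownIds)
    => serializedIds.Distinct().Where(id => !knownIds.Contains(id)).OrderBy(id => id).ToList();

// Field ids registered by the current builder configuration (illustrative values).
var knownIds = new HashSet<byte> { 0, 1, 2, 3 };

// Field ids referenced by the serialized index (illustrative values).
var serializedIds = new byte[] { 1, 2, 4, 4 };

var unknown = FindUnknownFieldIds(serializedIds, knownIds);
if (unknown.Count > 0)
{
    // Failing here, at load time, is easier to diagnose than a later search error.
    Console.WriteLine($"Unknown field ids: {string.Join(", ", unknown)}");
}
```

Checking at deserialization time means an outdated index is reported once, up front, rather than surfacing on the first query that touches the missing field.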
I'll find some time to update the documentation and get v5 released, assuming I don't run into any other bugs 😊
The latest CI build is Lifti.Core.5.0.0-CI-20230702-070029 if you want to try it out yourself.
@mikegoatly Great - looking forward to it!
One additional remark: I realized that the field names I used in my example contain spaces. That seems to work fine - except for field queries. But maybe I didn't know the right syntax for writing field queries for field names containing spaces.
If this was a valid field config, how would I write a field query for it?
.WithDynamicFields("Extra Data", c => c.Data)
To be clear - I don't need it to be. But if it isn't, you may want to discourage or mention that somehow. Throw an exception? Add a caveat about limited field searchability to the docs?
Yeah, there needs to be a way to quote field names in the query syntax. I think for now I'll keep it as a documented limitation but raise a separate issue and pick it up in a later release.
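Until quoting is supported, one way to avoid surprises is to reject such names in your own code before they reach the builder. The guard below is a hypothetical sketch, not part of LIFTI: `IsQueryableFieldName` only checks for whitespace, which is the one limitation raised in this thread, so other characters the query syntax may not accept are not covered.

```csharp
using System;
using System.Linq;

// Hypothetical guard: a field name is treated as safe for field queries
// when it is non-empty and contains no whitespace characters.
static bool IsQueryableFieldName(string name)
    => !string.IsNullOrEmpty(name) && !name.Any(char.IsWhiteSpace);

Console.WriteLine(IsQueryableFieldName("ExtraData"));  // True
Console.WriteLine(IsQueryableFieldName("Extra Data")); // False
```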
🥳 v5 is now pushed to nuget with dynamic fields!
Total noob here, but rather than using a POCO as the source document, I want to use a class that has a composite key and a Dictionary<string, string> for the "fields". There does not seem to be any way to do this. Could I build a DictionaryTokenizationBuilder?
I would need to iterate through the pairs in the dictionary, using the key for the field name and the value for the text that needs to be tokenized. Am I barking up the wrong tree? Is this even possible or sensible?
Lifti is a perfect fit for my use case, so I hope there's a way. Thanks.
Jim