Dynamic fields (was: DictionaryTokenization)

kampilan commented 1 year ago

Total noob here, but rather than using a POCO as the source document I want to use a class that has a composite key and a Dictionary<string,string> for the "fields". There does not seem to be any way to do this. Could I build a DictionaryTokenizationBuilder?

I would need to iterate through the pairs in the dictionary, using the Key for the field name and the Value for the text that needs to be tokenized. Am I barking up the wrong tree? Is this even possible or sensible?

Lifti is a perfect fit for my use case, so I hope there's a way. Thanks, Jim

mikegoatly commented 1 year ago

Hi @kampilan,

I'm starting by assuming your class definition looks something like this:

class Foo
{
    public string KeyPart1 { get; set; }
    public string KeyPart2 { get; set; }
    public IDictionary<string, string> Fields { get; set; }
}

Where both KeyPart1 and KeyPart2 need to be combined to make the composite key.

If the keys of your dictionary are well known and not subject to change, then you could get away with something like this:

var index = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<Foo>(o => o
        .WithKey(c => $"{c.KeyPart1}|{c.KeyPart2}") // Use a concatenated string for the composite key
        .WithField("Name", c => c.Fields["Name"])
        .WithField("Info", c => c.Fields["Info"])
    )
    .Build();

However I have a strong suspicion that this isn't going to be what you need! 😊

I think this issue relates to the concept of "dynamic fields", which is kind of where #66 was heading. How many unique keys are you expecting there to be in your dictionary?

kampilan commented 1 year ago

Hi Mike

You are spot on in your understanding of what I am trying to do. I don't want to do what you have suggested because the point of my strategy is to not have to know anything about the document I am trying to index and then subsequently search. As for the number of fields, in my case the number would be small, 5 or 6 at most, but one might expect about the same number as you would have on an object. Nirvana would be:

var index = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<Foo>(o => o
        .WithKey(c => $"{c.KeyPart1}|{c.KeyPart2}") // Use a concatenated string for the composite key
        .WithFields(c => c.Fields) // Wished-for method: treat every dictionary entry as a field
    )
    .Build();

Thanks for your time, Jim

mikegoatly commented 1 year ago

@kampilan I'm going to work on this for v5 because there are going to be some breaking changes to some interfaces.

I have some initial work done in the v5.0.0-dynamic-fields branch.

There's a quick sample use in the test console project, which creates this index:

var objects = new Dictionary<int, TestObject>
{
    {
        1,
        new TestObject(
            1,
            "Some details",
            new Dictionary<string, string> { { "Name", "Joe Bloggs" }, { "Profile", "Just placeholder text here" } })
    },
    {
        2,
        new TestObject(
            2,
            "Chillin with orange juice",
            new Dictionary<string, string> { { "Name", "Just Bob" }, { "FavouriteExercise", "Jumping jacks" } })
    }
};

var index = new FullTextIndexBuilder<int>()
    .WithObjectTokenization<TestObject>(o => o
        .WithKey(c => c.Id)
        .WithField("Details", x => x.Details)
        .WithDynamicFields(c => c.Data)
    )
    .Build();

Which, when run, demonstrates that the dynamic fields are registered, searchable, and can have matching search phrases extracted for them.

kampilan commented 1 year ago

That looks fantastic! Can't wait to try it. This will allow me to build a more generic implementation.

I have to tell you I have been enjoying working with Lifti. I have built a prototype that supports persistence and clustering.

Currently, and perhaps erroneously, I am deserializing the index from a cached byte array on each use. It seems to work very well. I did this as a means of guaranteeing thread-safety until I read your FAQ and saw that the index is already thread-safe. My question is: how expensive is the deserialization? Is my strategy a foolish waste? It has some simplicity benefits, but I don't want to do that at the expense of performance or resource utilization.

Finally, do you have any sense of what the limit of Lifti is? Is it a certain number of documents or keywords? Is it a memory limit?

Thanks so much Jim

mikegoatly commented 1 year ago

If you want to try it out, you can use the CI preview build from https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json - the last built version is 5.0.0-CI-20230612-170058.

With regards to deserializing the index on each request, that will cause the in-memory structures to be rebuilt on each operation. How "bad" that would be will depend on your usage and how frequently you're expecting it to happen, but keeping the same instances around will definitely be more efficient.
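As a rough illustration of "keeping the same instance around", here is a minimal sketch using Lazy<T>. Nothing in it is LIFTI API: LoadIndex is a hypothetical stand-in for whatever deserializes the index, and string[] stands in for the index type.

```csharp
using System;
using System.Threading;

int loadCount = 0;

// Hypothetical stand-in for deserializing an index from a cached byte array.
string[] LoadIndex()
{
    Interlocked.Increment(ref loadCount);
    return new[] { "doc1", "doc2" };
}

// The factory runs at most once, even if several threads ask for the value together.
var sharedIndex = new Lazy<string[]>(LoadIndex, LazyThreadSafetyMode.ExecutionAndPublication);

// Every "request" reuses the same in-memory instance.
for (int i = 0; i < 1000; i++)
{
    _ = sharedIndex.Value.Length;
}

Console.WriteLine(loadCount); // prints 1
```

This only pays off when the cached object is safe to share between threads, which the FAQ says is the case for the index.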

I've not really explored the limits of LIFTI other than the stress test of loading in and serializing 200 Wikipedia articles. The number of documents is probably going to be less of a bottleneck than the size of the documents and the number of unique words in them, because the index structure is held entirely in memory as things stand. Something to watch as the index grows will be the performance of some of the more "expensive" queries such as wildcard and fuzzy matching.

kampilan commented 1 year ago

I will try the new version and let you know.

I have changed my implementation to cache the index itself, not the serialized byte array. It's much faster ;)

Thanks for the performance tips. I'm able to search 20000 documents with 20 keywords each in 10-20 msecs. The serialized size is 5.8 MB. I'll increase the keyword count in my next round of testing.

Thanks Jim

kampilan commented 1 year ago

Hi Mike

How does one interpret the scores (both at the result level and at the individual match level)? What are the boundaries of these numbers? Can they be normalized into a percentage, perhaps, so the end user can get a sense of the strength of the result?

I will be trying the latest version this weekend. Thanks so much, Jim

mikegoatly commented 1 year ago

There is a bit of information about LIFTI scoring in the docs, but the bottom line is that for each search term, LIFTI uses Okapi BM25 to score results. This isn't a bounded algorithm, so the best you can do is use the resulting score to order the results.
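For reference, the textbook form of Okapi BM25 is sketched below. This is the standard formulation, not a statement of LIFTI's exact variant or parameter defaults. Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length in the index, and k_1 and b are tuning constants.

```latex
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Each term's fraction saturates below k_1 + 1, but the IDF weight depends on the corpus and the sum grows with the number of query terms, so there is no fixed maximum to turn into a percentage.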

What's sometimes more useful for users is to see where in the document the search terms were actually found - you can get these by extracting matched phrases from search results.

kampilan commented 1 year ago

A million apologies for not consulting the docs first. I will investigate what you suggest above. I was just trying to rationalize the score into something that would suggest to the user the strength of the match. Thanks so much.

mikegoatly commented 1 year ago

No problem at all! Let me know how you get on.

h0lg commented 1 year ago

If you want to try it out, you can use the CI preview build from https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json - the last built version is 5.0.0-CI-20230612-170058

FYI this does the trick using the dotnet CLI: dotnet add package Lifti.Core --source https://pkgs.dev.azure.com/goatly/LIFTI/_packaging/lifti-ci/nuget/v3/index.json --version 5.0.0-CI-20230612-170058

mikegoatly commented 1 year ago

Massive thanks to @h0lg for spotting the issue with index serialization. I've updated the index serialization logic:

All fields (static and dynamic) are now written to the serialized file. This has allowed me to be a bit smarter on deserialization and handle edge cases where the index builder has been modified since the serialized file was created - as long as fields have only been added (i.e. not removed or renamed) then the internal field ids will be adjusted to fit.
I'm now writing numbers using variations of 7-bit encoding. This means that the serialized file is likely going to be around 1/3 the size it was previously.
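To show why 7-bit encoding shrinks the file, here is a minimal sketch of the general technique. It is not LIFTI's actual serializer code, and Encode7Bit and Decode7Bit are names made up for this example. Each byte carries 7 bits of the value, and the high bit signals that another byte follows, so small numbers take one byte instead of four.

```csharp
using System;
using System.Collections.Generic;

// Encodes an unsigned integer, low 7 bits first; the high bit marks "more bytes follow".
static byte[] Encode7Bit(uint value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        bytes.Add((byte)((value & 0x7F) | 0x80));
        value >>= 7;
    }
    bytes.Add((byte)value);
    return bytes.ToArray();
}

// Reverses Encode7Bit by accumulating 7 bits per byte until the high bit is clear.
static uint Decode7Bit(byte[] bytes)
{
    uint result = 0;
    int shift = 0;
    foreach (var b in bytes)
    {
        result |= (uint)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;
        shift += 7;
    }
    return result;
}

Console.WriteLine(Encode7Bit(5).Length);           // prints 1
Console.WriteLine(Encode7Bit(300).Length);         // prints 2
Console.WriteLine(Decode7Bit(Encode7Bit(300)));    // prints 300
```

The saving depends on how small the stored numbers typically are; values near uint.MaxValue need five bytes under this scheme.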

The API has changed a little, and you now need to register a dynamic field provider with a unique name:

.WithDynamicFields("ExtraData", c => c.Data)

This name is used to marry up serialized dynamic fields with the relevant tokenizers, etc., when they are deserialized.

The latest CI version is 5.0.0-CI-20230620-190859

h0lg commented 1 year ago

@mikegoatly 5.0.0-CI-20230620-190859 works like a charm!

The new API feels way more natural for indexing nested objects. I was able to skip a whole lot of hoop-jumping in my updated implementation - love it!

I've just tested it by indexing 3000 YouTube videos with a combined JSON size of ~80 MB and it crunches the index down to ~43 MB. A simple search on the built index (including deserializing it first) takes 3 seconds on my dev rig.

There's only one minor issue concerning deserialization: Old indexes seem to deserialize without error - but searching them will throw with the updated tokenizer config (including dynamic fields).

I'd rather deal with indexes in an outdated format at deserialization time - because it's easier than guessing from the search error whether the problem might be an outdated index. Have you thought about how to deal with potentially outdated indexes - and if so, do you have an opinion or recommendation?

mikegoatly commented 1 year ago

@h0lg Thanks - I thought I'd handled deserialization of older indexes. The general approach I'm taking is:

The index header is inspected for the serializer version and an index reader of the appropriate version is used to read the data in the format it was written.
If the latest version of the serializer doesn't support deserializing from a version because of breaking changes in the index structure then an exception is thrown.
Indexes are always serialized in the latest version; there's currently no way to force an index to serialize in an older version format.
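The approach above can be sketched roughly as follows. This is an illustration of the pattern only, not LIFTI's serializer: the one-byte header, the version numbers, and the Serialize and Deserialize helpers are all assumptions made for the example, and a string stands in for the index data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

const byte LatestVersion = 3; // hypothetical version number

// One reader per supported format version (version 1 is deliberately unsupported).
var readers = new Dictionary<byte, Func<BinaryReader, string>>
{
    [2] = r => "v2:" + r.ReadString(),
    [3] = r => "v3:" + r.ReadString(),
};

// Always writes the latest format: a version header, then the payload.
byte[] Serialize(string payload)
{
    using var ms = new MemoryStream();
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write(LatestVersion);
        writer.Write(payload);
    }
    return ms.ToArray();
}

// Inspects the header and hands off to the reader for that version.
string Deserialize(byte[] data)
{
    using var reader = new BinaryReader(new MemoryStream(data));
    byte version = reader.ReadByte();
    if (!readers.TryGetValue(version, out var read))
    {
        throw new NotSupportedException($"Cannot read serialized version {version}");
    }
    return read(reader);
}

Console.WriteLine(Deserialize(Serialize("hello"))); // prints v3:hello
```

Failing at the header check means an unreadable file is rejected before any data is loaded, rather than surfacing later as a confusing search error.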

What's the error you're seeing?

h0lg commented 1 year ago

When I use 5.0.0-CI-20230620-190859 to search an index built with 4.0.1, I get the following error:

Field id 4 has no associated field name

at Lifti.IndexedFieldLookup.GetFieldForId(Byte id) in D:\a\1\s\src\Lifti.Core\IndexedFieldLookup.cs:line 41
at Lifti.Querying.Query.<>c__DisplayClass7_0`1.<Execute>b__1(ScoredFieldMatch m) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 69
at System.Linq.Enumerable.SelectListIterator`2.ToList()
at Lifti.Querying.Query.Execute[TKey](IIndexSnapshot`1 index) in D:\a\1\s\src\Lifti.Core\Querying\Query.cs:line 66
at Lifti.FullTextIndex`1.Search(IQuery query) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 264
at Lifti.FullTextIndex`1.Search(String searchText) in D:\a\1\s\src\Lifti.Core\FullTextIndex.cs:line 253

Removing the old index, the new one gets built, is less than half the size and searchable without an issue.

mikegoatly commented 1 year ago

I'm having trouble reproducing your exact issue @h0lg - would you mind sharing the index builder code before and after upgrading, and the v4 serialized index?

h0lg commented 1 year ago

@mikegoatly Sure - please have a look at the SubTubular branch lifti-index-object-graphs. Its last two commits contain one that updates Lifti from 4.0.1 to the 5.0 pre-release and the builder to use dynamic fields and one with instructions to build an index and reproduce the error.

Let me know if you have trouble reproducing the error or understanding anything over there.

mikegoatly commented 1 year ago

Got it sorted. I've added handling for this scenario to the older deserializers such that they verify that all the field ids in the index are present in the index they're being deserialized into. In your scenario you'll now get a LiftiException with the message:

Serialized index contains unknown field ids. Fields have most likely been removed from the FullTextIndexBuilder configuration.
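The check amounts to something like the sketch below. It is a guess at the shape of the logic, not the real implementation: VerifyFieldIds is a hypothetical helper, and InvalidOperationException stands in for LiftiException so the example runs without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Throws if the serialized data refers to a field id the current configuration does not know.
static void VerifyFieldIds(IEnumerable<byte> serializedFieldIds, ISet<byte> knownFieldIds)
{
    var unknown = serializedFieldIds
        .Where(id => !knownFieldIds.Contains(id))
        .Distinct()
        .ToList();

    if (unknown.Count > 0)
    {
        // Stand-in for LiftiException.
        throw new InvalidOperationException(
            "Serialized index contains unknown field ids: " + string.Join(", ", unknown));
    }
}

var known = new HashSet<byte> { 1, 2, 3 };
VerifyFieldIds(new byte[] { 1, 3 }, known); // passes silently

try
{
    VerifyFieldIds(new byte[] { 1, 4 }, known);
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // prints Serialized index contains unknown field ids: 4
}
```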

I'll find some time to update the documentation and get v5 released, assuming I don't run into any other bugs 😊

The latest CI build is Lifti.Core.5.0.0-CI-20230702-070029 if you want to try it out yourself.

h0lg commented 1 year ago

@mikegoatly Great - looking forward to it!

One additional remark: I realized that the field names I used in my example contain spaces. That seems to work fine - except for field queries. But maybe I didn't know the right syntax for writing field queries for field names containing spaces.

If this was a valid field config, how would I write a field query for it?

.WithDynamicFields("Extra Data", c => c.Data)

To be clear - I don't need it to be. But if it isn't, you may want to discourage or mention that somehow. Throw an exception? Add a caveat about limited field searchability to the docs?

mikegoatly commented 1 year ago

Yeah, there needs to be a way to quote field names in the query syntax. I think for now I'll keep it as a documented limitation but raise a separate issue and pick it up in a later release.

mikegoatly commented 1 year ago

🥳 v5 is now pushed to nuget with dynamic fields!

mikegoatly / lifti

Dynamic fields (was: DictionaryTokenization) #69