mikegoatly / lifti

A lightweight full text indexer for .NET
MIT License
184 stars 9 forks

Search for words with a `=` character #84

Closed: stealthin closed this issue 12 months ago

stealthin commented 12 months ago

Hello,

First of all, thank you for this amazing library.

I tried to index and search for words that contain a `=` character, but I get no results when I run the following code:

var index  = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<MyModel>(
        itemOptions => itemOptions
            .WithKey(b => b.Title)
            .WithField("Title", b => b.Title, tokenOptions => tokenOptions.CaseInsensitive().AccentInsensitive()))
    .Build();
await index.AddRangeAsync(new List<MyModel>
{
    new MyModel("1", "MyText=test3")
});
var results = index.Search("MyText=test3");
Console.WriteLine(results.Count());

public sealed record MyModel(string Key, string Title);

Would it be feasible to escape the = character that is part of the text itself? (I don't want the search text to be changed)

Many thanks!

stealthin commented 12 months ago

If I test the following code:

var index  = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<MyModel>(
        itemOptions => itemOptions
            .WithKey(b => b.Title)
            .WithField("Title", b => b.Title, tokenOptions => tokenOptions.CaseInsensitive().AccentInsensitive().SplitOnPunctuation(false).IgnoreCharacters('=')))
    .Build();
await index.AddRangeAsync(new List<MyModel>
{
    new MyModel("1", "TEST=test3,TEST=othertest")
});
var tokenizer = index.GetTokenizerForField("Title");
var search = tokenizer.Normalize("TEST=test3");
var suggestions = GetSuggestions(search);
var results = index.Search(search);
Console.WriteLine(results.Count());

IEnumerable<string> GetSuggestions(string input)
{
    using var navigator = index.CreateNavigator();
    navigator.Process(input.AsSpan());
    return navigator.EnumerateIndexedTokens().ToList();
}
public sealed record MyModel(string Key, string Title);

I get TESTTEST3,TESTOTHERTEST from suggestions, which is not what I'd like. It should be TEST=test3,TEST=othertest instead.

mikegoatly commented 12 months ago

Hi!

The = in a LIFTI query is used to restrict a search to a specific field. At the moment there's no syntax to escape that. If you don't need the full LIFTI query syntax, then this should work for you:

var index  = new FullTextIndexBuilder<string>()
+    .WithSimpleQueryParser()
    .WithObjectTokenization<MyModel>(
        itemOptions => itemOptions
            .WithKey(b => b.Title)
            .WithField("Title", b => b.Title, tokenOptions => tokenOptions.CaseInsensitive().AccentInsensitive()))
    .Build();
await index.AddRangeAsync(new List<MyModel>
{
    new MyModel("1", "MyText=test3")
});
var results = index.Search("MyText=test3");
Console.WriteLine(results.Count());

public sealed record MyModel(string Key, string Title);

I see what you were trying to do with the .IgnoreCharacters('=') part of the second example, but IgnoreCharacters actually strips a set of characters from the input as if they were never there, so that's definitely not what you want :)
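
To make the stripping behaviour concrete, here is a rough sketch in plain C# that mimics the effect on the text from the second example. It does not use LIFTI at all: `StripAndFold` is a hypothetical helper, and folding to upper case is only an assumption based on the suggestions reported above, not a description of how the tokenizer is implemented.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the *effect* of IgnoreCharacters('=') plus
// case-insensitive normalization. This is NOT LIFTI's implementation.
static string StripAndFold(string input, params char[] ignored) =>
    new string(input.Where(c => Array.IndexOf(ignored, c) < 0).ToArray())
        .ToUpperInvariant();

// The '=' characters disappear entirely, so the two halves run together.
var normalized = StripAndFold("TEST=test3,TEST=othertest", '=');
Console.WriteLine(normalized); // TESTTEST3,TESTOTHERTEST
```

Because the `=` is removed before anything is stored, the index never sees it, which is why the suggestions come back without it.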

stealthin commented 12 months ago

Crystal clear! Thank you for your quick answer!

mikegoatly commented 12 months ago

No problem - glad I could help. I just had another thought - you could have also worked around this using a manually constructed query:

var index  = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<MyModel>(
        itemOptions => itemOptions
            .WithKey(b => b.Title)
            .WithField("Title", b => b.Title, tokenOptions => tokenOptions.CaseInsensitive().AccentInsensitive()))
    .Build();
await index.AddRangeAsync(new List<MyModel>
{
    new MyModel("1", "MyText=test3")
});

+ var normalizedSearchText = index.GetTokenizerForField("Title").Normalize("MyText=test3");
+ var query = new Query(new ExactWordQueryPart(normalizedSearchText));
+ var results = index.Search(query);
Console.WriteLine(results.Count());

That way you're bypassing the query parser completely and are being explicit about the fact you want the = in the word.
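
For contrast, this toy sketch shows roughly how a `field=term` reading carves up the same text, which is what the manually constructed query sidesteps. It is plain C# with no LIFTI involved: `ReadAsFieldRestriction` is a hypothetical helper, and the real query parser is certainly more involved than a single split on the first `=`.

```csharp
using System;

// Toy illustration only: treat the first '=' as a field/term separator.
// This is NOT LIFTI's query parser, just the general idea behind field=term.
static (string Field, string Term) ReadAsFieldRestriction(string queryText)
{
    var separator = queryText.IndexOf('=');
    return separator < 0
        ? (string.Empty, queryText)                        // no field restriction
        : (queryText[..separator], queryText[(separator + 1)..]);
}

var (field, term) = ReadAsFieldRestriction("MyText=test3");
Console.WriteLine($"field: {field}, term: {term}"); // field: MyText, term: test3
```

Read that way, the text is a search for `test3` in a field called `MyText`, not a search for the single word `MyText=test3`. Passing the normalized text straight to `ExactWordQueryPart` avoids that interpretation altogether.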

stealthin commented 12 months ago

That makes sense! This is what I ended up doing 😄