ninjanye / SearchExtensions

Library of IQueryable extension methods to perform searching
MIT License
331 stars 52 forks source link

Additional test and bug fix for multiple search params #13

Closed mattbarkerdev closed 7 years ago

mattbarkerdev commented 8 years ago

hello again! Spotted an issue with the weighting expression the other day. When searching with multiple params, it was adding incorrect weighting if the searchphrase wasn't contained in the text. Added some new equality checks to the expression builders and its happy again. You only really need the EnumerableExpressionHelper.cs changes but i thought i'd send the tests too

ninjanye commented 8 years ago

Thanks!!! This looks good, I'll review this evening and merge it in. Just out of curiosity (I'm still trying to fully understand the search algorithm), am I right in saying that the longer the text being searched the more likely it is that the result will be near the top? For example, if a search term is found 3 characters in to a 10 character term, that will appear behind a result that is found as the 20th character in a 100 character term...?

mattbarkerdev commented 8 years ago

Morning - yes, for now the weighting based on the index of the search string to the length of the word does have the (probably undesired) side effect of over weighting words that are longer. For example

alberto will have a hit result of 5 however Berto will also have a result of 5 also francobertoli would have a result of 10

Haven't quite thought my way around how to prevent it increasing when the word is long. Given I'm currently only name matching with it, it hasn't come up too frequently (yet!) I suppose I could artificially deflate the hit count based on distance away from the start as opposed to increasing it due to the proximity to it but then i suspect I might have to handle negative hit results which is liable to get messy!

I'll have a think and get back to you!

ninjanye commented 8 years ago

Have you thought about using Levenshtein distance? SearchExtensions does support this (blog post) but only as an in memory action. You could return all matching records then order by the smallest distance (i.e. changes required).

I have put a fair amount of effort in to making the Levenshtein search perform well so it is able to calculate the distance of over a million records in under a second.

It might not be right for your problem, but just a thought

mattbarkerdev commented 8 years ago

that isn't a bad idea - I did see the implementation you put in. I'll try running some tests here and see what it comes up with! thanks for the idea!

mattbarkerdev commented 8 years ago

Morning! I haven't forgotten about this yet I promise! Just wondered if the first merge has made it into the nuget build yet? thanks!

ninjanye commented 8 years ago

Hi @mattbarker1983

The current version on nuget does not include your PR. I was hoping to take the opportunity to refactor some bits here and there, but like you, there is not enough time. I'll pick it back up and see if I can get it out.

Cheers

ninjanye commented 8 years ago

I was also wondering on the outcome of the validity of the search method. Did you manage to look into using Levenschtein distance at all?

P.S. I have updated the `.ToRanked() method to take an enum. Unfortunately this means this PR will need tweaking, sorry

mattbarkerdev commented 8 years ago

Yeah, time is a little pressed at the moment I must admit!

I did have a quick scan through the Levenschtein distance stuff. It looks like it could also work and yield the type of outcome that I'd expect. I was probably tired though as it looked like the implementation in the source compared one string to another rather than multiple to multiple. The usage currently is searching over three (maybe more in future) properties of an object against any number of incoming search phrases. Unless I'm missing something from last night, I couldn't see a way of combining them in the same way.

return regUsers .Search(s => s.FirstName, s => s.LastName, s => s.Title) .ContainingAll(SearchQueryItems) .ToLeftWeightedRanked() .OrderByDescending(r => r.Hits) .Select(r => r.Item);

ninjanye commented 8 years ago

Ah, ok. The multiple thing might not work straight off the bat. Ok, I'll create a new release from the master branch. Did you want me to wait for you to fix up this PR? (I broke it when I changed ToRanked() to take an overload)

mattbarkerdev commented 8 years ago

No worries! I should have some time this week to look at the PR. I'll grab the changes you made and fix it up. No hurry :) I'm sure we both have enough to be getting on with normally heh. I'll shout you when I've fixed it.

Cheers!

ninjanye commented 8 years ago

Hi @mattbarker1983

Something you mentioned a couple of posts ago was the inability to perform a Levenshtein search on more than one property and more than one search term. Well, it got me thinking and I've now added this as a feature to SearchExtensions. It can be used like this:

var result = context.Users.LevenshteinDistanceOf(x => x.FirstName, x => LastName)
                          .ComparedTo("Jim", "Fred");

It's not published to nuget just yet (I'm waiting to hit 10k downloads) but I wondered if this would be something that could be useful to you?

mattbarkerdev commented 8 years ago

Hi @ninjanye

Thanks for this! Yeah I believe it will come in handy! Sorry I've gone a little dark of late, I'm getting married next week so I've got quite a bit on my plate at the moment! Looks like it would work for our purposes though!

ninjanye commented 8 years ago

Congratulations! Hope the day goes well. With 10k downloads fast approaching, I've completed the Levenschtein work above. If possible, I'd like to temporarily remove the left weighted search stuff and release the Levenschtein update. This will allow me more time to become clear about the Left weighted search use case.

I really hate to have to ask as I don't get a lot of pull requests, and you've put a lot of effort into your work, which I really appreciate.

Once the latest release goes out, I'll reinstate the left weighted with and do some research into his often it might be used.

Cheers John

mattbarkerdev commented 8 years ago

Not a problem at all - it isn't quite ready and may in fact be redundant now if you're new work additions have the desired impact! on't worry about the time spent on it, I don't think its quite complete yet so I wouldn't worry!