gurtejboparai / bookwyrm

A project for Software Engineering II Winter 2022

Make book searches token based #217

Closed CameronJung closed 2 years ago

CameronJung commented 2 years ago

Currently our book searches match against the whole string. A token-based approach that ranks search results by how many tokens match between the search query and the book title would be much better for users.
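To make the idea concrete, here's a minimal sketch of what "token based" could mean: split the query into tokens and score each title by how many tokens it contains. The class and method names here are hypothetical, not part of the project's code.

```java
import java.util.*;

public class TokenScore {
    // Hypothetical scoring helper: count how many query tokens appear in the title.
    static int score(String title, String[] queryTokens) {
        String lower = title.toLowerCase();
        int matches = 0;
        for (String token : queryTokens) {
            if (lower.contains(token.toLowerCase())) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] query = "of mice and men".split("\\s+");
        System.out.println(score("Of Mice and Men", query)); // 4
        System.out.println(score("East of Eden", query));    // 1 (only "of" matches)
    }
}
```

Results would then be sorted by this score instead of requiring the whole string to match.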

LukeBMorrow commented 2 years ago

Did some digging into this. There is a way to do it using text scores, which is a prebuilt search engine over the fields you indicate, but you have to pay for a MongoDB upgrade to use it. Doing it in Spring would be computationally intensive for a large database. Should we be meta about it and implement it anyway, knowing we'll never have a big DB of books?

CameronJung commented 2 years ago

I think this is a question the whole team should have a part in answering. You could put the question on the Discord, or just wait for the next meeting. However, I would like to know if this is a Spring-specific issue, since doing anything with a large enough database is computationally intensive. Or is this just a matter of the optimization it would take?

LukeBMorrow commented 2 years ago

It's just the type of task it is. We'll have to fetch the entire book table (not fast in itself), then do a manual text scoring and sorting.

I think we'd be looking at O(nmk + n log n), where n is the number of books in the DB, m is the number of terms in the search, and k is the complexity of String.contains(). That's in addition to however slow it is to fetch the entire book table. Again, I don't think it would be noticeable at our current scale, but it's fun to consider bigger scales.
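To show where each term of that estimate comes from, here's a sketch of the manual score-and-sort pass (an in-memory list stands in for the fetched book table; all names are hypothetical):

```java
import java.util.*;

public class RankBooks {
    // Score every book against every term, then sort by score.
    // The scoring loop is O(n * m * k): n books, m terms, k = cost of contains().
    // The final sort adds O(n log n).
    static List<String> rank(List<String> books, String[] terms) {
        Map<String, Integer> scores = new HashMap<>();
        for (String title : books) {                          // n iterations
            String lower = title.toLowerCase();
            int s = 0;
            for (String term : terms) {                       // m iterations
                if (lower.contains(term.toLowerCase())) s++;  // O(k) substring scan
            }
            scores.put(title, s);
        }
        List<String> ranked = new ArrayList<>(books);
        ranked.sort(Comparator.comparingInt(
                (String t) -> scores.get(t)).reversed());     // O(n log n)
        return ranked;
    }

    public static void main(String[] args) {
        List<String> books = Arrays.asList(
                "East of Eden", "Of Mice and Men", "The Mice Templar");
        System.out.println(rank(books, "mice and men".split("\\s+")).get(0));
        // best-matching title comes out first
    }
}
```

Fetching the whole table is on top of this, as noted above.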

CameronJung commented 2 years ago

Yeah, but that's still on the order of O(n log n), which isn't that bad, especially with unsorted data. And there would be a lot of room for optimization.

When I was in first year I had an assignment to write a program to do things like find anagrams in a dictionary of 10,000 or so words. Instead of determining outright whether an entry was an anagram, my algorithm first ran through every check that could prove an entry can't be an anagram and tossed it from consideration. Then it would repeat the next check on the new, smaller list, and so on.

We could do something similar here. For example, in a search for "Of Mice and Men" it could start by tossing out every book that doesn't have the longest word, "mice", in the title. I'd be surprised if you ended up with many books after that massacre.
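A minimal sketch of that pruning pass, assuming the same in-memory list of titles as before (the class and method names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class FilterFirst {
    // Hypothetical pruning pass: keep only titles that contain the longest
    // query token, so the expensive full scoring only runs on the survivors.
    static List<String> prune(List<String> titles, String[] tokens) {
        String longest = Arrays.stream(tokens)
                .max(Comparator.comparingInt(String::length))
                .orElse("");
        String needle = longest.toLowerCase();
        return titles.stream()
                .filter(t -> t.toLowerCase().contains(needle))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Of Mice and Men", "The Mice Templar", "East of Eden", "Men at Arms");
        System.out.println(prune(titles, "of mice and men".split("\\s+")));
        // keeps only the titles containing "mice"
    }
}
```

The worst case is still the same order, but in practice the scoring and sorting would run over a much smaller list.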

LukeBMorrow commented 2 years ago

I believe for time-complexity cases you actually assume all variables are equal to n, so it's on the order of O(n^3), assuming .contains() is O(n). What was the time complexity of your filter?

In either case, I'm going to need to start working on the presentation. So if someone else wants to implement it, they can, though I think there are more impactful things we can spend our time on than improving an already working search.

CameronJung commented 2 years ago

I'm going to give this a try