jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching
MIT License

`remove()` to complement `add()` ? #30

Closed · o0101 closed this issue 2 years ago

o0101 commented 2 years ago

Hello, @jeancroy I LOVE your library! It's fantastic. So much better than the many other popular fuzzy search libraries.

A lot of your writing on the theory and the algorithm implementation goes over my head, and I don't have time to get into it, but I love that it's there. The mere presence of extensive intellectual material gives me extra confidence in your library, plus I can tell you really value the things you create and aim to produce stuff that works well. So I give your library many stars! ⭐ ⭐ ⭐ ⭐ ⭐ ⭐

I also LOVE how you have an add method; it fits my use case perfectly. My use case is adding documents (and queries, but those are shorter and easier to handle) to an index inside a desktop app that archives your browsing history so you can read it when you are offline. I recently added full-text search to this (using flexsearch and ndx), and while those are good (and serializable, to save time on restarting the app), they do not support typos. So I went searching for a fuzzy library, and after a few false starts, ended up here (actually via GitHub search, not npm search or Google search). I'm really thankful I did end up here.

Even if you can't implement any of these things (such as serialization, or remove), I am still immensely grateful for your library. Thank you for crafting such a thing of beauty and releasing it into the world for all to use!

Anyway, I'd like there to be a remove, but I know it's not your job to make it, and I am sure you are in any case super busy. But let me just describe it here for the sake of the issue.

If I added a document to the index using add(), and then later called remove() on that document, it would no longer show up in search results, nor in serializations (such as they may be) of the index.

My particular motivation for this feature is that very often, a webpage (with the same URL) will add additional content (such as comments on a story), and my archiver would like to index those so that the person using the archiver can get (and search for) the most up-to-date version of the content they have seen.

Right now my solution is simply to discard all fz-search results that are older duplicates, but I am worried about the scalability of this, so I have delayed deploying it to production. People may re-index a page many times as its content changes, and if I never remove those pages from the fuzzy search index, my index will end up dominated by out-of-date pages that I no longer need. So you can see I'm very scared of this possibility! It would be terrible!
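For context, the "discard older duplicates" workaround might look roughly like this sketch. The result shape, and the `url` and `indexedAt` fields, are assumptions for illustration, not fz-search's actual API:

```javascript
// Keep only the newest result per URL, discarding stale duplicates.
// Assumes each result carries a `url` and an `indexedAt` timestamp
// (hypothetical fields attached by the archiver, not by fz-search).
function dedupeByUrl(results) {
  const newest = new Map();
  for (const r of results) {
    const prev = newest.get(r.url);
    if (!prev || r.indexedAt > prev.indexedAt) newest.set(r.url, r);
  }
  return [...newest.values()];
}

const results = [
  { url: "a", indexedAt: 1, title: "old" },
  { url: "a", indexedAt: 2, title: "new" },
  { url: "b", indexedAt: 1, title: "only" },
];
// dedupeByUrl(results) keeps the newer "a" entry and the "b" entry
```

The scalability worry stands, though: this only hides stale entries from results; they still occupy the index.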

But it's OK. If you cannot implement a remove (or even serialization; I commented on another issue, #21), I will still use your wonderful fz-search library to provide typeahead and suggestions for search queries (just like Google search), rather than for results. It will be a little sad not to be able to search with typos, but I'm sure I can figure out another way to do it somehow.

Please delete this issue if you already have this functionality (how embarrassing! 😱 )

jeancroy commented 2 years ago

So the main complication is that a lot of data is matched by positional index in an array.

So a temporary removal could be:

- `this.source[idx] = null`
- `this.index[idx] = null`
- `this.store` => iterate over all fields to remove `idx`
- `this.index_map` => get the unique id of the item to remove, then invalidate it (using -1, null?)

Then a lot of null checks and invalid-entry checks would be needed.
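As a sketch of that soft-removal idea, using the internal field names mentioned above (`source`, `index`) but a hypothetical engine object and `removed` set, not fz-search's real internals:

```javascript
// Hypothetical soft remove: null out the slot and remember it's invalid.
// Search code would then have to skip null / removed entries everywhere.
function softRemove(engine, idx) {
  const item = engine.source[idx];
  engine.source[idx] = null; // drop the original document
  engine.index[idx] = null;  // drop its indexed form
  engine.removed.add(idx);   // so lookups can skip this slot
  return item;
}

const engine = {
  source: ["doc0", "doc1", "doc2"],
  index: ["tok0", "tok1", "tok2"],
  removed: new Set(),
};
softRemove(engine, 1);
// engine.source is now ["doc0", null, "doc2"], and 1 is in engine.removed
```

The cost is exactly the null/invalid checks mentioned above: every code path that walks the arrays has to tolerate holes.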

A more complete removal would require mapping every old position to a new position. I'm not sure that's worth the trouble if rebuilding the index takes less than, say, 200ms. Even if rebuilding the index took a full second, I'd invest in a low-priority background worker instead of the complexity of patching the index to new positions.
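The old-position-to-new-position mapping that a full removal would need can be sketched like this (a generic compaction pass, not code from the library):

```javascript
// Compaction sketch: drop removed slots and record old -> new positions,
// so any stored positional references could be remapped afterwards.
function compact(source, removed) {
  const oldToNew = new Map();
  const compacted = [];
  source.forEach((item, oldIdx) => {
    if (removed.has(oldIdx)) return; // skip deleted slots
    oldToNew.set(oldIdx, compacted.length);
    compacted.push(item);
  });
  return { compacted, oldToNew };
}

const { compacted, oldToNew } = compact(["a", "b", "c", "d"], new Set([1]));
// compacted -> ["a", "c", "d"]; oldToNew maps 0->0, 2->1, 3->2
```

The compaction itself is trivial; the trouble is that every structure holding an old index has to be rewritten through `oldToNew`, which is why a plain rebuild is often simpler.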

I can see the add() method being useful where a client is drip-fed documents. For remove() I'm not sure of the use case; I suspect it depends on the serialization issue.

In any case, I'm not sure of the benefit of having this in the library versus a soft delete done by the user of the library.

I'm also working on the assumption that the index can be rebuilt on demand almost instantaneously; I just think it's wasteful to recompute it on each search (though many libraries do pay that cost). In the case of instant search, one keystroke = one search.
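The invalidate-and-rebuild pattern being described could be wrapped like this, assuming a hypothetical `buildIndex(docs)` that does the expensive work (this is a sketch, not fz-search's API):

```javascript
// Lazy rebuild: mutations only mark the index dirty; the expensive
// buildIndex() runs at most once, just before the next search.
function makeLazyIndex(buildIndex) {
  let docs = [];
  let index = null;
  let dirty = true;
  return {
    add(doc) { docs.push(doc); dirty = true; },
    remove(pred) { docs = docs.filter(d => !pred(d)); dirty = true; },
    search(query) {
      if (dirty) { index = buildIndex(docs); dirty = false; }
      // Plain substring filter as a stand-in for the real fuzzy search.
      return index.filter(d => d.includes(query));
    },
  };
}

let builds = 0;
const idx = makeLazyIndex(docs => { builds++; return [...docs]; });
idx.add("hello");
idx.add("help");
idx.search("hel");   // triggers one build
idx.search("hello"); // reuses the index, no rebuild
```

This keeps the "one keystroke = one search" path cheap: a burst of adds and removes costs one rebuild, not one per mutation or per query.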

jeancroy commented 2 years ago

As I said, right now I gravitate toward invalidate-and-rebuild of the cache. But I'm interested in your rebuild times, and if they are unacceptable I'll see whether it's worth properly patching the cache.