jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching
MIT License
194 stars 32 forks source link

Proper way to use FuzzySearch.prototype.add() #9

Closed aguynamedben closed 6 years ago

aguynamedben commented 6 years ago

Hi, like others, I can't believe not more people are aware of this library. I love it. Thanks you for working on it. I'm happy to contribute to the project if this is a bug that can be fixed or better docs are helpful.

I'm trying to use FuzzySearch.prototype.add() to add new documents to my searcher after initial index creation. After calling searcher.add(), I introspect searcher.index.length and searcher.nb_indexed and can see that the document has been added to the index. But when I call searcher.search(), the expected result is missing.

Here's my test code:

    // TEST

    const docs = [
      { _id: 1, title: 'Item 1', domain: 'item1.com' },
      { _id: 2, title: 'Item 2', domain: 'item2.com' },
    ];

    const keys = {
      title: 'title',
      domain: 'domain',
    };

    const searcher = new FuzzySearch({
      source: docs,
      keys: keys,
      identify_item: function(doc) {
        return doc._id;
      },
    });

    let results;

    // First test search
    console.log(`Results for 'title:Item', round 1`);
    results = searcher.search('title:Item');
    results.forEach((result) => {
      console.log(result);
    });
    console.log(`There are ${searcher.nb_indexed} docs`);

    // Add a doc
    console.log(`Adding a doc`);
    const newDoc = { _id: 3, title: 'Item 3', domain: 'item3.com' };
    searcher.add(newDoc, keys);
    // Do I need searcher.dirty = true; here? That doesn't work either

    // Sanity check internals
    console.log(`Doc added, now there are ${searcher.nb_indexed} docs, let's search 'em!`);

    // Second test search
    console.log(`Results for 'title:Item', round 2`);
    results = searcher.search('title:Item');
    results.forEach((result) => {
      console.log(result);
    });

Console output:

Results for 'title:Item', round 1
commandDB.js:76 Object {_id: 1, title: "Item 1", domain: "item1.com"}
commandDB.js:76 Object {_id: 2, title: "Item 2", domain: "item2.com"}
commandDB.js:78 There are 2 docs
commandDB.js:81 Adding a doc
commandDB.js:86 Doc added, now there are 3 docs, let's search 'em!
commandDB.js:89 Results for 'title:Item', round 2
commandDB.js:92 Object {_id: 1, title: "Item 1", domain: "item1.com"}
commandDB.js:92 Object {_id: 2, title: "Item 2", domain: "item2.com"}

The expected result is that the Item 3 document would also show in the results, but it doesn't

Route 1

I noticed that there is a FuzzySearch.prototype.dirty property, and I've tried setting searcher.dirty = true; after search.add() but that doesn't fix the problem, and in fact it resets searcher.nb_indexed back to 2 if I do that. Maybe there is an issue where settings searcher.dirty = true; causes a refresh of the index based on original data, and documents added are lost?

Route 2

Maybe I'm just not adding the documents in the right way?

Other explorations

The issue doesn't seem to be related to the usage of Tagged Search or identify_item.

Any hints or ways you could point me are much appreciated. I'm happy to hunt things down. Thanks again for building this great library.

jeancroy commented 6 years ago

I'll brainstorm a bit here.

So we maintain two collection source and index.

dirty is a state where index is out of sync with source. For example we set a new source, change indexed keys or change an option that affect indexing (like acronym)

We rebuild dirty index when we set new options OR during a search if needed.

The rebuild process clean the index and call Add on each item of the source.


I haven't traced why your example doesn't work (while other works).
But at the root there might be a design issue.

We wanted the source to be read only, so calling Add() manually only modify the index, which is only temporary and can be re-synced at any time. In fact making a search while having the dirty flag on will trigger a resync and thus clear late added data.

Additionally I think It make sense that add does not append to source collection, because we loop add on each item of the source collection. This may be fixed by having a public add and private _add methods.

So what to do from now on?

Downside is that you'll re-compute index from scratch. It's not that expensive if it's not often.

Downside is duplicate over time, we won't use identify_item to tidy up the source (but we still do it for index). I might kill public method Add because the temporary nature of it is confusing.

Downside is the you lose some sync property as described in (A)

Not sure which behavior should clear this temp collection. Maybe setting a new source.


CC @tehsven for having made the add functionality.

tehsven commented 6 years ago

Hey @aguynamedben! I was able to get your example working by making keys an array instead of a hash. The third document was being indexed incorrectly because the add method assumes that the keys param is an array. Unfortunately, it does not include the parsing logic for the keys param, as does the constructor for the FuzzySearch engine, which allows it to be specified has a hash: https://github.com/jeancroy/FuzzySearch/blob/master/dist/FuzzySearch.js#L241

Changing this line:

const keys = {
  title: 'title',
  domain: 'domain',
};

to this:

const keys = ['title', 'domain'];

solves the problem:

Results for 'title:Item', round 1
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
There are 2 docs
Adding a doc
Doc added, now there are 3 docs, let's search 'em!
Results for 'title:Item', round 2
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
{_id: 3, title: "Item 3", domain: "item3.com"}
aguynamedben commented 6 years ago

I’m blown away by the quality of these responses. Thanks so much. I’ll dig into the details, design, and potential workarounds/solutions, and see what I find. I think this is enough information for now, thank you.

I’m more than willing to spend a few hours doing grunt work on the project in return. Thanks for helping me figure this out! 🙏

jeancroy commented 6 years ago

Maybe a good first step is to amend the Add method so it does not take keys as input. (Anyways I'm not sure the search would work properly if we give different keys for different items)

Instead, inside the Add method, use this.keys that has been pre-parsed by the constructor / option hook.

This would ripple up to _prepSource but since it's always called like so

this._prepSource(this.source, this.keys);

I would have no problem removing both arguments of that method. I think it's an extension point that I never really used. The discussion about making late added item more permanent (points A-D) would probably change the way we deal with source(s) too. I'm open to renaming the method to something about index building.

aguynamedben commented 6 years ago

Okay, I made a PR that incorporated many of the changes recommended here. Thanks for the guidance. It now works as expected in my code.

2 remaining questions:

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

Thanks again to the both of you for your guidance!

P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

jeancroy commented 6 years ago

I'm OK with MIT license. package.json already say so I'm not sure why it's not recognised by Github.

I'm also open to properly register this with npm & yarn etc...

I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

No performance problem, source is never used except to build cache.

Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.

Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.

If you ever have performance problem using add(), one thing I can see missing is idempotency. IE something that tell the index builder that the current entry is just fine, no need to recompute it.

To address this we can have a user defined same_item(a,b) together with identify_item(a). For example you may know that two items with the same Id will always have the same content. Because we keep pointer to the original source item (this.index[n].item) same can default to being the exact same object

With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update() method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff.

We can still provide add() as an utility function to manage source (and do a partial update on this entry). That would be consistent with your needs I think.


P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

When we are finished the warning on add should be removed. Ie current add become _add and the public add() will always be "safe" against index rebuild.

Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.

I'd make explicit in comment that add modify source.

aguynamedben commented 6 years ago

Okay, all of that makes a lot of sense, I’ll make another go at my PR tomorrow, check white space, snake_case consistent, and figure out add vs _add, maybe multiple sources. Thanks again for contributing and helping. On Tue, Dec 12, 2017 at 9:43 PM Jean Christophe Roy < notifications@github.com> wrote:

I'm OK with MIT license. package.json already say so https://github.com/jeancroy/FuzzySearch/blob/master/package.json#L28 I'm not sure why it's not recognised by Github.

I'm also open to properly register this with npm & yarn etc...

I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

No performance problem, source is never used except to build cache.

Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.

Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.

If you ever have performance problem using add(), one thing I can see missing is idempotency. IE something that tell the index builder that the current entry is just fine, no need to recompute it.

To address this we can have a user defined same_item(a,b) together with identify_item(a). For example you may know that two items with the same Id will always have the same content. Because we keep pointer to the original source item (this.index[n].item) same can default to being the exact same object

With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update() method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff. We can still provide add() as an utility function to manage source (and do a partial update on this entry). That would be consistent with your needs I think.

P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

When we are finished the warning on add should be removed. Ie current add become _add and the public add() will always be "safe" against index rebuild.

Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.

I'd make explicit in comment that add modify source.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jeancroy/FuzzySearch/issues/9#issuecomment-351289377, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGlQaAbhuF6YGcaVydPOE0_ADQbfivJks5s_2QIgaJpZM4Q4xJd .

aguynamedben commented 6 years ago

@jeancroy I updated my PR with the following changes:

image

Thoughts?

PR here: https://github.com/jeancroy/FuzzySearch/pull/10

aguynamedben commented 6 years ago

PR #10 was merged and @jeancroy added some more sugar to it. Details on how to use add() will be added with https://github.com/jeancroy/FuzzySearch/pull/12/files. Closing for now.