Proper way to use FuzzySearch.prototype.add()

jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching

MIT License

194 stars 32 forks source link

Proper way to use FuzzySearch.prototype.add() #9

Closed aguynamedben closed 6 years ago

aguynamedben commented 6 years ago

Hi, like others, I can't believe not more people are aware of this library. I love it. Thanks you for working on it. I'm happy to contribute to the project if this is a bug that can be fixed or better docs are helpful.

I'm trying to use FuzzySearch.prototype.add() to add new documents to my searcher after initial index creation. After calling searcher.add(), I introspect searcher.index.length and searcher.nb_indexed and can see that the document has been added to the index. But when I call searcher.search(), the expected result is missing.

Here's my test code:

    // TEST

    const docs = [
      { _id: 1, title: 'Item 1', domain: 'item1.com' },
      { _id: 2, title: 'Item 2', domain: 'item2.com' },
    ];

    const keys = {
      title: 'title',
      domain: 'domain',
    };

    const searcher = new FuzzySearch({
      source: docs,
      keys: keys,
      identify_item: function(doc) {
        return doc._id;
      },
    });

    let results;

    // First test search
    console.log(`Results for 'title:Item', round 1`);
    results = searcher.search('title:Item');
    results.forEach((result) => {
      console.log(result);
    });
    console.log(`There are ${searcher.nb_indexed} docs`);

    // Add a doc
    console.log(`Adding a doc`);
    const newDoc = { _id: 3, title: 'Item 3', domain: 'item3.com' };
    searcher.add(newDoc, keys);
    // Do I need searcher.dirty = true; here? That doesn't work either

    // Sanity check internals
    console.log(`Doc added, now there are ${searcher.nb_indexed} docs, let's search 'em!`);

    // Second test search
    console.log(`Results for 'title:Item', round 2`);
    results = searcher.search('title:Item');
    results.forEach((result) => {
      console.log(result);
    });

Console output:

Results for 'title:Item', round 1
commandDB.js:76 Object {_id: 1, title: "Item 1", domain: "item1.com"}
commandDB.js:76 Object {_id: 2, title: "Item 2", domain: "item2.com"}
commandDB.js:78 There are 2 docs
commandDB.js:81 Adding a doc
commandDB.js:86 Doc added, now there are 3 docs, let's search 'em!
commandDB.js:89 Results for 'title:Item', round 2
commandDB.js:92 Object {_id: 1, title: "Item 1", domain: "item1.com"}
commandDB.js:92 Object {_id: 2, title: "Item 2", domain: "item2.com"}

The expected result is that the Item 3 document would also show in the results, but it doesn't

Route 1

I noticed that there is a FuzzySearch.prototype.dirty property, and I've tried setting searcher.dirty = true; after search.add() but that doesn't fix the problem, and in fact it resets searcher.nb_indexed back to 2 if I do that. Maybe there is an issue where settings searcher.dirty = true; causes a refresh of the index based on original data, and documents added are lost?

Route 2

Maybe I'm just not adding the documents in the right way?

Other explorations

The issue doesn't seem to be related to the usage of Tagged Search or identify_item.

Any hints or ways you could point me are much appreciated. I'm happy to hunt things down. Thanks again for building this great library.

jeancroy commented 6 years ago

I'll brainstorm a bit here.

So we maintain two collection source and index.

source is the original collection and I like to keep it read only.
index is a "sliced & diced" version of source, where we precompute lowercase, split into words etc. We only copy indexed fields.

dirty is a state where index is out of sync with source. For example we set a new source, change indexed keys or change an option that affect indexing (like acronym)

We rebuild dirty index when we set new options OR during a search if needed.

The rebuild process clean the index and call Add on each item of the source.

I haven't traced why your example doesn't work (while other works).
But at the root there might be a design issue.

We wanted the source to be read only, so calling Add() manually only modify the index, which is only temporary and can be re-synced at any time. In fact making a search while having the dirty flag on will trigger a resync and thus clear late added data.

Additionally I think It make sense that add does not append to source collection, because we loop add on each item of the source collection. This may be fixed by having a public add and private _add methods.

So what to do from now on?

A) One thing that can you can do now is to modify the source that you previously have given to FuzzySearch then set the dirty flag. I don't want the library to modify source, but user is free to do it.

Downside is that you'll re-compute index from scratch. It's not that expensive if it's not often.

B) Another option is to keep an array of source collection, instead of a single source collection. That's a natural approach to the problem each time we receive a list of possible results from the server we would be able to do an addSource with the new list.

Downside is duplicate over time, we won't use identify_item to tidy up the source (but we still do it for index). I might kill public method Add because the temporary nature of it is confusing.

C) I may maintain source as a shallow copy of original, then be free to update it as needed. In that scenario current Add become private and an new public Add is added that modify both source and index.

Downside is the you lose some sync property as described in (A)

D) I keep source as a pointer to original collection. I add an extra collection lateadd_source to collect more definitively anything late Add() - ed. This also mean making current Add private and exposing a different public Add that merge into lateadd_source as needed then call private _add. On "dirty" index is rebuilt from both source and lateadd_source

Not sure which behavior should clear this temp collection. Maybe setting a new source.

CC @tehsven for having made the add functionality.

tehsven commented 6 years ago

Hey @aguynamedben! I was able to get your example working by making keys an array instead of a hash. The third document was being indexed incorrectly because the add method assumes that the keys param is an array. Unfortunately, it does not include the parsing logic for the keys param, as does the constructor for the FuzzySearch engine, which allows it to be specified has a hash: https://github.com/jeancroy/FuzzySearch/blob/master/dist/FuzzySearch.js#L241

Changing this line:

const keys = {
  title: 'title',
  domain: 'domain',
};

to this:

const keys = ['title', 'domain'];

solves the problem:

Results for 'title:Item', round 1
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
There are 2 docs
Adding a doc
Doc added, now there are 3 docs, let's search 'em!
Results for 'title:Item', round 2
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
{_id: 3, title: "Item 3", domain: "item3.com"}

aguynamedben commented 6 years ago

I’m blown away by the quality of these responses. Thanks so much. I’ll dig into the details, design, and potential workarounds/solutions, and see what I find. I think this is enough information for now, thank you.

I’m more than willing to spend a few hours doing grunt work on the project in return. Thanks for helping me figure this out! 🙏

jeancroy commented 6 years ago

Maybe a good first step is to amend the Add method so it does not take keys as input. (Anyways I'm not sure the search would work properly if we give different keys for different items)

Instead, inside the Add method, use this.keys that has been pre-parsed by the constructor / option hook.

This would ripple up to _prepSource but since it's always called like so

this._prepSource(this.source, this.keys);

I would have no problem removing both arguments of that method. I think it's an extension point that I never really used. The discussion about making late added item more permanent (points A-D) would probably change the way we deal with source(s) too. I'm open to renaming the method to something about index building.

aguynamedben commented 6 years ago

Okay, I made a PR that incorporated many of the changes recommended here. Thanks for the guidance. It now works as expected in my code.

2 remaining questions:

Would the project be open to adding the MIT License? I can do it in my PR if that's acceptable.
To me, it makes sense to add new items to this.source when they're added with add(). The 4 options you gave for handling this.source upon an add() were: A. Modify this.source at your own risk, i.e. do what I'm saying, but do it yourself. B. Convert this.source to Array this.sources, allow users to add sources (i.e. batches of items) over time, and keep the added batches separate internally. C. Make a shallow copy of source (i.e. this.sourceInitial) on it's way in, then go ahead and allow add() to modify this.source. D. Keep this.source the original collection and add new items to this.source_lateadd.

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

Thanks again to the both of you for your guidance!

P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

jeancroy commented 6 years ago

I'm OK with MIT license. package.json already say so I'm not sure why it's not recognised by Github.

I'm also open to properly register this with npm & yarn etc...

I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

No performance problem, source is never used except to build cache.

Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.

Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.

If you ever have performance problem using add(), one thing I can see missing is idempotency. IE something that tell the index builder that the current entry is just fine, no need to recompute it.

To address this we can have a user defined same_item(a,b) together with identify_item(a). For example you may know that two items with the same Id will always have the same content. Because we keep pointer to the original source item (this.index[n].item) same can default to being the exact same object

With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update() method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff.

We can still provide add() as an utility function to manage source (and do a partial update on this entry). That would be consistent with your needs I think.

P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

When we are finished the warning on add should be removed. Ie current add become _add and the public add() will always be "safe" against index rebuild.

Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.

I'd make explicit in comment that add modify source.

aguynamedben commented 6 years ago

Okay, all of that makes a lot of sense, I’ll make another go at my PR tomorrow, check white space, snake_case consistent, and figure out add vs _add, maybe multiple sources. Thanks again for contributing and helping. On Tue, Dec 12, 2017 at 9:43 PM Jean Christophe Roy < notifications@github.com> wrote:

I'm OK with MIT license. package.json already say so https://github.com/jeancroy/FuzzySearch/blob/master/package.json#L28 I'm not sure why it's not recognised by Github.

I'm also open to properly register this with npm & yarn etc...

I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.

I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?

No performance problem, source is never used except to build cache.

Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.

Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.

If you ever have performance problem using add(), one thing I can see missing is idempotency. IE something that tell the index builder that the current entry is just fine, no need to recompute it.

To address this we can have a user defined same_item(a,b) together with identify_item(a). For example you may know that two items with the same Id will always have the same content. Because we keep pointer to the original source item (this.index[n].item) same can default to being the exact same object

With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update() method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff. We can still provide add() as an utility function to manage source (and do a partial update on this entry). That would be consistent with your needs I think.

P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.

When we are finished the warning on add should be removed. Ie current add become _add and the public add() will always be "safe" against index rebuild.

Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.

I'd make explicit in comment that add modify source.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jeancroy/FuzzySearch/issues/9#issuecomment-351289377, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGlQaAbhuF6YGcaVydPOE0_ADQbfivJks5s_2QIgaJpZM4Q4xJd .

aguynamedben commented 6 years ago

@jeancroy I updated my PR with the following changes:

Revert unnecessary readme.md whitespace changes I made
Change variable in source.js to snake_case so style is more consistent with project
Add LICENSE.md with MIT license the way GitHub recommends (I don't think they pick up on package.json)
Allow add() to simply push the new source_item onto this.source. I see the benefit of truly not caring what source is... that approach could allow a user to pass an object and a related identify_item function that knows how to operate on the object. For now I basically followed your approach of "focus on array sources" and punted on "provide some kind of hooks to deal with more complex cases". De we know if anybody is using this project with non-array sources? I can keep adding more complex instrumentation (i.e. same_item(), add_item that allows user to describe how to add an incoming source_item to source), but I wasn't sure if that was necessary at this point.
Add tests! Uses Mocha, Chai, and JSDom (because access to window is presumed, but this could go away eventually)

Thoughts?

PR here: https://github.com/jeancroy/FuzzySearch/pull/10

aguynamedben commented 6 years ago

PR #10 was merged and @jeancroy added some more sugar to it. Details on how to use add() will be added with https://github.com/jeancroy/FuzzySearch/pull/12/files. Closing for now.