Closed aguynamedben closed 6 years ago
I'll brainstorm a bit here.
So we maintain two collection source
and index
.
source
is the original collection and I like to keep it read only.index
is a "sliced & diced" version of source, where we precompute lowercase, split into words etc. We only copy indexed fields.dirty
is a state where index is out of sync with source. For example we set a new source, change indexed keys or change an option that affect indexing (like acronym)
We rebuild dirty index when we set new options OR during a search if needed.
The rebuild process clean the index
and call Add
on each item of the source
.
I haven't traced why your example doesn't work (while other works).
But at the root there might be a design issue.
We wanted the source to be read only, so calling Add()
manually only modify the index, which is only temporary and can be re-synced at any time. In fact making a search while having the dirty flag on will trigger a resync and thus clear late added data.
Additionally I think It make sense that add
does not append to source collection, because we loop add on each item of the source collection. This may be fixed by having a public add
and private _add
methods.
Downside is that you'll re-compute index from scratch. It's not that expensive if it's not often.
addSource
with the new list.Downside is duplicate over time, we won't use identify_item
to tidy up the source (but we still do it for
index
). I might kill public method Add
because the temporary nature of it is confusing.
source
as a shallow copy of original, then be free to update it as needed. In that scenario current Add
become private and an new public Add
is added that modify both source
and index
.Downside is the you lose some sync property as described in (A)
source
as a pointer to original collection.
I add an extra collection lateadd_source
to collect more definitively anything late Add()
- ed.
This also mean making current Add
private and exposing a different public Add
that merge into lateadd_source
as needed then call private _add
. On "dirty" index is rebuilt from both source and lateadd_source
Not sure which behavior should clear this temp collection. Maybe setting a new source.
CC @tehsven for having made the add functionality.
Hey @aguynamedben! I was able to get your example working by making keys
an array instead of a hash. The third document was being indexed incorrectly because the add
method assumes that the keys
param is an array. Unfortunately, it does not include the parsing logic for the keys
param, as does the constructor for the FuzzySearch engine, which allows it to be specified has a hash: https://github.com/jeancroy/FuzzySearch/blob/master/dist/FuzzySearch.js#L241
Changing this line:
const keys = {
title: 'title',
domain: 'domain',
};
to this:
const keys = ['title', 'domain'];
solves the problem:
Results for 'title:Item', round 1
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
There are 2 docs
Adding a doc
Doc added, now there are 3 docs, let's search 'em!
Results for 'title:Item', round 2
{_id: 1, title: "Item 1", domain: "item1.com"}
{_id: 2, title: "Item 2", domain: "item2.com"}
{_id: 3, title: "Item 3", domain: "item3.com"}
I’m blown away by the quality of these responses. Thanks so much. I’ll dig into the details, design, and potential workarounds/solutions, and see what I find. I think this is enough information for now, thank you.
I’m more than willing to spend a few hours doing grunt work on the project in return. Thanks for helping me figure this out! 🙏
Maybe a good first step is to amend the Add
method so it does not take keys
as input.
(Anyways I'm not sure the search would work properly if we give different keys for different items)
Instead, inside the Add
method, use this.keys
that has been pre-parsed by the constructor / option hook.
This would ripple up to _prepSource
but since it's always called like so
this._prepSource(this.source, this.keys);
I would have no problem removing both arguments of that method. I think it's an extension point that I never really used. The discussion about making late added item more permanent (points A-D) would probably change the way we deal with source(s) too. I'm open to renaming the method to something about index building.
Okay, I made a PR that incorporated many of the changes recommended here. Thanks for the guidance. It now works as expected in my code.
2 remaining questions:
this.source
when they're added with add()
. The 4 options you gave for handling this.source
upon an add()
were:
A. Modify this.source
at your own risk, i.e. do what I'm saying, but do it yourself.
B. Convert this.source
to Array this.sources
, allow users to add sources (i.e. batches of items) over time, and keep the added batches separate internally.
C. Make a shallow copy of source
(i.e. this.sourceInitial
) on it's way in, then go ahead and allow add()
to modify this.source
.
D. Keep this.source
the original collection and add new items to this.source_lateadd
.I guess my question is this: Why can't add()
simply add to this.source()
? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?
Thanks again to the both of you for your guidance!
P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.
I'm OK with MIT license. package.json already say so I'm not sure why it's not recognised by Github.
I'm also open to properly register this with npm & yarn etc...
I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.
I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?
No performance problem, source is never used except to build cache.
Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.
Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.
If you ever have performance problem using add()
, one thing I can see missing is idempotency.
IE something that tell the index builder that the current entry is just fine, no need to recompute it.
To address this we can have a user defined same_item(a,b)
together with identify_item(a)
.
For example you may know that two items with the same Id will always have the same content.
Because we keep pointer to the original source item (this.index[n].item
) same can default to being the exact same object
With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update()
method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff.
We can still provide add()
as an utility function to manage source (and do a partial update on this entry).
That would be consistent with your needs I think.
P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.
When we are finished the warning on add should be removed. Ie current add
become _add
and the public add() will always be "safe" against index rebuild.
Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.
I'd make explicit in comment that add modify source.
Okay, all of that makes a lot of sense, I’ll make another go at my PR tomorrow, check white space, snake_case consistent, and figure out add vs _add, maybe multiple sources. Thanks again for contributing and helping. On Tue, Dec 12, 2017 at 9:43 PM Jean Christophe Roy < notifications@github.com> wrote:
I'm OK with MIT license. package.json already say so https://github.com/jeancroy/FuzzySearch/blob/master/package.json#L28 I'm not sure why it's not recognised by Github.
I'm also open to properly register this with npm & yarn etc...
I'm OK with cleaning white space, although I think for markdown it is significant (ie: it force a linebreak) I'd have to double check the formatting after merging.
I guess my question is this: Why can't add() simply add to this.source()? Does that cause performance problems? Is it because Option B is cheaper in the case that only certain subsets of the data are updated? I.e. you can rebuild the index partially instead of from scratch?
No performance problem, source is never used except to build cache.
Basically I don't want to care what source is, as long as you give me instruction on how to read it (keys) I don't want to care if I have exclusive access to that collection or it's bound to something else not search related. To me, keeping the source read-only simplify a lot of concerns.
Another way to approach the problem is to focus on array sources, and provide some kind of hooks to deal with more complex cases.
If you ever have performance problem using add(), one thing I can see missing is idempotency. IE something that tell the index builder that the current entry is just fine, no need to recompute it.
To address this we can have a user defined same_item(a,b) together with identify_item(a). For example you may know that two items with the same Id will always have the same content. Because we keep pointer to the original source item (this.index[n].item) same can default to being the exact same object
With such a system I think the most elegant solution is then to let the user manage the source, and we provide an update() method that do the bare minimum of work to update index. A bit in the spirit of a React dom diff. We can still provide add() as an utility function to manage source (and do a partial update on this entry). That would be consistent with your needs I think.
P.S. Feel free to be super hard on my pull request, I'm open to whatever criticism you can offer.
When we are finished the warning on add should be removed. Ie current add become _add and the public add() will always be "safe" against index rebuild.
Also, for the better or the worst, snake_case was used for this project. sourceItem looks a bit weird in the sea of lowercase & underscore. I think I was doing a lot of python at the time.
I'd make explicit in comment that add modify source.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jeancroy/FuzzySearch/issues/9#issuecomment-351289377, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGlQaAbhuF6YGcaVydPOE0_ADQbfivJks5s_2QIgaJpZM4Q4xJd .
@jeancroy I updated my PR with the following changes:
add()
to simply push
the new source_item
onto this.source
. I see the benefit of truly not caring what source
is... that approach could allow a user to pass an object and a related identify_item
function that knows how to operate on the object. For now I basically followed your approach of "focus on array sources" and punted on "provide some kind of hooks to deal with more complex cases". De we know if anybody is using this project with non-array sources? I can keep adding more complex instrumentation (i.e. same_item()
, add_item
that allows user to describe how to add an incoming source_item
to source
), but I wasn't sure if that was necessary at this point.window
is presumed, but this could go away eventually)Thoughts?
PR #10 was merged and @jeancroy added some more sugar to it. Details on how to use add() will be added with https://github.com/jeancroy/FuzzySearch/pull/12/files. Closing for now.
Hi, like others, I can't believe not more people are aware of this library. I love it. Thanks you for working on it. I'm happy to contribute to the project if this is a bug that can be fixed or better docs are helpful.
I'm trying to use
FuzzySearch.prototype.add()
to add new documents to mysearcher
after initial index creation. After callingsearcher.add()
, I introspectsearcher.index.length
andsearcher.nb_indexed
and can see that the document has been added to the index. But when I callsearcher.search()
, the expected result is missing.Here's my test code:
Console output:
The expected result is that the Item 3 document would also show in the results, but it doesn't
Route 1
I noticed that there is a
FuzzySearch.prototype.dirty
property, and I've tried settingsearcher.dirty = true;
aftersearch.add()
but that doesn't fix the problem, and in fact it resets searcher.nb_indexed back to 2 if I do that. Maybe there is an issue where settingssearcher.dirty = true;
causes a refresh of the index based on original data, and documents added are lost?Route 2
Maybe I'm just not adding the documents in the right way?
Other explorations
The issue doesn't seem to be related to the usage of Tagged Search or
identify_item
.Any hints or ways you could point me are much appreciated. I'm happy to hunt things down. Thanks again for building this great library.