Closed: olivernn closed this issue 11 years ago
Is it a good idea to have to serialize the entire index on every change? How would that scale if you have a large index?
The way to resolve that might be to separate the concepts of 'add' and 'commit': that way you'd have more control over when the write operations happen.
Instead of making the events work on a per-document basis, they could be organised in a way that supports persisting the index change by change, without needing to serialize the whole index each time.
If it only works by serializing the whole index, that will prevent lunr from being used in scenarios where the amount of data makes that impractical.
One way around that might be to structure the index so you can keep track of changes, and have a commit operation that saves back only the modifications, as in the sketch below.
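A minimal sketch of that add/commit split, assuming a hypothetical ChangeLog wrapper around a lunr index and a store with an append method (none of these names exist in lunr itself, and documents are assumed to have an id field):

function ChangeLog (idx) {
  this.idx = idx
  this.pending = [] // changes recorded since the last commit
}

ChangeLog.prototype.add = function (doc) {
  this.idx.add(doc)
  this.pending.push({ op: 'add', ref: doc.id, doc: doc })
}

ChangeLog.prototype.remove = function (doc) {
  this.idx.remove(doc)
  this.pending.push({ op: 'remove', ref: doc.id })
}

ChangeLog.prototype.commit = function (store) {
  // write only the recorded changes, not the whole index
  this.pending.forEach(function (change) { store.append(change) })
  this.pending = []
}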
Yeah, that would be a very good thing, especially when using lunr on the server side. :)
The storing won't happen in lunr at all; these events are just to give users the hooks they need to implement their own storage.
You raise some valid points about having to serialise the whole index on each change; for some this won't be a problem, but for larger indexes it could be an excessive overhead. It'd be good to see some benchmarks of serialisation to be sure though.
@garysieling makes a good point that whatever hooks into these events could be a bit smarter about how it actually stores the index.
Another way would be to store many smaller indexes and merge them together again when loading a previously serialised index, see #29 for more details.
My aim with this feature is to provide the required hooks to be able to store the index. A more sophisticated solution can, and probably should, be built on top of these basic events.
It would be great if there was an easy way of persisting lunr index data incrementally to IndexedDB. I'm trying to use it inside a shared worker, persisting to localStorage, but it doesn't seem to scale well as the data volume grows.
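A rough sketch of what incremental persistence to IndexedDB could look like using the proposed events, assuming documents carry an id field (the database name and schema here are invented):

var req = indexedDB.open('lunr-docs', 1)

req.onupgradeneeded = function (e) {
  // create a store for indexed documents, keyed by their id
  e.target.result.createObjectStore('docs', { keyPath: 'id' })
}

req.onsuccess = function (e) {
  var db = e.target.result

  idx.on('add', function (doc) {
    db.transaction('docs', 'readwrite').objectStore('docs').put(doc)
  })

  idx.on('remove', function (doc) {
    db.transaction('docs', 'readwrite').objectStore('docs').delete(doc.id)
  })
}

On startup, the stored documents can be read back and re-added to a fresh index, which avoids serialising the whole index at once.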
Another useful approach would be to allow users of the API to specify their own storage backend for the index. This could mean, for example, passing in an object with the necessary methods (e.g. setValue, getValues, etc.) to store the inverted index; the index would call these whenever it changes or needs data. The default storage backend could use IndexedDB in the browser, and server-side users could implement their own.
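A sketch of what such a backend object might look like, using the setValue/getValues names suggested above (the interface is hypothetical, not part of lunr):

// in-memory reference implementation of the proposed interface
var memoryBackend = {
  data: {},
  setValue: function (token, postings) {
    this.data[token] = postings // persist one posting list
  },
  getValues: function (token) {
    return this.data[token] || [] // fetch postings on demand
  }
}

// an IndexedDB or server-side backend would implement the same two
// methods, and the index itself would stay agnostic of the storage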
Simple storage events have been added in the latest release of lunr. This allows you to be notified of changes to the index, e.g.
idx.on('add', function (doc) {
// do something here
})
idx.on('update', function (doc) {
// do something here
})
idx.on('remove', function (doc) {
// do something here
})
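For example, here is a naive persistence hook that snapshots the whole index to localStorage on every change (fine for small indexes, but subject to the serialisation overhead discussed above; it assumes the index serialises with JSON.stringify and can be restored with lunr.Index.load):

var persist = function () {
  localStorage.setItem('lunr-index', JSON.stringify(idx))
}

idx.on('add', persist)
idx.on('update', persist)
idx.on('remove', persist)

// later, restore the snapshot into a new index
var saved = localStorage.getItem('lunr-index')
if (saved) {
  var restored = lunr.Index.load(JSON.parse(saved))
}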
Adding storage events to a lunr.Index would make it easy to snapshot an index to some storage location, whether that be localStorage in the browser, or a file or some other database on the server. I think lunr would have to emit three events: add, update and remove. This should give users enough hooks to maintain a persisted copy of their index. It might work like this:
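For example, with a hypothetical store object whose write and remove methods stand in for localStorage, a file, or a database:

idx.on('add', function (doc) { store.write(doc.id, doc) })
idx.on('update', function (doc) { store.write(doc.id, doc) })
idx.on('remove', function (doc) { store.remove(doc.id) })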
The callback signature would be the same for the add, update and remove events.
The way update is implemented, first removing a document and then re-adding it, means some special care will be needed to make sure that only an update event is fired, rather than a remove and then an add event, but this should be simple enough.
To support this, all three methods (add, update and remove) could take an argument that prevents any events from being emitted; this might also be useful when doing a bulk load, as in the sketch below.
I don't think event handlers should be serialised, so when loading an index the event handlers would have to be re-added.
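A sketch of how that event-suppressing argument might be used for a bulk load, assuming add accepts an optional flag as its second parameter (the exact signature is still to be decided):

documents.forEach(function (doc) {
  idx.add(doc, false) // suppress the 'add' event during bulk load
})

// persist once at the end instead of once per document
localStorage.setItem('lunr-index', JSON.stringify(idx))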