lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.64k stars 133 forks source link

Can `loadJSON` be added as an instance method which merges indices? #247

Closed acnebs closed 1 month ago

acnebs commented 7 months ago

I'm having an issue where my index is getting periodically garbage-collected by the system, without the rest of my code doing so. So I end up in a situation where I'm expecting an index to be there, but it is not. I have implemented logic to check for when the index is smaller than I expect and re-index items, however this can be quite performance intensive depending on the size of the indexed content.

So I am also persisting the index to IndexedDb with JSON.stringify(minisearchInstance) and then loading it first. However Minisearch.loadJSON is a constructor, so I need to entirely replace my instance. However there are some instances when I have already indexed new items into my fresh index before I have finished deserializing my persisted index, at which point those documents are lost.

Would it be possible to make loadJSON an instance method of a MiniSearch instance which merged the deserialized index with whatever is already in the instance to address such a use case?

lucaong commented 7 months ago

Hi @acnebs , I need to think more deeply about it, but it is probably not possible to merge two indexes more efficiently than in a reindex.

That said, the index should never be garbage collected if your MiniSearch instance is still available. Under the hood, the index is referenced by a simple _index property in the MiniSearch instance object, so as long as the instance is reachable by live code, the garbage collector will not collect the index. I suspect something else might be going on in your app, and I think it is better to fix the root issue. Could you explain more about the issue? What do you observe when the index disappears?

Without knowing much about your system and issue, some question that could maybe point in the right direction:

acnebs commented 7 months ago

This is in the context of a capacitor app, and mostly happens on iOS.

The app itself creates a new instance of minisearch when the app first loads, indexing all the items. If the user leaves the app for a while, iOS eventually clears everything from memory. When you go back to the app though, it seems like Apple is doing some trickery (or capacitor is) so that the entire web app does not reload -- that initializing step which inits a class-based wrapper (as a singleton) around minisearch does not run.

But there are no items in the minisearch index anymore - they have been cleared.

lucaong commented 7 months ago

I see, thanks @acnebs for the explanation.

I would advise trying to figure out how to re-initialize the app: the MiniSearch instance and the index are ultimately made of plain JavaScript objects (although many nested ones for the index), so if iOS or Capacitor clears memory and does not re-initialize, in principle what you observe could happen with any other piece of state too.

That said, I will think if I can expose a method to efficiently load a serialized index on top of what is already indexed. Without having thought much about it, it does not sound promising, but I don't want to exclude it right away.

Meanwhile, one thing you could do on the application side, is to re-index only the documents that were added after the last time the index was persisted. For example, you could:

Note that this strategy won't work if you also perform deletions, but that's also the case with an "index merging" feature. If you also delete documents, you might need to keep track of which documents you added or deleted since the last serialization, and perform such updates on top of the serialized index upon recovery.

As you can see, it can get messy, so ideally you would find how to force proper re-initialization of the app.