lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.67k stars 133 forks source link

Discard document as an alternative to remove #183

Closed lucaong closed 1 year ago

lucaong commented 1 year ago

The remove method undoes the effects of add, removing a document from all the internal data structures, so that it won't appear in later searches. From the point of view of memory, it is a clean way to remove a document, because it brings the internal structures to the same state as before the document was added.

Unfortunately, remove is sometimes cumbersome to use, mostly because it takes the whole document as argument. This gets particularly cumbersome when updating a document: in this case, the old version of the document has to be kept around to be removed, then the new version can be added. For a discussion on why passing the whole document is necessary, see for example this comment. Examples of issues related to this are #180, #45, and #22.

This pull request implements an alternative method called discard, with different trade-offs. The visible effect of calling discard is the same as remove: later searches will not return the document as a result. The difference is that discard takes only the document ID as argument, making it more ergonomic in the situations described above, and marks the current version of the document as discarded, so it will not appear in search results. On one hand this is faster and more ergonomic than remove, on the other hand the inverted index is not modified, thus memory is not immediately released.

const miniSearch = new MiniSearch({ fields: ['text'] })
const documents = [
  { id: 1, text: 'Some stuff' },
  { id: 2, text: 'Some more stuff' }
]

// Remove takes the full original document:
miniSearch.remove({ id: 1, text: 'Some stuff' })

// The equivalent with discard only needs the ID:
miniSearch.discard(1)

Also, a utility method replace is provided, which replaces an existing document with an updated version by internally calling discard and then add:

// Before, the only option was to remove and add:
miniSearch.remove({ id: 2, text: 'Some more stuff' })
miniSearch.add({ id: 2, text: 'Some more interesting stuff' })

// Now one can use replace, simply providing the updated document:
miniSearch.replace({ id: 2, text: 'Some more interesting stuff' })

Memory bloating caused by repetitive use of discard or replace is prevented by two mechanisms: clean up upon search, and vacuuming.

The first means that upon encountering discarded documents during a search, apart from skipping them, MiniSearch will also remove references to them from the inverted index for the search terms, partially releasing memory. This, though, only cleans up the index for terms that have been searched at least once after discarding a document, not for the full set of terms in the discarded document.

To completely clean up discarded documents from the inverted index, a vacuuming process is performed, either automatically (once the number of discarded documents reaches a threshold), or manually by calling the vacuum method. Vacuuming traverses all terms in the inverted index and cleans up discarded documents, releasing memory and returning a promise that resolves when the clean up is completed. Traversing all terms can take time on large indexes, therefore the work is performed in batches, with a delay between each batch to avoid blocking the thread for too long (the batch size and delay are configurable).

In summary, depending on the situation, remove or discard can be preferable:

The pull request also adds a new internal structure that maps document IDs to internal "short IDs" used to identify document versions. While this adds a (usually small) memory usage increase, it also makes it possible to improve overall ergonomics:

maccman commented 1 year ago

This is a great way of solving the problem. Thank you.