Discard document as an alternative to remove

The remove method undoes the effects of add, removing a document from all the internal data structures, so that it won't appear in later searches. From the point of view of memory, it is a clean way to remove a document, because it brings the internal structures to the same state as before the document was added.

Unfortunately, remove is sometimes cumbersome to use, mostly because it takes the whole document as argument. This gets particularly cumbersome when updating a document: in this case, the old version of the document has to be kept around to be removed, then the new version can be added. For a discussion on why passing the whole document is necessary, see for example this comment. Examples of issues related to this are #180, #45, and #22.

This pull request implements an alternative method called discard, with different trade-offs. The visible effect of calling discard is the same as remove: later searches will not return the document as a result. The difference is that discard takes only the document ID as argument, making it more ergonomic in the situations described above, and marks the current version of the document as discarded, so it will not appear in search results. On one hand this is faster and more ergonomic than remove, on the other hand the inverted index is not modified, thus memory is not immediately released.

const miniSearch = new MiniSearch({ fields: ['text'] })
const documents = [
  { id: 1, text: 'Some stuff' },
  { id: 2, text: 'Some more stuff' }
]

// Remove takes the full original document:
miniSearch.remove({ id: 1, text: 'Some stuff' })

// The equivalent with discard only needs the ID:
miniSearch.discard(1)

Also, a utility method replace is provided, which replaces an existing document with an updated version by internally calling discard and then add:

// Before, the only option was to remove and add:
miniSearch.remove({ id: 2, text: 'Some more stuff' })
miniSearch.add({ id: 2, text: 'Some more interesting stuff' })

// Now one can use replace, simply providing the updated document:
miniSearch.replace({ id: 2, text: 'Some more interesting stuff' })

Memory bloating caused by repetitive use of discard or replace is prevented by two mechanisms: clean up upon search, and vacuuming.

The first means that upon encountering discarded documents during a search, apart from skipping them, MiniSearch will also remove references to them from the inverted index for the search terms, partially releasing memory. This, though, only cleans up the index for terms that have been searched at least once after discarding a document, not for the full set of terms in the discarded document.

To completely clean up discarded documents from the inverted index, a vacuuming process is performed, either automatically (once the number of discarded documents reaches a threshold), or manually by calling the vacuum method. Vacuuming traverses all terms in the inverted index and cleans up discarded documents, releasing memory and returning a promise that resolves when the clean up is completed. Traversing all terms can take time on large indexes, therefore the work is performed in batches, with a delay between each batch to avoid blocking the thread for too long (the batch size and delay are configurable).

In summary, depending on the situation, remove or discard can be preferable:

remove releases memory immediately, and does not incur in the overhead of vacuuming, but requires the original document to be passed
discard is faster and only needs an ID, but relies on vacuuming to eventually release memory.

The pull request also adds a new internal structure that maps document IDs to internal "short IDs" used to identify document versions. While this adds a (usually small) memory usage increase, it also makes it possible to improve overall ergonomics:

The add method now throws an error in case of duplicated IDs (see issues #101 and #111 ).
The remove method is slightly faster, as it does not perform a linear search for the internal ID anymore.
The pull request also introduces a has(id) method to test if a document with the given ID was added to the index (solving, for example, #175 ).

lucaong / minisearch

Discard document as an alternative to remove #183