lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.67k stars 133 forks source link

create a removeById function #180

Closed adamdallis closed 1 year ago

adamdallis commented 1 year ago

Hi, I am building a browser extension and I index 3rd content, currently to update a document, I need to remove it first and then readd it. Although that approach has no issues I need to store the original document in order to be able to remove it. I would store half the data if i could just remove the document by id and not by the whole document itself.

thank you

lucaong commented 1 year ago

Hi @adamdallis , Thanks for your comment. I agree that a removeById feature would be very useful. Unfortunately, implementing it inside of MiniSearch has negative impacts on some of the design goals.

In particular, in order to implement a removeById feature, MiniSearch would have to keep a copy of all the documents that were added in the index. That's because de-indexing a document involves changing the inverted index to remove the deleted document from all the terms that point to it, so having the original document (or at least its tokens) is necessary. Pointing to the original documents is not enough, because the original document might change in place, breaking the removal. Search servers designed without the goal of running in the client can do that, because they can rely on disk to store the additional data, but for MiniSearch that would cause a much bigger memory utilization.

There are two possible approaches in my view:

The first, which is what MiniSearch does currently, is to only expose a remove(document) function, and let the application developer implement removal/update by ID, when necessary. For example, here is a scaffold implementation:

const documentById = {}

const miniSearch = new MiniSearch({ fields: ['text'] })

const addDocument = (document) => {
  if (documentById[document.id] != null) {
    throw "Document already exists"
  }

  documentById[document.id] = document
  miniSearch.add(document)
}

const updateDocument = (id, updates) => {
  const oldDocument = removeDocument(id)
  const updatedDocument = Object.assign(oldDocument, updates)
  addDocument(updatedDocument)
}

const removeDocument = (id) => {
  if (documentById[document.id] == null) {
    throw "No such document"
  }

  const doc = documentById[id]
  miniSearch.remove(doc)
  delete documentById[id]
  return doc
}

const searchDocument = (query) => {
  const results = miniSearch.search(query)
  return results.map((result) => documentById[result.id])
}

As long as the application only uses the helper functions and never updates documents in place, all would be working well (that's more or less the approach used in react-minisearch, which does provide a removeById function). Since this is reasonably simple to implement when needed, and does not negatively impact use cases when that is not necessary, this was the design choice so far.

Note that if MiniSearch would implement the same, it would have to copy the original documents, because MiniSearch cannot not control the code path that updates documents, and therefore cannot avoid in-place updates.

The other option would be to implement removeById by keeping a list of deleted documents, which would be removed from results. This is technically possible, even allowing updates (MiniSearch uses internal IDs that are different from the application IDs), but would degrade performance as more and more documents are deleted or updated: the inverted index would still keep the outdated entries, which would be discarded at search time.

I am considering such an option for a possible future major version, but as you can imagine it is not trivial to design a good API, and an implementation that would not suffer from obvious drawbacks, and would not inconvenience the frequent use cases that do not need deletion.

I hope this explains that.

lucaong commented 1 year ago

Hi @adamdallis , I am working on a pull request that would add a discard method, taking only the ID (like the removeById that you proposed) and producing the same visible effects as remove. The trade-off is that discard does not immediately release memory, and relies instead on a "vacuuming" of the index (triggered automatic by default once the number of discarded documents reaches a certain threshold).

You can read about the trade-offs in the pull request. If I decide to go for it, this change could be part of the next major release.

lucaong commented 1 year ago

@adamdallis the discard method (and related discardAll and replace methods), which works like remove but takes a document ID, is now released on NPM as v6.0.0-beta.1, and will be part of the upcoming v6.0.0 (see the documentation for more information). I am therefore closing this issue, but feel free to comment further if necessary.