remove is quite brittle

maccman commented 1 year ago

For my use-case I'm re-indexing documents whenever they change.

If I call .add() with the same documents I'm getting duplicates.

So I'm calling remove() before adding the document.

But sometimes the document isn't in the index. I think remove() should return a boolean instead of throwing.

Otherwise I have to keep track of which documents I've indexed.

lucaong commented 1 year ago

Hi @maccman , I understand your point, and agree that from the application developer perspective remove could be made more ergonomic. That said, so far this was a design choice. The reason is that, in order to detect if a document was already in the index, MiniSearch would have these alternatives:

Either keep a set of document IDs that are in the index, but this would use more memory, even for the frequent cases when that is not needed. Memory compactness is quite important for MiniSearch to run comfortably on the client side.
Or use a linear search through all the indexed IDs (for example using the mapping between internal integer ID and "external" ID), which would introduce a possible performance bottleneck.
Or leave it to the application developer to keep track of which documents are in the index.

So far, MiniSearch has preferred the third option above. The reason is that it is usually not too complicated to do from the application developer perspective, and the benefit is relevant for all frequent cases when that is not needed.

That said, in your case it does get cumbersome. I am thinking of possible alternatives. One would be to add a hasDocument method, returning a boolean. Such method would have to be implemented with a linear search, but that could be documented, and honestly in most cases that's acceptable.

In theory, we could even add a constructor option to let the application developer decide whether to keep the set of indexed documents or not: if so, the hasDocument method would get more efficient, at the expense of higher memory utilization. I am not a big fan of adding too many options, but this sounds reasonable. I want to think about it a bit though.

lucaong commented 1 year ago

Ah sorry, I was mistaken: we already perform a linear search in remove. Therefore, it is ultimately a matter of taste about what to do when a document is not found in the index, whether to raise an error or to just return false.

Raising an error is, in my opinion, more appropriate: usually, attempting to delete a document that is not in the index indicates a bug in the application code, and it is therefore more useful to the application developer to raise an exception, so the problem surfaces. Returning false would amount to silently failing, hiding possible bugs.

If in a specific application, like yours, attempting to remove a document that is not in the index should be considered a normal occurrence and not an exception, it is perfectly fine to catch the error as you do.

In sum, I would keep the current API, and I think that your solution of catching the exception is perfectly admissible. Adding support for a hasDocument method like said above might also help, as you could check before attempting the deletion, but has the caveats explained above.

maccman commented 1 year ago

What do you think about 'add' doing an upsert? That would also get around the issue.

I can't think of a scenario where you'd want duplicates.

maccman commented 1 year ago

For reference, this is how I'm having to use MiniSearch:

import toPlainObject from 'lodash/toPlainObject'
import MiniSearchBase, {Options} from 'minisearch'

export class MiniSearch<T = any> extends MiniSearchBase<T> {
  documentMap: Map<string, T> = new Map()

  add(document: T): void {
    const {extractField, idField} = this._options
    const id = extractField(document, idField)

    const existingDocument = this.documentMap.get(id)

    if (existingDocument) {
      this.remove(existingDocument)
    }

    const documentClone = {...document}
    this.documentMap.set(id, documentClone)
    super.add(documentClone)
  }

  remove(document: T): void {
    const {extractField, idField} = this._options
    const id = extractField(document, idField)

    const existingDocument = this.documentMap.get(id)

    if (existingDocument) {
      this.documentMap.delete(id)
      super.remove(existingDocument)
    }
  }

  toJSON(): any {
    return {
      ...super.toJSON(),
      documentMap: toPlainObject(this.documentMap),
    }
  }

  static loadJS<T = any>(js: any, options: Options<T>): MiniSearch<T> {
    const miniSearch = new MiniSearch(options)
    miniSearch.documentMap = new Map(Object.entries(js.documentMap))
    return miniSearch
  }
}

lucaong commented 1 year ago

@maccman it might be useful to add that I am working on a pull request that would provide an alternative to remove that is more ergonomic, but with different trade offs. The pull request also includes a new has(id) method to test if a document with the given ID is present in the index or not, which would solve your issue.

lucaong commented 1 year ago

@maccman the feature is now released as v6.0.0-beta.1 (soon to be released as v6.0.0). Also the other issue you found with subclassing MiniSearch and overriding loadJS is fixed in the same release.

lucaong / minisearch

remove is quite brittle #175