lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.67k stars 133 forks source link

Handling duplicate race conditions with `addAllAsync` #206

Closed acnebs closed 1 year ago

acnebs commented 1 year ago

Hi, was just wondering if there was a recommended way to handle "race conditions" when using the addAllAsync() method (or I suppose when using addAll but in separate async defined functions). Rather than try to explain too much, here is a repro of what I'm talking about.

import MiniSearch from "minisearch";
import { v4 } from "uuid";

const ms = new MiniSearch({
  fields: ["name"],
  storeFields: ["name"],
  searchOptions: {fuzzy: 0.8},
});

const genName = () => v4().split("-").join(" ");

const SET_SIZE = 1000;
const docs1 = [];
const docs2 = [];
let i = 0;
while (i < SET_SIZE) {
  const nid = v4();
  docs1.push({ id: nid, name: genName() });
  if (i % 11 == 0) docs2.push({ id: nid, name: genName() });
  else docs2.push({ id: v4(), name: genName() });
  i++;
}

// `addAllAsync` hard to coordinate
async function addSafelyAsync(docs) {
  docs.forEach((doc) => {
    if (ms.has(doc.id)) ms.discard(doc.id);
  });
  await ms.addAllAsync(docs);
}

ms.addAllAsync(docs1);
addSafelyAsync(docs2);
setTimeout(() => console.log(ms.search("cd").slice(0, 10)), 1000);

// though easy to encounter with `addAll` as soon as the event loop is involved
const ms2 = new MiniSearch({
  fields: ["name"],
  storeFields: ["name"],
  searchOptions: {fuzzy: 0.8},
});

function addSafely(docs) {
  docs.forEach((doc) => {
    if (ms2.has(doc.id)) ms2.discard(doc.id);
  });
  ms2.addAll(docs);
}

setTimeout(() => ms2.addAll(docs1), 10)
addSafely(docs2)

// sanity check search
setTimeout(() => console.log(ms2.search("cd").slice(0, 10)), 1000);

As you can see, I try to make a method called addSafely which iterates through the docs I'm trying to add and removes them first, but of course this doesn't work if addAllAsync is being used, or if addAll are coming in from different parts of the event loop. Is there a way to solve this in userland? Seems like something that realistically needs to be handled at the library level.

Thanks!

lucaong commented 1 year ago

Hi @acnebs , in general, consider that addAllAsync is generally slower than addAll (which is just a convenience method over add), and it is only really needed when indexing documents can block the main thread for too long.

That said, it might make sense to add methods addOrReplaceAll and addOrReplaceAllAsync to MiniSearch for this (I'll think about it). In the meanwhile, you can implement something like that yourself as:

const addOrReplaceAll = (documents) => {
  for (const document of documents) {
    if (ms.has(document.id)) { ms.discard(document.id) }
    this.add(document)
  }
}

const addOrReplaceAllAsync = (documents, options) => {
  const chunkSize = (options || {}).chunkSize || 10
  const acc = { chunk: [], promise: Promise.resolve() }

  const { chunk, promise } = documents.reduce(({ chunk, promise }, document, i) => {
    chunk.push(document)
    if ((i + 1) % chunkSize === 0) {
      return {
        chunk: [],
        promise: promise
          .then(() => new Promise(resolve => setTimeout(resolve, 0)))
          .then(() => {
            addOrReplaceAll(chunk)
          })
      }
    } else {
      return { chunk, promise }
    }
  }, acc)

  return promise.then(() => this.addOrReplaceAll(chunk))
}

I did not test this code, so please forgive possible bugs.

acnebs commented 1 year ago

Thanks for the response! I guess I should've dug into the actual source code a bit more, for some reason I assumed that addAll was not just a convenience wrapper for add – thought there might be some performance benefit to using it for many documents so that the algo complexity is not just O(n).

It seems like there is still the possibility of a race condition with the current system, as the has + discard check and the actual adding of the doc is not overall an atomic operation, but in practice that probably is a non-issue.

lucaong commented 1 year ago

Hi @acnebs , yeah, technically has + discard is not atomic, but on usual single-threaded JavaScript runtimes this won't matter, because nothing can occur between them. If it did matter, all sort of operations would be unsafe in JavaScript 🙂

As for addAll, yeah there is really no optimization over looping through add (I don't think that any meaningful optimization is possible there). Whereas removeAll does optimize for the case when it's called without arguments (but that is not relevant here).

lucaong commented 1 year ago

I will close the issue for now, as the question should be answered, but feel free to comment further, and I will reopen it if necessary.