typesense / firestore-typesense-search

Firebase Extension to automatically push Firestore documents to Typesense for full-text search with typo tolerance, faceting, and more
https://extensions.dev/extensions/typesense/firestore-typesense-search
Apache License 2.0
150 stars 27 forks source link

Out-of-order function triggers result in data inconsistency between Firestore -> Typesense #32

Closed rosslavery closed 1 year ago

rosslavery commented 2 years ago

Description

As noted in the Firebase functions documentation, background function triggers are not guaranteed to be received in the order the changes occured.

When rapid changes occur in the database, out-of-order events will result in out-of-order processing, meaning Typesense can end up with stale data in it which is not consistent with the final Firestore resting state after the influx of changes.

From all of the reading I've done, it is a fools-errand to try and fight the system to process them in order -- nevertheless without a solution, extensions like this are difficult to use in production environments.

Steps to reproduce

  1. Write a demo script / perform an app action rapidly that writes to the database
  2. Observe that the events are received out of order
  3. Quite often, this will result in Firestore containing the wrong data from an old event

Expected Behavior

Regardless of the order the events are received, only events newer than the last-processed event should be processed.

Actual Behavior

Inconsistent data in Typesense

Other Findings / Info

In trying to solve for this, I've attempted to use a Firestore collection to "lock" the item being updated, send to Typesense, and then unlock for future events after writing event context back to the lock endpoint.

Simplified example (RTDB trigger used here, but Firestore is affected by this as well):

.database.ref(`tasks/{itemId}`)
.onWrite(async (change, context) => {
  /**
   * We use a Firestore transaction to create a "lock" at a DB location
   * for a given `itemId`
   */
    const { timestamp, eventId } = context;
    const { itemId } = context.params;
    const timestampRef = firestore
      .collection(`typesenseLocks_tasks`)
      .doc(itemId);
    await admin.firestore().runTransaction(async transaction => {
      const dataChangedTimestamp = new Date(timestamp).getTime();
      const lastUpdatedTimestampDoc = await transaction.get(timestampRef);
      const lastUpdatedData = lastUpdatedTimestampDoc.data();

      /**
       * If this is the first time this document was changed (no previous locks),
       * or the last-stored lock timestamp is older than the current event's timestamp,
       * prepare a payload and send to Typesense.
       */
      if (
        (!lastUpdatedData?.timestamp ||
          dataChangedTimestamp > lastUpdatedData.timestamp) &&
        lastUpdatedData?.eventId !== eventId
      ) {
        // Send to Typesense
        await updateTypesense(change, indexer, itemId);

        // Finalize Transaction by writing the event timestamp and eventId back to Firestore for comparison by the next events
        transaction.set(timestampRef, {
          timestamp: dataChangedTimestamp,
          eventId
        });
      } else {
        /**
         * Do nothing, current event is older than last-indexed event already recorded, can be safely discarded
         */
      }
    });

In an attempt to fix this, I've tried moving the await updateTypesense(change, indexer, itemId); line inside and outside of the transaction function block, but in both scenarios stale data ends up inside Typesense.

From logs I've added, the above transaction appears to work, old events are dropped if a newer one has already been processed. But if two events that are extremely close together are processed in the correct order, Typesense occasionally ends up with the first event's payload.

This led me to believe that perhaps Typesense was parallelizing indexing operations, but since they are confirmed as serialized, I'm back at square one.

Metadata

Typsense Version: v0.23 OS: Typesense Cloud Relevant Slack Thread: https://typesense-community.slack.com/archives/C01P749MET0/p1655047418496499

jasonbosco commented 2 years ago

Thank you for documenting this Ross!

Summarizing our additional discussions on Slack:

One potential solution to this could be to query Firestore on each change trigger and push the latest version of the Firestore document to Typesense, instead of using the snapshot document from the event (which is what could be stale based on the order of execution of events of the change triggers).

Also, to avoid redundant calls to Typesense when changes occur in quick succession, as an optimization we could have a way to store the last updated timestamp in a "sync_state" Firestore collection for each document... Thing to consider with this is the $ cost of maintaining another Firestore collection.

OrionReed commented 1 year ago

Proposed Solution

Add a query firestore instead of using event payload flag in the config (exact name tbd).

Enabling this flag would do as @jasonbosco described above and read directly from Firestore, ensuring the latest data is pulled. This option would mean consistent syncs at the cost of extra queries. Exposing this as a config option lets users determine if the tradeoff is worthwhile to them.

OrionReed commented 1 year ago

Created a PR #42 for this suggested change. It's a starting point and is shamefully untested.

jasonbosco commented 1 year ago

@OrionReed Thank you for PR #42!

I've published the change in this RC release: v1.0.0-rc.3.

Could you visit the link above and use the installation link under "Installation Instructions" and let me know how it goes?

OrionReed commented 1 year ago

Switched over today, I'll report back with any problems after I let it run for a few days

jasonbosco commented 1 year ago

Feel free to re-open if you see any issues.