microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications
https://fluidframework.com
MIT License
4.71k stars 530 forks source link

Fluid runtime crashes on network error #22214

Open d2e opened 1 month ago

d2e commented 1 month ago

Describe the bug

At certain condition when there is a network issue between fluid client and Routelicious, It result in unhandled promise and as a result node process exits.

Exception that occur is ( GenericNetworkError(errorMessage, errorMessage.startsWith("NetworkError")) 2024-07-31T13:19:35.839Z 377aa069-abf3-4366-8300-be03a126d813 ERROR Unhandled Promise Rejection { "errorType": "Runtime.UnhandledPromiseRejection", "errorMessage": "Error: R11s fetch error: \r\n502 Bad Gateway\r\n\r\n

502 Bad Gateway

\r\n\r\n\r\n", "reason": { "errorType": "genericNetworkError", "errorMessage": "R11s fetch error: \r\n502 Bad Gateway\r\n\r\n

502 Bad Gateway

\r\n\r\n\r\n", "omitPropsFromLogging": {}, "_errorInstanceId": "29506e27-40f8-485b-ad56-23b989687a1c", "fluidErrorCode": "-", "statusCode": 502, "driverVersion": "2.0.0-internal.7.4.3", "canRetry": true, "endpointReached": true, "stack": [ "Error: R11s fetch error: \r", "502 Bad Gateway\r", "\r", "

502 Bad Gateway

\r", "\r", "\r", "", " at createR11sNetworkError (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/errorUtils.cjs:61:21)", " at throwR11sNetworkError (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/errorUtils.cjs:73:26)", " at RouterliciousStorageRestWrapper.request (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/restWrapper.cjs:162:48)", " at runMicrotasks ()", " at processTicksAndRejections (node:internal/process/task_queues:96:5)", " at async /var/task/node_modules/@fluidframework/routerlicious-driver/dist/shreddedSummaryDocumentStorageService.cjs:100:31", " at async Function.timedExecAsync (/var/task/node_modules/@fluidframework/telemetry-utils/dist/logger.js:496:25)", " at async ShreddedSummaryDocumentStorageService.readBlob (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/shreddedSummaryDocumentStorageService.cjs:95:23)" ] }, "promise": {}, "stack": [ "Runtime.UnhandledPromiseRejection: Error: R11s fetch error: \r", "502 Bad Gateway\r", "\r", "

502 Bad Gateway

\r", "\r", "\r", "", " at process. (file:///var/runtime/index.mjs:1276:17)", " at process.emit (node:events:525:35)", " at emit (node:internal/process/promises:140:20)", " at processPromiseRejections (node:internal/process/promises:274:27)", " at processTicksAndRejections (node:internal/process/task_queues:97:32)" ] }

To Reproduce

I am able to reproduce the error while using wifi as well i stop the wifi during the collaboration and then when fluid client is trying to create Steps to reproduce the behavior: I am able to reproduce though it require quite a bit of attempt, when i turn of wifi in mid of collaboration then also some times I am able to see this error Its very random I have observed 1 or 2 such issue on daily basis. Since one of the user is running fluid client on top of lambda as bot, unhandle rejection is causing it to crash. restWrapper.mjs:101 Uncaught (in promise) Error: NetworkError: Network request failed at restWrapper.mjs:101:1 at async VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:207786:28 at async RouterliciousStorageRestWrapper.request (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:207784:21) at async VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:208204:31 at async PerformanceEvent.timedExecAsync (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:48395:25) at async ShreddedSummaryDocumentStorageService.readBlob (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:208199:23)

Expected behavior

when I dig little bit I notice that catch is missing in ShreddedSummaryDocumentStorageService.readBlob function

public async readBlob(blobId: string): Promise { const cachedBlob = await this.blobCache?.get(this.getCacheKey(blobId)); if (cachedBlob) { return cachedBlob; }

    const value = await PerformanceEvent.timedExecAsync(
        this.logger,
        {
            eventName: "readBlob",
            blobId,
        },
        async (event) => {
            const manager = await this.getStorageManager();
            const response = (await manager.getBlob(blobId)).content;
            event.end({
                size: response.size,
            });
            return response;
        },
        undefined, // workers
        this.mc.config.getNumber("Fluid.Driver.ReadBlobTelemetrySampling"),
    );
    this.blobsShaCache.set(value.sha, "");
    const bufferContent = stringToBuffer(value.content, value.encoding);
    await this.blobCache?.put(this.getCacheKey(value.sha), bufferContent);
    return bufferContent;
}

Logs

znewton commented 3 weeks ago

Since the implementation of readBlob there appropriately uses async/await, I actually doubt that the issue is within readBlob itself. We would expect readBlob to throw its errors, and catching those without throwing would possibly cause problems in some areas that should be retrying at a higher level.

Instead, I suspect the problem is within PrefetchDocumentStorageService, which attempts to pre-load the snapshot as soon as the version is requested, potentially saving a couple seconds on load. Within PrefetchDocumentStorageService, there are calls to readBlob that use the void prefix, like so:

        private prefetchTree(tree: ISnapshotTree) {
        const secondary: string[] = [];
        this.prefetchTreeCore(tree, secondary);

        for (const blob of secondary) {
            // We don't care if the prefetch succeeds
            void this.cachedRead(blob);
        }
    }

While the comment is correct, we don't care if the prefetch succeeds, an error there would throw an UncaughtException.

Proof Script:

/**
 * Creates a rejected promise.
 */
async function throwErrorInPromise(id: string): Promise<never> {
  throw new Error(`Error - ${id}`);
}

(async function main(): Promise<void> {
  process.on("uncaughtException", (error: Error) => {
    // Log, instead of die, on uncaught exceptions.
    console.error(`Uncaught exception: ${error.message}`);
  });

  process.on("unhandledRejection", (error: Error) => {
    // Log, instead of die, on uncaught rejections.
    console.error(`Unhandled rejection: ${error.message}`);
  });

  // This should result in an Unhandled Rejection
  // which will be caught by the process.on("unhandledRejection") handler
  // and logged as "Unhandled rejection: Error - void call"
  void throwErrorInPromise("void call");
  // This should just log the error as "Caught error: Error - catch call"
  throwErrorInPromise("catch call").catch((error: Error) => {
    console.error(`Caught error: ${error.message}`);
  });
  // This should just log the error as "Try..catch error: Error - await call"
  try {
    await throwErrorInPromise("await call");
  } catch (error: unknown) {
    console.error(
      `Try..catch error: ${error instanceof Error ? error.message : error}`,
    );
  }
})();

Output

Caught error: Error - catch call
Try..catch error: Error - await call
Unhandled rejection: Error - void call
znewton commented 3 weeks ago

@d2e to mitigate the issue on your end until a fix is released, you can try setting the enablePrefetch driver policy to false when instantiating the RouterliciousDocumentServiceFactory in your app, like

new RouterliciousDocumentServiceFactory(myTokenProvider, { enablePrefetch: false });

Let me know if you can/can't repro after making that change. Thanks!