momentohq / client-sdk-javascript

Official JavaScript SDK for Momento Serverless Cache
Apache License 2.0

feat: add retry strategy to storage client #1396

Closed: anitarua closed this PR 2 months ago

anitarua commented 3 months ago

Closes https://github.com/momentohq/dev-eco-issue-tracker/issues/878

Prototyping an implementation of the storage client retry strategy. It uses a new interceptor to create per-attempt timeouts and retries every second that no data is received, up to a 5s overall timeout.

Added some unit tests against the new storage retry strategy class and the changes to the storage client config. Also manually tested against a stubbed storage service and verified that it ends after 5 seconds with a CANCELLED error if the server doesn't return anything.

cprice404 commented 3 months ago

> Also manually tested against a stubbed storage service and verified that it ends after 5 seconds with a CANCELLED error if the server doesn't return anything.

Out of curiosity, how did you do this? It might be really valuable to document / share with the team.
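One way to set up such a stub (a hypothetical sketch, not necessarily what was done here; the proto file, package path, and service name are placeholders) is a @grpc/grpc-js server whose handler simply never responds:

```typescript
// Hypothetical sketch: register a handler that never invokes its callback,
// so the client receives no response data and its retry/timeout logic is
// exercised. 'storage.proto' and the service/package names are placeholders.
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

const packageDefinition = protoLoader.loadSync('storage.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const server = new grpc.Server();
server.addService(proto.store.Store.service, {
  // Intentionally never call the callback: the client sees a hung request,
  // its per-attempt timeout fires, and it retries until the overall deadline.
  get: (_call: grpc.ServerUnaryCall<unknown, unknown>, _callback: grpc.sendUnaryData<unknown>) => {
    /* never responds */
  },
});

// On grpc-js >= 1.10 the server starts serving once bindAsync completes.
server.bindAsync('127.0.0.1:50051', grpc.ServerCredentials.createInsecure(), err => {
  if (err) throw err;
});
```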

anitarua commented 3 months ago

Alternately, we could try tuning the retry options in the grpc service config channel option?

Except they don't recommend using this with a retry interceptor, and without the interceptor setup we currently have, I don't think we'd have as fine-grained control over when to retry and for which methods (e.g. inspecting metadata to determine retries).

cprice404 commented 3 months ago

> Alternately, we could try tuning the retry options in the grpc service config channel option?

* https://grpc.io/docs/guides/retry/

* https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/README.md#supported-channel-options

> Except they don't recommend using this with a retry interceptor, and without the interceptor setup we currently have, I don't think we'd have as fine-grained control over when to retry and for which methods (e.g. inspecting metadata to determine retries).

Correct, we've ruled those out in the past because they don't give us enough control.
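For reference, a sketch of the ruled-out channel-option approach (field names follow the gRPC retry spec linked above; the service name is a placeholder):

```typescript
// Sketch of the ruled-out approach: grpc-js accepts a service config as a
// channel option. Retries here are driven purely by status codes; there is
// no hook to inspect response metadata or apply other per-method logic.
const serviceConfig = {
  methodConfig: [
    {
      name: [{service: 'store.Store'}], // placeholder service name
      retryPolicy: {
        maxAttempts: 5,
        initialBackoff: '1s',
        maxBackoff: '5s',
        backoffMultiplier: 1.0,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
};
const channelOptions = {'grpc.service_config': JSON.stringify(serviceConfig)};
```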

cprice404 commented 2 months ago

Tested this locally and the behavior looks about right now - thanks!

One functional thing I think we're still missing is jitter: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

You can do something like:

```typescript
const jitter = (Math.random() * 0.2 * delay) - (0.1 * delay);
```

That will give you somewhere between -10% and +10% jitter.
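Applied to a base delay, that could look like the following (the helper name is illustrative):

```typescript
// Illustrative helper: applies roughly +/-10% jitter to a base retry delay,
// per the formula above.
function jitteredDelayMillis(delayMillis: number): number {
  const jitter = (Math.random() * 0.2 * delayMillis) - (0.1 * delayMillis); // in [-10%, +10%)
  return delayMillis + jitter;
}
```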

I'll review the code more closely in a moment, but I think there are still some API things we should address before we consider merging. Here are some rough thoughts; happy to discuss more concrete ideas/proposals:

  1. I think the `responseDataReceivedTimeoutMillis` should be part of the retry strategy, rather than part of the static grpc config. It's only ever relevant to / used by the retry strategy, right? I realize that you might need to add an accessor function to the `RetryStrategy` interface in order to make this work, but that feels right to me.
  2. We should be thinking about how a user would override this configuration if they wanted to. Because we want to keep the `RetryStrategy` interface generic, we probably can't give them a clean path to do `config.getRetryStrategy().withBlah()`; they will have to build a retry strategy object up on their own. That's fine, but it means we want to make that as easy as possible, and maybe even provide a factory function for it.
  3. I don't think this retry strategy (which you've currently called `DefaultStorageRetryStrategy`) is really specific to storage use cases. I think we can give it a generic name and theoretically allow it to be used with other clients in the future. So we just need to come up with a name... for the other one we used `FixedCountRetryStrategy`; for this one I think it could maybe be called `RetryWithinDeadlineStrategy` or something like that, but I'm open to other ideas.

Note: it's going to be important that the computation for whether or not we can still retry lives inside the retry strategy class itself; I think right now you might have it plumbed directly into the client. If you need to add more metadata to `DetermineWhenToRetryRequestProps` in order to allow the retry strategy to compute whether we can still retry, that would be fine.
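To make item 1 concrete, here is a rough, hypothetical sketch of the augmented interface (the props type below is an approximate stand-in for the SDK's existing `DetermineWhenToRetryRequestProps`; the timeout field is the proposed addition):

```typescript
// Hypothetical sketch: the per-attempt timeout lives on the strategy rather
// than the static grpc config, so the interceptor can query it there.
// Approximate stand-in for the SDK's existing props type.
interface DetermineWhenToRetryRequestProps {
  grpcStatus: {code: number};
  attemptNumber: number;
}

interface RetryStrategy {
  // Existing: returns the delay in ms before the next retry, or null if the
  // request should not be retried.
  determineWhenToRetryRequest(props: DetermineWhenToRetryRequestProps): number | null;
  // Proposed: how long to wait for response data on each individual attempt.
  responseDataReceivedTimeoutMillis?: number;
}
```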

Let me know if you have any questions or want to zoom to brainstorm on any of this.

cc: @malandis, if you have any thoughts on the names I floated, or in general on the UX of a user explicitly configuring their strategy rather than using the default, I'd love to hear ideas.

anitarua commented 2 months ago
> 3. I don't think this retry strategy (which you've currently called `DefaultStorageRetryStrategy`) is really specific to storage use cases. I think we can give it a generic name and theoretically allow it to be used with other clients in the future. So we just need to come up with a name... for the other one we used `FixedCountRetryStrategy`; for this one I think it could maybe be called `RetryWithinDeadlineStrategy` or something like that, but I'm open to other ideas.

I kinda think RetryStrategy should remain together as a phrase, though. Some additional ideas:

* `FixedDeadlineRetryStrategy`

* `TimeIntervalRetryStrategy`

* `IntervalRetryStrategy`

cprice404 commented 2 months ago
> > 3. I don't think this retry strategy (which you've currently called `DefaultStorageRetryStrategy`) is really specific to storage use cases. I think we can give it a generic name and theoretically allow it to be used with other clients in the future. So we just need to come up with a name... for the other one we used `FixedCountRetryStrategy`; for this one I think it could maybe be called `RetryWithinDeadlineStrategy` or something like that, but I'm open to other ideas.
>
> I kinda think RetryStrategy should remain together as a phrase, though. Some additional ideas:
>
> * `FixedDeadlineRetryStrategy`
>
> * `TimeIntervalRetryStrategy`
>
> * `IntervalRetryStrategy`

Good call, I agree with you about keeping the words RetryStrategy together. `FixedDeadlineRetryStrategy` or maybe just `FixedTimeoutRetryStrategy` are probably my favorite names so far, but I'm not swooning over either of them. I'll keep thinking on it, and maybe @malandis has good ideas too. But hopefully that's a minor change that we can do at the end, and it doesn't block your work to refactor the logic into the strategy class.

anitarua commented 2 months ago

`FixedDeadlineRetryStrategy` or `FixedTimeoutRetryStrategy` sounds good to me.

I think we may also want to rename the interceptor that goes with it from `StorageClientTimeoutInterceptor` to something similar: `FixedDeadlineInterceptor` or `FixedTimeoutInterceptor`?

cprice404 commented 2 months ago

> `FixedDeadlineRetryStrategy` or `FixedTimeoutRetryStrategy` sounds good to me.
>
> I think we may also want to rename the interceptor that goes with it from `StorageClientTimeoutInterceptor` to something similar: `FixedDeadlineInterceptor` or `FixedTimeoutInterceptor`?

Oh, I didn't realize you had to create a custom interceptor, but that makes sense in retrospect.

The ideal scenario would be that we only have one `RetryInterceptor`, and that we have augmented the `RetryStrategy` interface enough for that single interceptor to work with all the different retry strategies.

If we can't figure out how to do that, then we're in a situation where users can't swap in a different `RetryStrategy` without also changing the interceptor, and that will make our API/UX a lot more burdensome and confusing. So if we can't figure out how to pull off a single shared interceptor, we should push pause and talk through options.

Happy to pair on what the refactoring of the shared interceptor would need to look like in order for it to accommodate multiple strategies.
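As a sketch of the decision point such a single shared interceptor could have (the grpc-js interceptor plumbing is elided, and the strategy interface is a minimal stand-in):

```typescript
// Sketch of the single-shared-interceptor idea: all retry decisions are
// delegated to the configured strategy, so swapping strategies never
// requires swapping interceptors. Minimal stand-in for the strategy type.
interface MinimalRetryStrategy {
  determineWhenToRetryRequest(props: {grpcStatus: {code: number}; attemptNumber: number}): number | null;
}

function onStatusReceived(
  strategy: MinimalRetryStrategy,
  grpcStatus: {code: number},
  attemptNumber: number,
  retry: () => void,
  giveUp: () => void
): void {
  const retryAfterMillis = strategy.determineWhenToRetryRequest({grpcStatus, attemptNumber});
  if (retryAfterMillis !== null) {
    setTimeout(retry, retryAfterMillis); // the strategy decides both whether and when
  } else {
    giveUp(); // propagate the status to the caller
  }
}
```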

cprice404 commented 2 months ago

> `FixedDeadlineRetryStrategy` or `FixedTimeoutRetryStrategy` sounds good to me.
>
> I think we may also want to rename the interceptor that goes with it from `StorageClientTimeoutInterceptor` to something similar: `FixedDeadlineInterceptor` or `FixedTimeoutInterceptor`?

Checking back in on this: I think I'm leaning toward `FixedTimeoutRetryStrategy`, so we'll have that one alongside the pre-existing `FixedCountRetryStrategy`. @malandis, if you have any other ideas here, let me know.

anitarua commented 2 months ago

I think I've addressed all the feedback so far since the last major revision.

My remaining question is whether to revert some of the renamings in the transport/grpc configs.

Any other questions/concerns on your end, @cprice404 and @malandis?

anitarua commented 2 months ago

A test is failing: I discovered that the interceptor is not resetting the overall deadline between successive successful requests. I'm currently revisiting the design of the fixed timeout interceptor.

cprice404 commented 2 months ago

I pulled this down and tested it with this code:

```typescript
const loggerFactory = new DefaultMomentoLoggerFactory(DefaultMomentoLoggerLevel.TRACE);

const retryStrategy = new FixedTimeoutRetryStrategy({
  loggerFactory: loggerFactory,
  responseDataReceivedTimeoutMillis: 100,
  retryDelayIntervalMillis: 1000,
});

const storageClient = new PreviewStorageClient({
  configuration: StorageConfigurations.Laptop.latest(loggerFactory).withRetryStrategy(retryStrategy),
  credentialProvider: CredentialProvider.fromEnvironmentVariable('MOMENTO_API_KEY').withMomentoLocal(),
});
```

My thoughts:

  1. I am good with this UX for overriding the retry strategy. Feels reasonable to me. Nice job!
  2. Anecdotally it seemed like it was probably doing the right thing; I saw 5 requests get through to the server, which is what I was expecting.
  3. It'd be nice if the trace-level logs in the retry strategy / timeout interceptor gave a bit more timing info; I'd like to see my actual configured values (100, 1000) appear in those logs so it'd be easy to mentally connect them back to my config.
  4. Also in the logs: when we fail and give up, it shows up as `attemptNumber: 6` and `7`. That threw me off for a second because I was only expecting 5 requests. Then I watched the server-side logs and confirmed it was actually only 5 requests, so I think it's just a slightly unexpected path through the retry code that causes those last two sets of logs to get printed even though the request isn't actually being issued again. This is not a big deal, not worth spending a ton of time on, but if you're seeing this in your logs too and you see an easy way to clear up those last two chunks of logs, that'd be bonus points.

cprice404 commented 2 months ago

I patched @anitarua's latest version to combine the `ClientTimeoutInterceptor` into the `RetryInterceptor`. They were starting to have too much overlap, and the `RetryInterceptor` was already set up in a way that made it easy to create per-request local variables to keep track of overall deadlines and simply close over those variables with the higher-order functions. This solves the problem where we were trying to maintain that state as a member variable on the `ClientTimeoutInterceptor` class.
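The pattern described above looks roughly like this (a simplified sketch; the retry wiring is elided and the 5s value is illustrative):

```typescript
import {InterceptingCall, Interceptor} from '@grpc/grpc-js';

// Simplified sketch of the closure pattern: a grpc-js interceptor is a
// factory invoked once per outgoing call, so variables declared inside it
// are naturally per-request, and the higher-order functions close over them.
// This avoids holding the deadline as a member variable on an interceptor
// class, which is what caused the deadline-reset bug.
const retryInterceptor: Interceptor = (options, nextCall) => {
  // Computed fresh for every request; never leaks into the next one.
  const overallDeadline = new Date(Date.now() + 5000);
  return new InterceptingCall(nextCall({...options, deadline: overallDeadline}), {
    // Requester hooks (start/sendMessage/etc.) would go here; each can read
    // overallDeadline without any shared class-level state.
  });
};
```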

No public API changes from Anita's version.

Tested locally using the program below; I believe this is ready for review/merge now:

```typescript
import {
  CreateStoreResponse,
  CredentialProvider,
  DefaultMomentoLoggerFactory,
  DefaultMomentoLoggerLevel,
  FixedTimeoutRetryStrategy,
  PreviewStorageClient,
  StorageConfigurations,
  StorageGetResponse,
  StoragePutResponse,
} from '@gomomento/sdk';

async function startWorker(storageClient: PreviewStorageClient, id: number) {
  await delay(Math.random() * 1000);
  const storeName = 'my-store';
  const putResponse = await storageClient.putString(storeName, `test-key-${id}`, `test-value-${id}`);
  switch (putResponse.type) {
    case StoragePutResponse.Success:
      console.log(`Key 'test-key-${id}' stored successfully`);
      break;
    case StoragePutResponse.Error:
      console.log(`Error occurred when trying to store key 'test-key-${id}': ${putResponse.toString()}`);
      break;
  }

  const getResponse = await storageClient.get(storeName, `test-key-${id}`);
  // pattern-matching style; safer for production code
  switch (getResponse.type) {
    case StorageGetResponse.Found:
      // if you know the value is a string:
      console.log(`Retrieved value for key 'test-key-${id}': ${getResponse.value().string()!}`);
      break;
    case StorageGetResponse.NotFound:
      console.log(`Key 'test-key-${id}' was not found in store '${storeName}'`);
      break;
    case StorageGetResponse.Error:
      console.log(`Error occurred when trying to get key 'test-key-${id}': ${getResponse.toString()}`);
      break;
  }
}

async function main() {
  const loggerFactory = new DefaultMomentoLoggerFactory(DefaultMomentoLoggerLevel.TRACE);

  const retryStrategy = new FixedTimeoutRetryStrategy({
    loggerFactory: loggerFactory,
    responseDataReceivedTimeoutMillis: 1000,
    retryDelayIntervalMillis: 10,
  });

  const storageClient = new PreviewStorageClient({
    configuration: StorageConfigurations.Laptop.latest(loggerFactory).withRetryStrategy(retryStrategy),
    credentialProvider: CredentialProvider.fromEnvironmentVariable('MOMENTO_API_KEY').withMomentoLocal(),
  });

  const storeName = 'my-store';
  const createStoreResponse = await storageClient.createStore(storeName);
  switch (createStoreResponse.type) {
    case CreateStoreResponse.AlreadyExists:
      console.log(`Store '${storeName}' already exists`);
      break;
    case CreateStoreResponse.Success:
      console.log(`Store '${storeName}' created`);
      break;
    case CreateStoreResponse.Error:
      throw new Error(
        `An error occurred while attempting to create store '${storeName}': ${createStoreResponse.errorCode()}: ${createStoreResponse.toString()}`
      );
  }

  const workerPromises = range(5).map(i => startWorker(storageClient, i));
  await Promise.all(workerPromises);
}

function range(n: number): Array<number> {
  return [...Array(n).keys()];
}

function delay(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

main().catch(e => {
  console.error('An error occurred! ', e);
  throw e;
});
```

cprice404 commented 2 months ago

cc: @malandis, we may want to get your eyes on this when you have some time.

cprice404 commented 2 months ago

Ran the topic subscribe example with this code: turned off the wifi and observed the client detect the connection issue and go into the retry loop; when I turned the wifi back on, it reconnected successfully.