anitarua closed this pull request 2 months ago.
Also manually tested against a stubbed storage service and verified that it will end after 5 seconds with a CANCELLED error if the server doesn't return anything.
Out of curiosity, how did you do this? It might be really valuable to document / share with the team.
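One way to do this kind of stub test (a sketch only, not necessarily how it was done here; the proto path and service/method names below are placeholders) is to run a local @grpc/grpc-js server whose handlers simply never respond, so the only thing that can end the call is the client's own timeout/retry logic:

import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Placeholder proto path and service name; substitute the real storage service definition.
const packageDefinition = protoLoader.loadSync('protos/store.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const server = new grpc.Server();
server.addService(proto.store.Store.service, {
  // Handlers intentionally never invoke the callback, so the server never responds.
  get: (_call: grpc.ServerUnaryCall<unknown, unknown>, _callback: grpc.sendUnaryData<unknown>) => {
    /* never respond */
  },
  put: (_call: grpc.ServerUnaryCall<unknown, unknown>, _callback: grpc.sendUnaryData<unknown>) => {
    /* never respond */
  },
});

server.bindAsync('127.0.0.1:50051', grpc.ServerCredentials.createInsecure(), () => {
  // On older grpc-js versions you may also need to call server.start() here.
  console.log('stub storage server listening on :50051');
});

Pointing the client at an endpoint like this should let you observe the 5-second CANCELLED behavior described above.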
alternately, we could try tuning the retry options in the grpc service config channel option?
* https://grpc.io/docs/guides/retry/
* https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/README.md#supported-channel-options

except they don't recommend using this with a retry interceptor, and without the interceptor setup we currently have, I don't think we'd have as fine-grained control over when to retry and for what methods (e.g. inspecting metadata to determine retries)
Correct, we've ruled those out in the past because they don't give us enough control.
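For reference, the channel-option approach being ruled out here looks roughly like the following; the service name is a placeholder and the policy values are illustrative, not a recommendation:

// Illustrative gRPC service config with a retry policy (placeholder service name).
const serviceConfig = {
  methodConfig: [
    {
      name: [{service: 'store.Store'}],
      retryPolicy: {
        maxAttempts: 5,
        initialBackoff: '1s',
        maxBackoff: '5s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
};

// These would be passed as channel options when the underlying grpc channel is created.
const channelOptions = {
  'grpc.enable_retries': 1,
  'grpc.service_config': JSON.stringify(serviceConfig),
};

This retries on a fixed set of status codes per method, but it can't inspect response metadata or coordinate with a custom interceptor, which is the finer-grained control discussed above.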
tested this locally and the behavior looks about right now - thanks!
one functional thing I think we are still missing is jitter: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
you can just do something like:
const jitter = (Math.random() * (0.2 * delay)) - (0.1 * delay);
I think that will give you somewhere between -10% and +10% jitter.
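For example, applied to whatever base delay the strategy computed (a sketch; the addJitter helper and the 1000 ms base delay are just illustrative):

// Sketch: apply roughly +/-10% jitter to a base retry delay (in millis).
function addJitter(delayMillis: number): number {
  const jitter = Math.random() * (0.2 * delayMillis) - 0.1 * delayMillis;
  return delayMillis + jitter;
}

// e.g. for a 1000 ms base delay this returns something in roughly [900, 1100)
const whenToRetryMillis = addJitter(1000);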
will review the code more closely in a moment but I think there are still some API things we should address here before we consider merging. Here are some rough thoughts, happy to discuss more concrete ideas/proposals:
- Since we've kept the RetryStrategy interface generic, we can't really give them a clean path to do config.getRetryStrategy().withBlah(); they will have to build a retry strategy object up on their own. Which is fine, but that means we want to make that as easy as possible and maybe even provide a factory function for it.
- I don't think this retry strategy (which you've currently called DefaultStorageRetryStrategy) is really specific to storage use cases. I think we can give it a generic name and theoretically allow it to be used with other clients in the future. So we just need to come up with a name... for the other one we used FixedCountRetryStrategy, for this one I think it could maybe be called RetryWithinDeadlineStrategy or something like that, but I'm open to other ideas.
- Note: it's going to be important that the computation for whether or not we can still retry lives inside of the retry strategy class itself; I think right now you might have it plumbed directly into the client. If you need to add more metadata to DetermineWhenToRetryRequestProps in order to allow the retry strategy to compute whether we can still retry or not, that would be fine (rough sketch after this list).
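To make the first and last points a bit more concrete, here's the kind of shape meant here; every name and field below is hypothetical, not the SDK's actual definitions:

// Hypothetical shapes, for illustration only.
interface DetermineWhenToRetryProps {
  grpcStatusCode: number;
  attemptNumber: number;
  requestStartedAtMillis: number; // when the original request was first issued
}

interface RetryStrategy {
  // Returns the delay in millis before the next attempt, or null to stop retrying.
  determineWhenToRetryRequest(props: DetermineWhenToRetryProps): number | null;
}

// A factory helper so users don't have to assemble the strategy object by hand.
function retryWithinDeadlineStrategy(overallTimeoutMillis: number, retryDelayMillis = 1000): RetryStrategy {
  return {
    determineWhenToRetryRequest(props: DetermineWhenToRetryProps): number | null {
      // The "can we still retry?" computation lives inside the strategy itself, not in the client:
      // stop as soon as another attempt could not fit within the overall deadline.
      const elapsedMillis = Date.now() - props.requestStartedAtMillis;
      if (elapsedMillis + retryDelayMillis >= overallTimeoutMillis) {
        return null;
      }
      return retryDelayMillis;
    },
  };
}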
let me know if you have any questions or want to zoom to brainstorm on any of this.
cc: @malandis if you have any thoughts on the names I floated, or in general the UX of a user explicitly configuring their strategy rather than using the default, would love to hear ideas
I kinda think RetryStrategy should remain together as a phrase though. Some additional ideas:
* FixedDeadlineRetryStrategy
* TimeIntervalRetryStrategy
* IntervalRetryStrategy
good call, I agree with you about keeping the words RetryStrategy together. FixedDeadlineRetryStrategy or maybe just FixedTimeoutRetryStrategy are probably my favorite names so far, but I'm not swooning over either of them. Will keep thinking on it, and also maybe @malandis has good ideas. But hopefully that's a minor change that we can do at the end and doesn't block your work to refactor the logic into the strategy class.
FixedDeadlineRetryStrategy or FixedTimeoutRetryStrategy sounds good to me.

I think we may also want to rename the interceptor that goes with it from StorageClientTimeoutInterceptor to something similar. FixedDeadlineInterceptor or FixedTimeoutInterceptor?
oh, I didn't realize you had to create a custom interceptor but that makes sense in retrospect.

The ideal scenario would be that we only have one RetryInterceptor, and that we have augmented the RetryStrategy interface enough for that single interceptor to work with all the different retry strategies.

If we can't figure out how to do that, then we're in a situation where users can't swap in a different RetryStrategy without also changing the interceptor, and that will make our API/UX a lot more burdensome/confusing. So if we can't figure out how to pull off a single shared interceptor then we should push pause and talk through options.

happy to pair on what the refactoring of the shared interceptor would need to look like in order for it to successfully accommodate multiple strategies.
checking back in on this: I think I'm leaning FixedTimeoutRetryStrategy, so we will have that one and FixedCountRetryStrategy as the two pre-existing ones. @malandis if you have any other ideas here lmk.
I think I've addressed all the feedback so far since the last major revision:
- cleaned up the RetryUntilTimeoutInterceptor implementation (it was messy from trying to debug the retry interceptor bug)
- added responseDataReceivedTimeoutMillis to the RetryStrategy interface, and ClientTimeoutInterceptor checks for the existence of that instead of the instanceof check (see the sketch below)
- added jitter to the whenToRetry value
- renamed the strategy class to FixedTimeoutRetryStrategy

My remaining question is whether to revert some renamings in transport/grpc configs.

Any other questions/concerns on your end, @cprice404 and @malandis?
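A minimal sketch of the "optional field instead of instanceof" idea mentioned above (the property and strategy names match the discussion, but the surrounding code is simplified and hypothetical):

// Simplified sketch; not the SDK's actual interceptor wiring.
interface RetryStrategy {
  determineWhenToRetryRequest(props: {grpcStatusCode: number; attemptNumber: number}): number | null;
  // Optional: only strategies that want a per-attempt "no response data" deadline set this.
  responseDataReceivedTimeoutMillis?: number;
}

// Hypothetical helper: instead of `strategy instanceof FixedTimeoutRetryStrategy`,
// the shared interceptor just looks for the optional field, so any strategy that
// exposes it works with the same interceptor.
function deadlineForAttempt(strategy: RetryStrategy, overallClientTimeoutMillis: number): number {
  return strategy.responseDataReceivedTimeoutMillis ?? overallClientTimeoutMillis;
}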
Test is failing -- discovered that the interceptor is not resetting the overall deadline between different successful requests. Currently revisiting the design of the fixed timeout interceptor.
I pulled this down and tested it with this code:
const loggerFactory = new DefaultMomentoLoggerFactory(DefaultMomentoLoggerLevel.TRACE);
const retryStrategy = new FixedTimeoutRetryStrategy({
  loggerFactory: loggerFactory,
  responseDataReceivedTimeoutMillis: 100,
  retryDelayIntervalMillis: 1000,
});
const storageClient = new PreviewStorageClient({
  configuration: StorageConfigurations.Laptop.latest(loggerFactory).withRetryStrategy(retryStrategy),
  credentialProvider: CredentialProvider.fromEnvironmentVariable('MOMENTO_API_KEY').withMomentoLocal(),
});
My thoughts:
- The logs showed attemptNumber: 6 and 7. That threw me off for a second because I was only expecting 5 requests. Then I watched the server-side logs and confirmed it was actually only 5 requests, so I think it's just a slightly unexpected path through the retry code that causes those last two sets of logs to get printed, even though the request isn't actually being issued again. This is not really a big deal, not worth spending a ton of time on, but if you are seeing this in your logs too and you see any easy way to clear up those last two chunks of logs, that'd be bonus points.

I patched @anitarua's latest version to combine the ClientTimeoutInterceptor into the RetryInterceptor. They were starting to have too much overlap, and the RetryInterceptor was already set up in a way that made it easy to create local variables per-request to keep track of overall deadlines, and just close over those variables with the higher-order functions. This solves the problem where we were trying to maintain that state as a member variable on the ClientTimeoutInterceptor class.
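Roughly, the closure pattern looks like this (a heavily simplified sketch of the idea, not the actual RetryInterceptor; the logic that re-issues the call on retryable statuses is omitted):

import {InterceptingCall, Interceptor} from '@grpc/grpc-js';

// Sketch: per-request deadline state lives in local variables that the interceptor's
// inner functions close over, rather than in a class member shared across requests.
function createRetryInterceptor(overallTimeoutMillis: number, attemptTimeoutMillis: number): Interceptor {
  return (options, nextCall) => {
    // Created fresh for every outgoing request, so the overall deadline
    // automatically resets between successful requests.
    const overallDeadline = new Date(Date.now() + overallTimeoutMillis);

    const nextAttemptDeadline = (): Date => {
      const attemptDeadline = new Date(Date.now() + attemptTimeoutMillis);
      // Never let a single attempt run past the overall deadline.
      return attemptDeadline.getTime() < overallDeadline.getTime() ? attemptDeadline : overallDeadline;
    };

    // The real interceptor also retries the call when appropriate; this only shows
    // how each attempt's deadline is derived from the closed-over per-request state.
    return new InterceptingCall(nextCall({...options, deadline: nextAttemptDeadline()}));
  };
}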
No public API changes from Anita's version.
Tested locally using the program below; I believe this is ready for review/merge now:
import {
  CreateStoreResponse,
  CredentialProvider,
  DefaultMomentoLoggerFactory,
  DefaultMomentoLoggerLevel,
  FixedTimeoutRetryStrategy,
  PreviewStorageClient,
  StorageConfigurations,
  StorageGetResponse,
  StoragePutResponse,
} from '@gomomento/sdk';

async function startWorker(storageClient: PreviewStorageClient, id: number) {
  await delay(Math.random() * 1000);
  const storeName = 'my-store';
  const putResponse = await storageClient.putString(storeName, `test-key-${id}`, `test-value-${id}`);
  switch (putResponse.type) {
    case StoragePutResponse.Success:
      console.log(`Key 'test-key-${id}' stored successfully`);
      break;
    case StoragePutResponse.Error:
      console.log(`Error occurred when trying to store key 'test-key-${id}': ${putResponse.toString()}`);
      break;
    // throw new Error(
    //   `An error occurred while attempting to store key 'test-key-${id}' in store '${storeName}': ${putResponse.errorCode()}: ${putResponse.toString()}`
    // );
  }

  const getResponse = await storageClient.get(storeName, `test-key-${id}`);
  // pattern-matching style; safer for production code
  switch (getResponse.type) {
    case StorageGetResponse.Found:
      // if you know the value is a string:
      console.log(`Retrieved value for key 'test-key-${id}': ${getResponse.value().string()!}`);
      break;
    case StorageGetResponse.NotFound:
      console.log(`Key 'test-key-${id}' was not found in store '${storeName}'`);
      break;
    case StorageGetResponse.Error:
      console.log(`Error occurred when trying to get key 'test-key-${id}': ${getResponse.toString()}`);
    // throw new Error(
    //   `An error occurred while attempting to get key 'test-key' from store '${storeName}': ${getResponse.errorCode()}: ${getResponse.toString()}`
    // );
  }
}

async function main() {
  const loggerFactory = new DefaultMomentoLoggerFactory(DefaultMomentoLoggerLevel.TRACE);
  const retryStrategy = new FixedTimeoutRetryStrategy({
    loggerFactory: loggerFactory,
    responseDataReceivedTimeoutMillis: 1000,
    retryDelayIntervalMillis: 10,
  });
  const storageClient = new PreviewStorageClient({
    configuration: StorageConfigurations.Laptop.latest(loggerFactory).withRetryStrategy(retryStrategy),
    credentialProvider: CredentialProvider.fromEnvironmentVariable('MOMENTO_API_KEY').withMomentoLocal(),
  });

  const storeName = 'my-store';
  const createStoreResponse = await storageClient.createStore(storeName);
  switch (createStoreResponse.type) {
    case CreateStoreResponse.AlreadyExists:
      console.log(`Store '${storeName}' already exists`);
      break;
    case CreateStoreResponse.Success:
      console.log(`Store '${storeName}' created`);
      break;
    case CreateStoreResponse.Error:
      throw new Error(
        `An error occurred while attempting to create store '${storeName}': ${createStoreResponse.errorCode()}: ${createStoreResponse.toString()}`
      );
  }

  const workerPromises = range(5).map(i => startWorker(storageClient, i));
  await Promise.all(workerPromises);
}

function range(n: number): Array<number> {
  return [...Array(n).keys()];
}

function delay(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

main().catch(e => {
  console.error('An error occurred! ', e);
  throw e;
});
cc: @malandis may want to get your eyes on this when you have some time.
Ran the topic subscribe example with this code; turned off wifi, observed the client detect the connection issue and go into the retry loop; when I turned the wifi back on it reconnected successfully.
Closes https://github.com/momentohq/dev-eco-issue-tracker/issues/878
Prototyping an implementation of a storage client retry strategy. Uses a new interceptor to create timeouts and retry every second that data is not received, up to a 5s overall timeout.
Added some unit tests against the new storage retry strategy class and the changes to the storage client config. Also manually tested against a stubbed storage service and verified that it will end after 5 seconds with a CANCELLED error if the server doesn't return anything.