@DevLazio Created this for the retry feature we talked about. I listed a few things that I thought would be good objectives for this functionality, along with some of the things that good retry implementations typically have.
I found a few interesting crates. The first is the backoff crate, which looks promising: it provides mechanisms for doing retries and has reqwest examples.
The next is reqwest-retry, which is a wrapper around reqwest that provides retry functionality. I don't want to use this crate, mainly because it wraps the reqwest Client and provides its own interface for making requests, which means you lose the choice of reqwest version and any features that the wrapper hasn't adopted yet. But it's something to look at with regard to how a retry implementation, or some of its functionality, could be done.
This is the design document for retry handlers in Microsoft's SDKs, and I think following this specification will probably work out best. https://github.com/microsoftgraph/msgraph-sdk-design/blob/master/middleware/RetryHandler.md
Kiota is Microsoft's code generator for its different SDKs. It has good examples of retry objects implemented with different scenarios and properties.
https://github.com/microsoft/kiota-http-dotnet/blob/main/src/Middleware/RetryHandler.cs
https://github.com/microsoft/kiota-http-dotnet/blob/main/src/Middleware/Options/RetryHandlerOption.cs
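For reference, a rough sketch of how that retry handler's options might translate to Rust. The names, defaults, and status-code set below are only illustrative of what the design doc describes; this is not an existing graph-rs or Kiota API.

use std::time::Duration;

// Illustrative only: a per-request options object the retry handler could consult,
// loosely mirroring Kiota's RetryHandlerOption.
pub struct RetryHandlerOption {
    /// Maximum number of retries before giving up.
    pub max_retries: u32,
    /// Base delay between attempts, grown exponentially by the handler.
    pub delay: Duration,
    /// Upper bound on the total time spent retrying.
    pub retries_time_limit: Duration,
    /// Predicate deciding whether a response is retryable
    /// (the design doc retries 429, 503 and 504 and honors Retry-After).
    pub should_retry: fn(status: u16) -> bool,
}

impl Default for RetryHandlerOption {
    fn default() -> Self {
        RetryHandlerOption {
            max_retries: 3,
            delay: Duration::from_secs(3),
            retries_time_limit: Duration::from_secs(180),
            should_retry: |status| matches!(status, 429 | 503 | 504),
        }
    }
}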
Thanks, will take a look at this!
I finally had a bit of time to look into this. The backoff crate looks like the right choice for the task. The only pain point is that it takes an async FnMut function, so I had to do a bit of gymnastics around it.
Nothing is final, just looking for comments at this time.
Usage:
let config = GraphClientConfiguration::new()
    .retries(true)
    .retries_config(RetriesConfig {
        initial_interval: Duration::from_millis(500),
        randomization_factor: 0.0,
        multiplier: 1.5,
        max_interval: Duration::from_secs(450),
        max_elapsed_time: Some(Duration::from_millis(450)),
    })
    .access_token(ACCESS_TOKEN);
let client = Graph::from(config);

for _i in 0..100 {
    let response = client
        .reports()
        .get_email_activity_counts_by_period("D180")
        .send()
        .await
        .unwrap();
    println!("Got response {}", response.status());
}
It's not necessary to provide a RetriesConfig; there is already a default one with values that seem okay enough for the Graph API, imo.
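For anyone curious what that default roughly looks like, here is the shape of it. The exact values below are only a guess that mirrors the backoff crate's own defaults; the real ones live in the branch.

use std::time::Duration;

// Illustrative Default for RetriesConfig; values mirror the backoff crate's defaults
// and may differ from what the branch actually ships.
#[derive(Clone, Debug)]
pub struct RetriesConfig {
    pub initial_interval: Duration,
    pub randomization_factor: f64,
    pub multiplier: f64,
    pub max_interval: Duration,
    pub max_elapsed_time: Option<Duration>,
}

impl Default for RetriesConfig {
    fn default() -> Self {
        RetriesConfig {
            initial_interval: Duration::from_millis(500),
            randomization_factor: 0.5,
            multiplier: 1.5,
            max_interval: Duration::from_secs(60),
            max_elapsed_time: Some(Duration::from_secs(15 * 60)),
        }
    }
}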
I guess I would say: can you give me more information on how it would be used? Is the randomization factor based off of the backoff crate? I didn't go too far into looking at the backoff crate and its use, so I apologize. What's confusing about that is that exponential backoff typically increases the rate at which the backoff happens, and this is typically done at a 2x factor. For instance, the first retry happens at 2, then 4, then 8, then 16. This is how it's done in the other Graph SDKs as well.
For retry functionality we should allow providing policies such as an ExponentialBackoffPolicy, where the user can set certain fields that change how the retry is done, whereas fields like max retry time or max retries allowed will be set internally. The user can still set those, but only if they don't go over the given limit. And that limit would be based off of the other Graph SDKs, to be consistent.
The unofficial Rust Azure SDK is similar but with the possibility of adding a random delay: https://github.com/Azure/azure-sdk-for-rust/blob/main/sdk/core/src/policies/retry_policies/exponential_retry.rs - I wonder if the randomization factor is for the same thing?
I guess I would say: can you give me more information on how it would be used? Is the randomization factor based off of the backoff crate? I didn't go too far into looking at the backoff crate and its use, so I apologize. What's confusing about that is that exponential backoff typically increases the rate at which the backoff happens, and this is typically done at a 2x factor. For instance, the first retry happens at 2, then 4, then 8, then 16. This is how it's done in the other Graph SDKs as well.
The five parameters are from the backoff crate. At first I was thinking that we could provide a simpler configuration, but then I realised that at some point we would get a ticket to add such-and-such parameter, so I "re-exported" them all.
For retry functionality we should allow providing policies such as an ExponentialBackoffPolicy, where the user can set certain fields that change how the retry is done, whereas fields like max retry time or max retries allowed will be set internally. The user can still set those, but only if they don't go over the given limit. And that limit would be based off of the other Graph SDKs, to be consistent.
There is an open issue about adding a max retry count, but currently the backoff crate does not provide that functionality. I guess we can approximate it by modifying the time parameters, but it's not the same. If we want this, we have to implement it ourselves.
The unofficial Rust Azure SDK is similar but with the possibility of adding a random delay: https://github.com/Azure/azure-sdk-for-rust/blob/main/sdk/core/src/policies/retry_policies/exponential_retry.rs - I wonder if the randomization factor is for the same thing? As I understand it, pretty much.
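To make the parameters a bit more concrete, here is a small self-contained sketch of how I understand the backoff crate combines them (this is my reading of its docs, not its actual source): the interval grows by multiplier each attempt and is capped at max_interval, while the actual sleep is drawn uniformly from interval * (1 ± randomization_factor).

use rand::Rng;
use std::time::Duration;

// Returns (how long to sleep now, the interval to use for the next attempt).
fn next_sleep(
    current_interval: Duration,
    multiplier: f64,
    randomization_factor: f64,
    max_interval: Duration,
) -> (Duration, Duration) {
    // Jitter the current interval by the randomization factor.
    let jitter = rand::thread_rng()
        .gen_range(1.0 - randomization_factor..=1.0 + randomization_factor);
    let sleep = current_interval.mul_f64(jitter);
    // Grow the interval for the next attempt, capped at max_interval.
    let next_interval = current_interval.mul_f64(multiplier).min(max_interval);
    (sleep, next_interval)
}

fn main() {
    let mut interval = Duration::from_millis(500);
    for attempt in 1..=5 {
        let (sleep, next) = next_sleep(interval, 1.5, 0.5, Duration::from_secs(60));
        println!("attempt {attempt}: sleep ~{sleep:?}, next interval {next:?}");
        interval = next;
    }
}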
@DevLazio Yeah, probably the best thing to do here is to provide a configuration model, a wrapper around the backoff config, that has those fields:
For the max retries, that is something that will have to be done on our side. A counter of some sort.
The configuration model will need to be done in such a way as to allow other types of retry configuration models, such as a throttle retry configuration. This can be done by providing a mechanism that allows multiple types of configurations, such as an enum where each variant is one of the configurations, or by using a trait object and having the GraphConfiguration store a Box<dyn Trait> for those configurations.
struct ThrottleConfig;

#[derive(Clone, Debug)]
pub enum Retries {
    NoRetry,
    ExponentialBackoff(ExponentialBackoffConfig),
    Throttle(ThrottleConfig),
}

#[derive(Clone)]
struct ClientConfiguration {
    access_token: Option<String>,
    headers: HeaderMap,
    referer: bool,
    timeout: Option<Duration>,
    retries: Retries,
    connect_timeout: Option<Duration>,
    connection_verbose: bool,
    https_only: bool,
    min_tls_version: Version,
}
Something like that?
For the max retries, I can't find a way to implement an attempt counter, neither inside the closure nor outside. I think the only way is to implement it inside the ExponentialBackoff struct (see https://github.com/ihrwein/backoff/pull/60).
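One possible workaround, just a sketch and nothing from the branch (send_with_attempt_limit and max_attempts are made-up names): skip backoff::future::retry entirely and drive ExponentialBackoff by hand, layering an attempt counter on top.

use backoff::{backoff::Backoff, ExponentialBackoff};

// Manual retry loop that adds a max-attempt limit on top of the backoff crate,
// since the crate itself does not provide one.
async fn send_with_attempt_limit<T, E, F, Fut>(
    max_attempts: u32,
    mut send_request: F,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut backoff = ExponentialBackoff::default();
    let mut attempts = 0;
    loop {
        match send_request().await {
            Ok(response) => return Ok(response),
            Err(err) => {
                attempts += 1;
                // Give up when either the attempt counter or the crate's own
                // elapsed-time budget (next_backoff() returning None) runs out.
                match backoff.next_backoff() {
                    Some(delay) if attempts < max_attempts => tokio::time::sleep(delay).await,
                    _ => return Err(err),
                }
            }
        }
    }
}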
I'm curious, what would the Throttle retry do?
@DevLazio I haven't forgotten about this. Sorry for the delay; I've just been super busy between work and other things. I have a design that I think will work for this, and I can give a high-level overview so we can discuss. Just give me a few more days and I'll reply back with that design information.
I'm curious, what would the Throttle retry do?
The Graph API throttles the number of requests that can be sent to its servers based on a few factors, but mainly too many requests being sent at one time. Here is a small excerpt from their docs:
When a throttling threshold is exceeded, Microsoft Graph limits any further requests from that client for a period of time. When throttling occurs, Microsoft Graph returns HTTP status code 429 (Too many requests), and the requests fail. A suggested wait time is returned in the response header of the failed request. Throttling behavior can depend on the type and number of requests. For example, if you have a high volume of requests, all requests types are throttled. Threshold limits vary based on the request type. Therefore, you could encounter a scenario where writes are throttled but reads are still permitted.
When requests are throttled, Microsoft sends the response with the 429 error code, which can be used to detect throttling, as well as a Retry-After header which provides the number of seconds to wait before retrying the request. The throttle retry will wait that number of seconds and then retry the request. We can possibly add an option for doing these waits on another thread so it's non-blocking as well, but the default would block.
You can read more about it here: https://learn.microsoft.com/en-us/graph/throttling#best-practices-to-handle-throttling
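A minimal sketch of that throttle retry in plain reqwest terms (hypothetical function name and an arbitrary cap of 3 retries; the real implementation would live behind the client):

use reqwest::{header::RETRY_AFTER, Client, StatusCode};
use std::time::Duration;

// On a 429, read Retry-After, sleep that many seconds, then resend the request.
async fn get_with_throttle_retry(
    client: &Client,
    url: &str,
) -> reqwest::Result<reqwest::Response> {
    let mut attempts = 0;
    loop {
        let response = client.get(url).send().await?;
        if response.status() != StatusCode::TOO_MANY_REQUESTS || attempts >= 3 {
            return Ok(response);
        }
        // Graph reports the suggested wait in seconds; fall back to a small default
        // if the header is missing or unparsable.
        let wait = response
            .headers()
            .get(RETRY_AFTER)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<u64>().ok())
            .unwrap_or(2);
        tokio::time::sleep(Duration::from_secs(wait)).await;
        attempts += 1;
    }
}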
I also added a general summary in the issue description above of the two retry types that I think are a good starting place for getting the code into the library. Meaning, as long as those two are met we should merge it in.
@DevLazio
I will have some more information on this soon, but in general the implementation should follow that of an HTTP pipeline using policies. That can be a little confusing given it's such a broad term, but a simple description is that a pipeline stores a list of policies, which are things like retry policies, telemetry policies, or specific use-case policies, and these policies are called in a specific order on each request.
So policies are not specific to retrying requests, but the basics are:
A pipeline:
Retry policies will be included in this list of policies. This means the retry policy will modify the request in such a way that the transport policy knows how to handle any retries (there are different implementations and ways of doing things, but I am providing a simple example). The request that I am talking about may not be just the http Request object; it will have other fields, including a field for the actual http request that is required.
I will provide a general impl of this and the retry policies soon. It won't be too in-depth, but it will provide enough to understand what I have mentioned above, or at least make it clearer.
I like your pipeline approach. It might be especially useful when trying to follow the Microsoft RFC for retries as well as the user's needs.
The Graph API throttles the number of requests that can be sent to its servers based on a few factors, but mainly too many requests being sent at one time.
I thought you meant another thing. The way I started experimenting with it, the "429 Too Many Requests" responses and their Retry-After header have priority over the user-defined exponential backoff settings, which resembles your pipeline model. The exponential backoff was only used when there was no Retry-After header or another error happened. Having the enum be {None, ExponentialBackoff, Throttle} feels weird to me, because we kind of have to abide by the Graph API limitations (I discovered that for some endpoints you can keep trying before the Retry-After period is up, and with a bit of luck you can get a successful response if your request reaches another instance or something), so ExponentialBackoff should include Throttle. That's why I think your pipeline model is quite elegant.
We must be careful while implementing this pipeline though, because each endpoint has its own throttle limits, so the client must not delay a request just because a request to another endpoint got a 429 response.
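To illustrate that constraint, something like a per-endpoint map of "don't retry before" instants would be needed, so a 429 on one endpoint never delays requests to the others. Names are invented here, and keying on the URL path is a simplification of how Graph actually scopes its limits.

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Tracks throttling per endpoint so a 429 on one endpoint only delays that endpoint.
#[derive(Default)]
struct ThrottleState {
    retry_not_before: HashMap<String, Instant>,
}

impl ThrottleState {
    /// Record a Retry-After for one endpoint only.
    fn record_throttle(&mut self, endpoint: &str, retry_after: Duration) {
        self.retry_not_before
            .insert(endpoint.to_string(), Instant::now() + retry_after);
    }

    /// How long a request to this endpoint should still wait, if at all.
    fn remaining_delay(&self, endpoint: &str) -> Option<Duration> {
        self.retry_not_before
            .get(endpoint)
            .and_then(|until| until.checked_duration_since(Instant::now()))
    }
}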
I've been quite busy, so it's taking me longer to get back to you. I wanted to provide a quick example that follows what the other msgraph SDKs do, as far as a really simple example can.
This is a small example of what typical SDKs do, but not a thorough or exactly correct one by any means. Note how the policies take in a list of policies and call the next one in the list.
Not sure how much you've looked at HTTP pipelines when it comes to the Azure or Graph SDKs, but they typically take multiple types of policies, and to put it simply, each policy is basically a modification of the request followed by a call to the next policy.
use http::Request;
use serde_json::Value;
use std::error::Error;
use std::sync::Arc;

pub struct RequestContext {
    // ... request context fields
}

// request context impl somewhere

// Just here as an example. The actual struct/impl would be different.
pub struct SomePolicyResult;

pub trait HttpPipelinePolicy {
    // Modify the request.
    fn process_async(
        &self,
        context: &RequestContext,
        request: &mut http::Request<Value>,
        pipeline: &[Arc<dyn HttpPipelinePolicy>],
    ) -> Result<SomePolicyResult, Box<dyn Error>>;

    // Call the next policy in the pipeline, passing along the rest of the list.
    fn process_next_async(
        &self,
        context: &RequestContext,
        request: &mut http::Request<Value>,
        pipeline: &[Arc<dyn HttpPipelinePolicy>],
    ) -> Result<SomePolicyResult, Box<dyn Error>> {
        pipeline[0].process_async(context, request, &pipeline[1..])
    }
}

// Example only. Not exact at all.
pub struct ExponentialBackoffRetryPolicy {
    // ... retry fields
    pub min_retries: u32,
}

impl HttpPipelinePolicy for ExponentialBackoffRetryPolicy {
    fn process_async(
        &self,
        context: &RequestContext,
        request: &mut Request<Value>,
        pipeline: &[Arc<dyn HttpPipelinePolicy>],
    ) -> Result<SomePolicyResult, Box<dyn Error>> {
        // modify request...
        Ok(SomePolicyResult)
    }
}

pub struct ThrottleRetryPolicy {
    // ... impl
}

impl HttpPipelinePolicy for ThrottleRetryPolicy {
    fn process_async(
        &self,
        context: &RequestContext,
        request: &mut Request<Value>,
        pipeline: &[Arc<dyn HttpPipelinePolicy>],
    ) -> Result<SomePolicyResult, Box<dyn Error>> {
        // modify request...
        Ok(SomePolicyResult)
    }
}
I was trying to implement the HTTP pipeline and I encountered a few problems. First of all, contrary to something like a Tower middleware, I don't think it's possible to modify the HTTP request so that the transport policy knows how to do retries. And even if we could, that would mean that most of the logic is handled by the transport policy, leaving the others empty.
In the specific case of retries, each policy has its own state, and based on that the policy can:
There is no need to modify the request there; these policies just need to be able to retry, and the transport policy would basically be the same as the current send() method. Incidentally, the return type of these policies should be the request response.
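To illustrate that shape (all names invented, using the async-trait crate only to keep the sketch short): each policy returns the response, a retry policy simply loops around the rest of the chain, and the transport policy at the end is essentially the current send().

use async_trait::async_trait;
use std::sync::Arc;

// Each policy returns the final response instead of mutating the request.
#[async_trait]
pub trait PipelinePolicy: Send + Sync {
    async fn process(
        &self,
        request: reqwest::RequestBuilder,
        next: &[Arc<dyn PipelinePolicy>],
    ) -> reqwest::Result<reqwest::Response>;
}

pub struct SimpleRetryPolicy {
    pub max_attempts: u32,
}

#[async_trait]
impl PipelinePolicy for SimpleRetryPolicy {
    async fn process(
        &self,
        request: reqwest::RequestBuilder,
        next: &[Arc<dyn PipelinePolicy>],
    ) -> reqwest::Result<reqwest::Response> {
        let mut attempts = 0;
        loop {
            // Requests with non-cloneable (streaming) bodies can't be retried;
            // that is one of the wrinkles a real implementation has to handle.
            let attempt = request.try_clone().expect("cloneable request");
            // Hand off to the next policy in the chain (ultimately the transport).
            let response = next[0].process(attempt, &next[1..]).await?;
            attempts += 1;
            if response.status().is_success() || attempts >= self.max_attempts {
                return Ok(response);
            }
        }
    }
}

// The last policy in the chain: essentially the current send().
pub struct TransportPolicy;

#[async_trait]
impl PipelinePolicy for TransportPolicy {
    async fn process(
        &self,
        request: reqwest::RequestBuilder,
        _next: &[Arc<dyn PipelinePolicy>],
    ) -> reqwest::Result<reqwest::Response> {
        request.send().await
    }
}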
Do you have an example of some other policies you would want to implement, other than retries?
I just pushed a commit to implement the HTTP pipeline. A lot of work is still needed (giving the user the ability to customize pipeline policies, implementing a max number of retries, RequestContext, cleanup of unwraps, etc.), but the basic POC is there. Let me know what you think of it.
Hey, apologies, my work has been very busy lately and I haven't had time to look at this yet. I will take a look soon.
@DevLazio
Ok, so there is a good bit there and I think it's a good start. There were a few things I was confused on. One was why there needed to be a struct specifically to clone a pipeline policy. But I'm sure you had a reason; I just couldn't tell what was going on there.
On adding a policy to the pipeline, I would say it would make more sense to not require it to be wrapped in an Arc beforehand and just require providing an impl of the trait, but I didn't run the code myself, so I don't know if there was a reason for that specifically.
pub fn add_pipeline_policy(
    mut self,
    policy: Arc<dyn HttpPipelinePolicy + Send + Sync>,
) -> GraphClientConfiguration {
    self.config.pipeline.push(policy);
    self
}
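Concretely, the signature change being suggested would look something like this, taking any impl of the trait and creating the Arc internally (same types as in the branch, only the bound is new):

pub fn add_pipeline_policy<P>(mut self, policy: P) -> GraphClientConfiguration
where
    P: HttpPipelinePolicy + Send + Sync + 'static,
{
    // Callers pass a plain policy; the Arc is created here instead.
    self.config.pipeline.push(Arc::new(policy));
    self
}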
I think in general it looks good. I would say there probably are some areas of concern, such as why the pipeline takes a Client. My assumption is that the underlying owner of the pipeline's array of policies is also the owner of the client itself and would provide that itself. Because if we provide a way to allow the user to create new policies, especially ones where they may do some type of manipulation pre/post request, then they would probably also be implementing the HttpPipelinePolicy trait, and I don't know whether or not it's a good idea to give them access to the client. Maybe it is, I'm not sure. I will have to think about that one some more and about some of the possible benefits/drawbacks.
@DevLazio I'm also interested to know if you have ever looked at Tower and the middleware use cases that their library brings. Both reqwest and several other clients have added integrations for Tower.
They have middleware integrations that allow doing a very similar thing to what I described with the HTTP pipeline policies, and they even have retry policies baked in. Although you would still have to define more policies for this specific use case, there is benefit to using Tower over a custom-built one.
If you get a chance, take a look at their main crate docs below and just read the first couple of paragraphs to see what I mean. I have looked at it a little bit and have tested doing an integration, but didn't get too far into it.
Tower main crate docs: https://docs.rs/tower/latest/tower/index.html
Example doc of Tower Retry: https://docs.rs/tower/latest/tower/retry/trait.Policy.html
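For a quick taste of what that Policy trait looks like in practice, here is essentially the attempt-limit example from tower's docs, assuming the tower 0.4 signature and with String standing in for the real request/response types:

use std::future;
use tower::retry::Policy;

type Req = String;
type Res = String;

// Retry a fixed number of times; the policy state handed back from retry()
// is the decremented counter used for the next attempt.
#[derive(Clone)]
struct Attempts(usize);

impl<E> Policy<Req, Res, E> for Attempts {
    type Future = future::Ready<Self>;

    fn retry(&self, _req: &Req, result: Result<&Res, &E>) -> Option<Self::Future> {
        match result {
            // Treat any successful response as final.
            Ok(_) => None,
            // On error, retry while attempts remain, otherwise give up.
            Err(_) => {
                if self.0 > 0 {
                    Some(future::ready(Attempts(self.0 - 1)))
                } else {
                    None
                }
            }
        }
    }

    fn clone_request(&self, req: &Req) -> Option<Req> {
        // The request has to be cloned up front so it can be resent.
        Some(req.clone())
    }
}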
Sorry it took me so long to reply, I was on vacation.
Ok, so there is a good bit there and I think it's a good start. There were a few things I was confused on. One was why there needed to be a struct specifically to clone a pipeline policy. But I'm sure you had a reason; I just couldn't tell what was going on there.
It was needed to clone a struct containing a boxed trait object.
On adding a policy to the pipeline, I would say it would make more sense to not require it to be wrapped in an Arc beforehand and just require providing an impl of the trait, but I didn't run the code myself, so I don't know if there was a reason for that specifically.
I started with your example but could not make it compile without changing that.
I think in general it looks good. I would say there probably are some areas of concern, such as why the pipeline takes a Client. My assumption is that the underlying owner of the pipeline's array of policies is also the owner of the client itself and would provide that itself. Because if we provide a way to allow the user to create new policies, especially ones where they may do some type of manipulation pre/post request, then they would probably also be implementing the HttpPipelinePolicy trait, and I don't know whether or not it's a good idea to give them access to the client. Maybe it is, I'm not sure. I will have to think about that one some more and about some of the possible benefits/drawbacks.
It boils down to ownership issues. Because I want to carry the request between the pipeline policies, I need to bring the client with it.
@DevLazio I'm also interested to know if you have ever looked at Tower and the middleware use cases that their library brings. Both reqwest and several other clients have added integrations for Tower.
They have middleware integrations that allow doing a very similar thing to what I described with the HTTP pipeline policies, and they even have retry policies baked in. Although you would still have to define more policies for this specific use case, there is benefit to using Tower over a custom-built one.
If you get a chance, take a look at their main crate docs below and just read the first couple of paragraphs to see what I mean. I have looked at it a little bit and have tested doing an integration, but didn't get too far into it.
Tower main crate docs: https://docs.rs/tower/latest/tower/index.html
Example doc of Tower Retry: https://docs.rs/tower/latest/tower/retry/trait.Policy.html
That was actually one of the first things I looked at when I started the implementation, because I use Axum with a custom middleware for a project. I don't think it's possible to use a Tower middleware as-is inside graph-rs. So I looked into the implementation and found it way too complicated for what we needed.
That was actually one of the first things I looked at when I started the implementation, because I use Axum with a custom middleware for a project. I don't think it's possible to use a Tower middleware as-is inside graph-rs. So I looked into the implementation and found it way too complicated for what we needed.
What was the issue with Tower? I mean, I get that things can get complicated and it's not always necessary to implement more complex features.
I'll try to explain my reasoning for looking at Tower and possibly working with that crate.
In the future I would like to have the ability to use different HTTP request crates, such as hyper instead of reqwest, and others. The logic to implement HTTP policies and a pipeline would obviously have to take into account the different types of HTTP clients that could be used. So instead of duplicating the code and rewriting each part for each new client, it would be easier to have a system in place that can work for any client regardless. Tower does this pretty well as far as I can tell, and Tower integrations are often provided for these clients.
Really it's about having an abstraction in place to handle these two features, pipeline policies and multi-HTTP-client support, in a way that works well and doesn't require overly complex (possibly somewhat complex) and problematic code.
Now, how this is done is really up to how we write the code. We could potentially come up with an abstraction that does this and is simpler than Tower - but how much simpler I don't know. So my thinking is that if we can use something like Tower, then it would be beneficial because it solves both problems of needing HTTP retry policies and being able to use different HTTP clients.
So I am really just wondering if there are issues that you ran into attempting to use Tower, or if the complexity of it was just something that turned you away from it quickly. I understand either way. Just trying to gather more information.
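As a tiny illustration of the "any client" part (hypothetical helper, assuming tower's util feature for ServiceExt): code written against tower::Service has no idea whether the inner service wraps reqwest, hyper, or anything else.

use tower::{Service, ServiceExt};

// Generic over any Service: wait until it's ready, make one call, await the result.
async fn send_once<S, Req>(service: S, request: Req) -> Result<S::Response, S::Error>
where
    S: Service<Req>,
{
    service.oneshot(request).await
}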
I started with your example but could not make it compile without changing that.
It was needed to clone a struct containing a boxed trait object: https://stackoverflow.com/questions/30353462/how-to-clone-a-struct-storing-a-boxed-trait-object I've since iterated on the design and I don't think it's needed anymore, will remove it.
Yeah I mean I didn't run the code myself. I wrote it quickly to give an example.
I think the point of the Arc was for the cloning. Either way, you said that was not needed anymore, but I just wanted to mention that based on the article you linked. Was that the reason the method required the trait impl to be wrapped in an Arc to add it?
It boils down to ownership issues. Because I want to carry the request between the pipeline policies, I need to bring the client with it.
I am not understanding why that would be needed, though. The client is needed when the request is sent. The manipulation or creation of a policy shouldn't need the client. The policy will of course be used by the client at a later point in time, but that doesn't mean it's needed within this context. Maybe I'm failing to understand what you're saying, though?
That was actually one of the first things I looked at when I started the implementation, because I use Axum with a custom middleware for a project. I don't think it's possible to use a Tower middleware as-is inside graph-rs. So I looked into the implementation and found it way too complicated for what we needed.
What was the issue with Tower? I mean, I get that things can get complicated and it's not always necessary to implement more complex features.
I'll try to explain my reasoning for looking at Tower and possibly working with that crate.
In the future I would like to have the ability to use different HTTP request crates, such as hyper instead of reqwest, and others. The logic to implement HTTP policies and a pipeline would obviously have to take into account the different types of HTTP clients that could be used. So instead of duplicating the code and rewriting each part for each new client, it would be easier to have a system in place that can work for any client regardless. Tower does this pretty well as far as I can tell, and Tower integrations are often provided for these clients.
Really it's about having an abstraction in place to handle these two features, pipeline policies and multi-HTTP-client support, in a way that works well and doesn't require overly complex (possibly somewhat complex) and problematic code.
Now, how this is done is really up to how we write the code. We could potentially come up with an abstraction that does this and is simpler than Tower - but how much simpler I don't know. So my thinking is that if we can use something like Tower, then it would be beneficial because it solves both problems of needing HTTP retry policies and being able to use different HTTP clients.
So I am really just wondering if there are issues that you ran into attempting to use Tower, or if the complexity of it was just something that turned you away from it quickly. I understand either way. Just trying to gather more information.
I did not think you wanted to integrate Tower that much. I don't have the full picture of graph-rs, but it sounds like quite a big refactoring. I did turn away from Tower quite quickly because it felt unnecessary for a relatively simple problem.
I think the point of the Arc was for the cloning. Either way, you said that was not needed anymore, but I just wanted to mention that based on the article you linked. Was that the reason the method required the trait impl to be wrapped in an Arc to add it?
No idea.
It boils down to ownership issues. Because I want to carry the request between the pipeline policies, I need to bring the client with it.
I am not understanding why that would be needed, though. The client is needed when the request is sent. The manipulation or creation of a policy shouldn't need the client. The policy will of course be used by the client at a later point in time, but that doesn't mean it's needed within this context. Maybe I'm failing to understand what you're saying, though?
Ugh. I just tried to instantiate the client in the transport policy only and yeah, it works fine. Can't remember why I thought I needed to bring the Client along. I do remember being annoyed at having to do that, though...
I did not think you wanted to integrate Tower that much. I don't have the full picture of graph-rs, but it sounds like quite a big refactoring. I did turn away from Tower quite quickly because it felt unnecessary for a relatively simple problem.
I don't necessarily. I would rather just not have to implement it all by hand.
Ugh. I just tried to instantiate the client in the transport policy only and yeah, it works fine. Can't remember why I thought I needed to bring the Client along. I do remember being annoyed at having to do that, though...
Gotcha
I just pushed a rewrite of the pipeline to use Tower and its services. Two retry strategies are implemented: by number of attempts, and waiting for the delay specified in the "Retry-After" header.
@sreeise Let me know what you think of this, and if it sounds good I'll see if I can go further.
I'm just now seeing this. I've been busy with a bunch of different things lately. I'm going to look at this soon, within a few days to a week. This sounds awesome, though.
@DevLazio
This is looking good! Thanks for continuing to work on this. Do you want to write some examples of its use and put them in the examples folder?
Sure!
Added a few lines of documentation for the new settings and completed the "client_configuration" example. Usage of the new settings is rather straightforward, I believe.
Implement retry functionality for the client to retry requests. Cases that this implementation should solve and objectives:
Functionality:
Any other important use cases and implementation notes? Feedback welcome.
Update:
Implementation should include the following two retry types at a minimum; more can be added or updated later on as well:
- Exponential backoff retry: retry failed requests with an exponentially increasing delay, with user-configurable fields kept within internal limits.
- Throttle retry: when a 429 (Too Many Requests) response is returned, wait for the number of seconds given in the Retry-After header before retrying.
There may be cases where these conflict with each other, and we will need to come up with a way to deal with this, such as not using any other retry except the throttle retry when a 429 response code is returned.
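A rough sketch of one way that precedence could look (purely illustrative, names invented): the throttle retry wins whenever a 429 arrives with a Retry-After, and only other retryable failures fall through to exponential backoff.

use std::time::Duration;

enum RetryDecision {
    /// Honor the server-provided Retry-After delay (throttle retry).
    Throttle(Duration),
    /// No throttling hint; let the exponential backoff policy decide.
    Backoff,
    /// Not a retryable response at all.
    GiveUp,
}

fn decide(status: u16, retry_after_secs: Option<u64>) -> RetryDecision {
    match (status, retry_after_secs) {
        (429, Some(secs)) => RetryDecision::Throttle(Duration::from_secs(secs)),
        (429, None) | (503, _) | (504, _) => RetryDecision::Backoff,
        _ => RetryDecision::GiveUp,
    }
}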