I believe we need to implement not only the circuit breaker pattern but also timeouts and retries for the calls to edX.
All of this should be implemented in the edx-api-client library, which should raise proper exceptions to be handled inside micromasters.
Part 1: Circuit breakers
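Implementing the circuit breaker pattern is pretty straightforward using pybreaker (see docs): we create a breaker and then decorate the functions that make the actual calls, without changing anything else.
This will raise the usual error for the first failed attempts (fail_max is 5 in this example) and then a CircuitBreakerError once the circuit breaker opens, or when a trial call fails.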
We should also mask the different exceptions that need to be raised.
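One way the masking could look, as a sketch only: `EdxApiError`, `EdxUnavailableError`, `mask_exceptions` and `fetch_grades` are hypothetical names, and built-in exceptions stand in for the ones we would really map (probably pybreaker's CircuitBreakerError and the requests exceptions).

```python
import functools


class EdxApiError(Exception):
    """Hypothetical base class for errors raised by edx-api-client."""


class EdxUnavailableError(EdxApiError):
    """Hypothetical error: edX cannot be reached right now."""


def mask_exceptions(mapping):
    """Translate low-level exceptions into client exceptions.

    `mapping` maps a low-level exception type to the client exception to raise.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except tuple(mapping) as exc:
                for low_level, masked in mapping.items():
                    if isinstance(exc, low_level):
                        # Keep the original error as __cause__ for debugging.
                        raise masked(str(exc)) from exc
                raise
        return wrapper
    return decorator


@mask_exceptions({ConnectionError: EdxUnavailableError, TimeoutError: EdxUnavailableError})
def fetch_grades():
    # Stand-in for a real call that times out.
    raise TimeoutError("no answer from edX")
```

With this, micromasters would only need to catch `EdxApiError` subclasses and would not depend on pybreaker or requests directly.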
The hard part of implementing this pattern is figuring out the right values for fail_max and reset_timeout. We should probably monitor the calls to the different edX API endpoints first and then decide what is acceptable for us and our users.
Something I have not decided yet is whether it is better to create one circuit breaker per edX endpoint or a single global one. Both solutions have pros and cons.
Part 2: Timeouts
Introducing timeouts in the edx-api-client is, again, pretty straightforward: we use the requests library, which lets us specify a timeout for each call (see docs).
In our case, a possible implementation might be:
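A minimal sketch, assuming a hypothetical `get_json` helper and placeholder timeout values (requests accepts either a single number or a `(connect, read)` tuple for `timeout`):

```python
import requests

# Placeholder values: (connect timeout, read timeout) in seconds.
DEFAULT_TIMEOUT = (3.05, 10)


def get_json(session, url, timeout=DEFAULT_TIMEOUT):
    """Hypothetical helper: GET a URL and fail fast instead of hanging forever."""
    # Without a timeout, requests waits indefinitely for the server.
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


def get_json_or_none(session, url):
    """Hypothetical caller: treat a timeout as 'no data' instead of crashing."""
    try:
        return get_json(session, url)
    except requests.exceptions.Timeout:
        return None
```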
Again, the hard part here is figuring out how long a wait is acceptable for us and our users.
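One possible starting point, once we have latency measurements, is to take a high percentile of the observed response times and add some headroom. This is only a heuristic assumed here for illustration: `percentile`, `suggest_timeout`, the percentile and the margin are made-up names and values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(latencies, pct=99, margin=1.5):
    """Suggest a timeout: a high percentile of observed latencies plus headroom."""
    return percentile(latencies, pct) * margin
```

The measured latencies would come from the monitoring of the edX endpoints; the result is only a candidate value to discuss, not a recommendation.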
Part 3: Retries
Retries in the requests library are not really difficult to implement, given that we already use sessions in the edx-api-client (see docs):
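>>> import requests
>>> s = requests.Session()
>>> # an integer max_retries only covers connection-level failures (DNS, connect, connect timeout)
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)
>>> # edX is served over HTTPS, so mount the adapter there too
>>> s.mount('https://', a)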
If we want to limit the retries to specific HTTP errors, it is slightly more complicated (see this example).
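A sketch of how that could be done with urllib3's `Retry` class, which `HTTPAdapter` accepts as `max_retries`. The `build_session` name, the status codes and the backoff value are assumptions for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total=3, statuses=(502, 503, 504)):
    """Hypothetical factory: a session that retries only on selected HTTP statuses."""
    retry = Retry(
        total=total,                # maximum number of retries per request
        status_forcelist=statuses,  # HTTP statuses that trigger a retry
        backoff_factor=0.5,         # wait a bit longer before each new attempt
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

By default urllib3 applies status-based retries only to idempotent methods such as GET, so POST calls are not repeated unless we opt in explicitly.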
For the retries, it might be a bit complicated to figure out how to combine this parameter with the failure limit that opens the circuit breaker: all the failed retry requests count as failures, so a single call might open the circuit breaker.
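To make the interaction concrete, here is a small back-of-the-envelope helper. It assumes consecutive failures and a breaker that sees every attempt. Whether that holds depends on where the retry loop sits relative to the breaker, and `calls_until_open` is a hypothetical name.

```python
import math


def calls_until_open(fail_max, max_retries, count_each_attempt=True):
    """Return how many consecutive failing calls it takes to open the breaker.

    count_each_attempt=True models a breaker that sees every retry as a failure;
    False models a breaker that only sees the final outcome of each call.
    """
    failures_per_call = 1 + max_retries if count_each_attempt else 1
    return math.ceil(fail_max / failures_per_call)
```

For example, with fail_max=3 and max_retries=3 a single failing call contributes 4 failures and opens the breaker by itself, while fail_max=5 takes two failing calls.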