Whenever we're verifying an SNS message, we have to fetch the public certificate from an HTTP URL provided to us by Amazon. If fetching this fails for any reason, we error out and rely on SNS retrying the request to get it accurately recorded.
We can do better!
There are two possible strategies I can think of here, and the right answer might be to use one or the other, or both.
Cache the public key.
The HTTP response at the URL does not indicate that it can be cached; however, on the AWS forums AWS has indicated that if/when they change the certificate they will use a different URL. That means one option here is to just cache the signing certificate for a long time. This could either be a simple in-memory cache (in which case we will refetch it any time we restart the process) or use Redis to store the cached signing certificate so that the cache survives restarts, is shared amongst processes, etc.
This cache should expire somehow, probably via some sort of LRU that keeps some number of keys but evicts older ones when needed.
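To make the in-memory variant concrete, here is a minimal sketch of an LRU cache keyed by the certificate URL. The class name `CertificateCache`, the `fetch` callable, and the `maxsize` default are all made up for illustration; this is not existing Warehouse code, and a Redis-backed version would need a different storage layer.

```python
from collections import OrderedDict
from typing import Callable


class CertificateCache:
    """Hypothetical in-memory LRU cache keyed by certificate URL."""

    def __init__(self, fetch: Callable[[str], bytes], maxsize: int = 8):
        # `fetch` is whatever function actually downloads the certificate.
        self._fetch = fetch
        self._maxsize = maxsize
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> bytes:
        if url in self._entries:
            # Mark as most recently used and skip the network entirely.
            self._entries.move_to_end(url)
            return self._entries[url]

        # Errors from fetch propagate; a failed fetch is never cached.
        cert = self._fetch(url)
        self._entries[url] = cert

        # Evict the least recently used entry once we exceed maxsize.
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)
        return cert
```

Keying on the URL leans on the AWS statement that a new certificate gets a new URL, so a cached entry should never go stale; eviction only bounds memory use.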
Add retries.
Whenever we get an error, simply try fetching it again! This will make the HTTP request take longer, and it's possible that whatever network error is affecting us will last longer than we're willing to have a single request take, so it doesn't eliminate the problem, but it makes us survive momentary blips better.
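A rough sketch of what the retry idea could look like, assuming a bounded number of attempts with exponential backoff between them. The function name `fetch_with_retries`, its parameters, and the backoff policy are assumptions for illustration, not what Warehouse actually does.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # Back off before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))

    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Capping `attempts` keeps the total request time bounded, which is the trade-off described above: a longer outage still fails, but a momentary blip is absorbed.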
My opinion is I'd start with caching, ideally with a Redis-based cache, and see where that leaves us. It will likely make the failures infrequent enough to not be worth worrying about, and it will make verifying the signature faster as well.
With retries and https://github.com/pypa/warehouse/pull/4526 this is largely done. I'm going to leave this open because I believe that adding caching here would still be a good step.