@mflatt I pushed a commit on a new `s3-retry` branch. Could you please take a look and try it?
One challenge I have is that I never seem to get 50x responses from S3. Although I have what I feel are extensive unit tests, I'm not doing the volume of requests that you are in real use.
Also note I am trying to be good about `@history` in the Scribble docs. :)
p.s. @mflatt The commit is doing two things: retrying 50x responses, with an `s3-max-tries` param to limit the attempts, and exn handling in the workers for multipart uploads.
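In case it helps to see the shape, here's a minimal sketch of the retry loop. `call-with-retries` and `do-request` are invented names standing in for the branch's actual code:

```racket
#lang racket
;; Minimal sketch, not the code on the branch: retry 50x responses,
;; capped by an s3-max-tries parameter, with exponential backoff.
(define s3-max-tries (make-parameter 5))

(define (call-with-retries do-request)
  (let loop ([try 1])
    (define status (do-request))
    (cond
      [(< status 500) status]          ;; 2xx/3xx/4xx: no point retrying
      [(>= try (s3-max-tries)) status] ;; out of tries: give up
      [else
       (let ([secs (expt 2 (sub1 try))]) ;; back off: 1, 2, 4, ... seconds
         (log-warning "returned ~a, try ~a of ~a in ~a seconds"
                      status (add1 try) (s3-max-tries) secs)
         (sleep secs)
         (loop (add1 try)))])))
```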
Thanks for the history notes!
I managed to provoke a retry through my script, and it looks like it exposed a bug:
```
aws: GET http://logs.racket-lang.org.s3.amazonaws.com/root/2014-07-20.gz returned 500, try 2 of 5 in 1 seconds.
application: not a procedure;
 expected a procedure that can be applied to arguments
  given: 2
  arguments...:
   #<input-port:logs.racket-lang.org.s3.amazonaws.com>
   "HTTP/1.1 200 OK\r\nx-amz-id-2: DhQoLLf77Z+gYlFgTZMT+CBw0RiH6i2wmlncnIVndTacJ35muUl5rRlnotGTzP/SkGJ5Be8xpwE=\r\nx-amz-reque...
  context...:
   request/redirect/uri
   request/redirect/uri
```
I just pushed commit f12917c to fix that dumb mistake.
This time it worked for me – one 50x provoked and handled successfully just now.
Good to hear that the 50x retry now works as expected.
I'm less confident that the other change -- exn handling in the workers for multipart uploads -- is the right thing to do.
I might wait until you've had more experience using this branch and a chance to give me more feedback, before merging to master....
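For context, the pattern in question is roughly the following: each worker catches exns and hands them back to the coordinating thread, rather than dying silently. A sketch only, where `upload-part` is a stand-in name and not the package's actual API:

```racket
#lang racket
;; Sketch of the worker exn-handling idea, not the package's code.
;; Workers report either a result or a caught exn over a channel, so
;; the main thread can re-raise instead of failing silently.
(define (upload-all-parts parts upload-part)
  (define ch (make-channel))
  (for ([p (in-list parts)])
    (thread
     (lambda ()
       (channel-put ch
                    (with-handlers ([exn:fail? values]) ;; exn -> value
                      (upload-part p))))))
  ;; Note: results arrive in completion order; real code would also
  ;; track part numbers.
  (for/list ([_ (in-list parts)])
    (let ([r (channel-get ch)])
      (if (exn? r) (raise r) r))))
```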
On my most recent try, I got
```
HTTP 200 "OK". AWS Code="InternalError" Message="We encountered an internal error. Please try again."
 context...:
  /Users/mflatt/plt/extra-pkgs/aws/aws/s3.rkt:367:29
```
I don't know whether I've seen that one before. "200, but please try again" seems awkward to deal with.
Definitely awkward, but documented by Amazon for `delete-multiple` (your case), `copy`, and `complete-multipart-upload`. And I'm actually checking in all three cases, which is why you did get an `exn:fail:aws` instead of a silent failure. So that's good. But those 3 corner cases aren't detected by the general-purpose request/retry functions I just added, which don't (and shouldn't try to) understand this "200 No Not OK". I think I need to add special handling to retry in those 3 spots.
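Detecting that case means parsing the XML body even on a 200. A minimal sketch, assuming the `<Error><Code>...</Code></Error>` shape Amazon documents for these operations; `response-error-code` is an invented name:

```racket
#lang racket
;; Sketch, not the package's code: pull <Error><Code> out of a
;; response body that came back with HTTP 200 anyway.
(require xml xml/path)

(define (response-error-code body) ;; string -> string or #f
  (define x (xml->xexpr
             (document-element
              (read-xml (open-input-string body)))))
  (se-path* '(Error Code) x))

;; Example:
;; (response-error-code "<Error><Code>InternalError</Code></Error>")
;; => "InternalError"
```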
I made these changes and am waiting for the tests to run OK before committing and pushing.
Unfortunately my tests don't exercise this error handling unless S3 happens to respond with the error. I'm getting the uncomfortable feeling that I ought to be mocking S3 somehow. Uncomfortable because although mocks are probably the right thing to do (?), I really don't want to take a first step down that rabbit hole....
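If I ever did take that first step, the least invasive version is probably just making the HTTP call swappable. Every name below is invented for illustration, not the package's API:

```racket
#lang racket
;; Purely illustrative: make the request function a parameter so a
;; test can swap in a fake that serves 500s (or "200 but error"
;; bodies) on demand.
(define current-s3-request
  (make-parameter
   (lambda (method uri) (error 'current-s3-request "no real backend"))))

(define (fake-500 method uri)
  (values 500 "<Error><Code>InternalError</Code></Error>"))

;; In a test:
;; (parameterize ([current-s3-request fake-500])
;;   ... exercise the retry path ...)
```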
I just pushed commit 21c4222 on the `s3-retry` branch. Please let me know your experience.
It has worked for me on three tries, only one of which would have deleted anything. Unfortunately, I forgot to turn on warning logging, so I don't know whether there were retries at all.
Still working for me. But I have remembered to turn on warnings and haven't seen any retries this week.
Well of course. Now that we want retries to happen, they become a Heisenfeature. :)
And although I hear Amazon provided a testable moment yesterday, it was a Dynamo outage, not S3. :)
Thanks for keeping an eye on it. When I get some free time in the next day or two, I'd like to review my changes with now-less-familiar eyes. If I don't see any problems, and can't think of a better test strategy, I'll probably opt to go ahead and merge to master.
A run just now hit the "got 200 OK but error" case on a delete, and the automatic retry worked.
Thank you for your help testing! Merged to master.
As requested by @mflatt.

Also, consider attempting to retry after `exn:fail:network?`, maybe.
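If that happens, the wrapper might look something like this sketch; `with-network-retries` is an invented name and the backoff choice is arbitrary:

```racket
#lang racket
;; Sketch only: also retry connection-level failures, not just 50x
;; responses, by wrapping the request thunk.
(define (with-network-retries thunk #:tries [tries 5])
  (let loop ([try 1])
    (with-handlers ([exn:fail:network?
                     (lambda (e)
                       (when (>= try tries)
                         (raise e))                ;; out of tries: give up
                       (sleep (expt 2 (sub1 try))) ;; back off: 1, 2, 4 ...
                       (loop (add1 try)))])
      (thunk))))

;; Usage: (with-network-retries (lambda () (some-s3-call)))
```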