greghendershott / aws

Racket support for Amazon Web Services.
BSD 2-Clause "Simplified" License
78 stars · 26 forks

S3: Handle 50x responses with retries #50

Closed · greghendershott closed this 8 years ago

greghendershott commented 9 years ago

As requested by @mflatt

Also, maybe consider attempting a retry after an exn:fail:network? exception.
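Sketched out, that idea might look something like the following. This is only an illustration, not anything in the package: `with-network-retry` and its `#:tries` argument are hypothetical names.

```racket
#lang racket/base
;; Hypothetical sketch: run a request thunk, and if the connection itself
;; fails with exn:fail:network, try again up to `tries` times total.
;; Not the package's actual API.
(define (with-network-retry thunk #:tries [tries 2])
  (let loop ([try 1])
    (with-handlers ([exn:fail:network?
                     (lambda (e)
                       (if (< try tries)
                           (loop (add1 try)) ; try again
                           (raise e)))])     ; out of tries: re-raise
      (thunk))))
```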

greghendershott commented 9 years ago

@mflatt I pushed a commit on a new s3-retry branch. Could you please take a look, and try?

One challenge I have is that I never seem to get 50x responses from S3. Although I have what I feel are extensive unit tests, I'm not doing the volume of requests that you are in real use.

Also note I am trying to be good about @history in the Scribble docs. :)

greghendershott commented 9 years ago

p.s. @mflatt The commit is doing two things:

  1. Retries on certain 50x response codes. See the s3-max-tries parameter to limit the number of attempts.
  2. The worker pool used for multipart uploads now catches exns and returns the failed jobs to the todo list. I'm not sure what you think about this. One catch is that, as it stands, there is no limit on retries for this part. My thinking is that's OK, because you'd want it to keep trying until you break or kill it. But I'm curious what you think?
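To make point 1 concrete, here is a minimal sketch of the retry-with-backoff idea -- not the actual code on the branch. `call-with-retries`, `retryable?`, and `try-limit` are hypothetical names standing in for the real max-tries parameter mentioned above:

```racket
#lang racket/base
;; Hypothetical sketch of retrying transient 50x responses with
;; exponential backoff. Not the code actually on the branch.

(define try-limit (make-parameter 5)) ;; stand-in for the max-tries parameter

(define (retryable? status-code)
  ;; Retry only transient server-side failures.
  (and (memv status-code '(500 502 503 504)) #t))

(define (call-with-retries request-thunk) ; request-thunk : -> HTTP status code
  (let loop ([try 1])
    (define status (request-thunk))
    (cond
      [(< status 500) status]              ; success or client error: give up retrying
      [(and (retryable? status) (< try (try-limit)))
       (sleep (expt 2 (sub1 try)))         ; back off 1, 2, 4, ... seconds
       (loop (add1 try))]
      [else (error 'call-with-retries
                   "gave up after ~a tries: HTTP ~a" try status)])))
```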
mflatt commented 9 years ago

Thanks for the history notes!

I managed to provoke a retry through my script, and it looks like it exposed a bug:

aws: GET http://logs.racket-lang.org.s3.amazonaws.com/root/2014-07-20.gz returned 500, try 2 of 5 in 1 seconds.
application: not a procedure;
 expected a procedure that can be applied to arguments
  given: 2
  arguments...:
   #<input-port:logs.racket-lang.org.s3.amazonaws.com>
   "HTTP/1.1 200 OK\r\nx-amz-id-2: DhQoLLf77Z+gYlFgTZMT+CBw0RiH6i2wmlncnIVndTacJ35muUl5rRlnotGTzP/SkGJ5Be8xpwE=\r\nx-amz-reque...
  context...:
   request/redirect/uri
   request/redirect/uri
greghendershott commented 9 years ago

I just pushed commit f12917c to fix that dumb mistake.

mflatt commented 9 years ago

This time it worked for me – one 50x provoked and handled successfully just now.

greghendershott commented 9 years ago

Good to hear that the 50x retry now works as expected.

I'm less confident that the other change -- exn handling in the workers for multipart uploads -- is the right thing to do.

I might wait until you've had more experience using this branch and a chance to give me more feedback, before merging to master....

mflatt commented 9 years ago

On my most recent try, I got

HTTP 200 "OK". AWS Code="InternalError" Message="We encountered an internal error. Please try again." context...: /Users/mflatt/plt/extra-pkgs/aws/aws/s3.rkt:367:29

I don't know whether I've seen that one before. "200, but please try again" seems awkward to deal with.

greghendershott commented 9 years ago

Definitely awkward, but documented by Amazon for delete-multiple (your case), copy, and complete-multipart-upload. And I'm actually checking in all three cases, which is why you did get an exn:fail:aws instead of a silent failure. So that's good. But those three corner cases aren't detected by the general-purpose request/retry functions I just added, which don't (and shouldn't try to) understand this "200 but not actually OK" response. I think I need to add special handling to retry in those three spots.
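For illustration only, the special handling could be sketched roughly like this. `call/retry-200-error`, `body-reports-internal-error?`, and `max-200-retries` are hypothetical names, and real code would parse the XML response rather than search the string:

```racket
#lang racket/base
;; Hypothetical sketch: detect Amazon's "HTTP 200 but the XML body reports
;; InternalError" case and retry. Not the library's actual implementation.
(require racket/string)

(define max-200-retries 5)

(define (body-reports-internal-error? body)
  ;; Amazon signals the failure inside an otherwise-200 response body.
  (string-contains? body "<Code>InternalError</Code>"))

(define (call/retry-200-error do-request) ; do-request : -> response body string
  (let loop ([try 1])
    (define body (do-request))
    (cond
      [(not (body-reports-internal-error? body)) body]
      [(< try max-200-retries)
       (sleep (expt 2 (sub1 try)))        ; back off before trying again
       (loop (add1 try))]
      [else (error 'call/retry-200-error
                   "InternalError persisted after ~a tries" try)])))
```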

greghendershott commented 9 years ago

I made these changes and am waiting for the tests to run OK before committing and pushing.

Unfortunately my tests don't exercise this error handling unless S3 happens to respond with the error. I'm getting the uncomfortable feeling that I ought to be mocking S3 somehow. Uncomfortable because although mocks are probably the right thing to do (?), I really don't want to take a first step down that rabbit hole....

greghendershott commented 9 years ago

I just pushed commit 21c4222 on the s3-retry branch. Please let me know your experience.

mflatt commented 9 years ago

It has worked for me on three tries, only one of which would have deleted anything. Unfortunately, I forgot to turn on warning logging, so I don't know whether there were retries at all.

mflatt commented 8 years ago

Still working for me. But I have remembered to turn on warnings and haven't seen any retries this week.

greghendershott commented 8 years ago

Well of course. Now that we want retries to happen, they become a Heisenfeature. :)

And although I hear Amazon provided a testable moment yesterday, it was a Dynamo outage, not S3. :)

Thanks for keeping an eye on it. When I get some free time in the next day or two, I'd like to review my changes with now-less-familiar eyes. If I don't see any problems, and can't think of a better test strategy, I'll probably opt to go ahead and merge to master.

mflatt commented 8 years ago

A run just now hit the "got 200 OK but error" case on a delete, and the automatic retry worked.

greghendershott commented 8 years ago

Thank you for your help testing! Merged to master.