jclouds / legacy-jclouds

https://jclouds.apache.org

Amazon EC2 "503 request limit exceeded" errors #1214

Open andreaturli opened 11 years ago

andreaturli commented 11 years ago

When running against EC2, you can intermittently see:

org.jclouds.http.HttpResponseException: command:

POST https://ec2.eu-west-1.amazonaws.com/ HTTP/1.1 failed with response: HTTP/1.1 503 Service Unavailable; content: [Request limit exceeded.]

This issue was also reported and discussed at https://groups.google.com/forum/#!msg/jclouds-dev/WtNzfqtNfuE/PrYXsjP8RTYJ

spragues-trulia commented 11 years ago

Hi there. I'll chime in that this bug is hitting me as well: the more nodes you attempt to create, the more likely you are to receive the "request limit exceeded" error from EC2. For my part, I'm using Whirr as the wrapper over jclouds.

I did see reference to this back in May 2012 here: https://groups.google.com/forum/#!msg/jclouds-dev/MLYsvOS025o/n1CtL5yGhasJ

Does it look like there's any hope of solving this one?

codefromthecrypt commented 11 years ago

Hi there. There's a significant amount of work going on towards this in 1.6.

This is the larger issue about controlling commands better: https://github.com/jclouds/jclouds/issues/1089

This is in 1.6.0-alpha.2 and changes polling for active instances to use multi-id describe calls: https://github.com/jclouds/jclouds/commit/bd4f5cfba2d34a6e995e1c29cffc827979961cff

This is the start for OpenStack, where the issue arises more often. A similar exception coercion on 503 may be possible on EC2, depending on whether retryAfter information is available. If not, the only approach is further work on reducing calls: https://github.com/jclouds/jclouds/pull/1056
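As a rough illustration of what backing off on a 503 could look like on the client side (this is not jclouds API; the exception type and method names below are hypothetical):

```java
import java.util.concurrent.Callable;

// Hypothetical sketch only: retry a call with exponential backoff when EC2
// answers "503 Request limit exceeded". None of these names are jclouds API.
public class BackoffOn503 {

    // Thrown by the caller's own code when it sees the 503 response.
    public static class RequestLimitExceededException extends RuntimeException {
        public RequestLimitExceededException(String message) { super(message); }
    }

    public static <T> T callWithBackoff(Callable<T> call, int maxAttempts) throws Exception {
        long delayMs = 500;                               // initial hold-off
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (RequestLimitExceededException e) {
                if (attempt >= maxAttempts) {
                    throw e;                              // give up after maxAttempts
                }
                Thread.sleep(delayMs);                    // back off before retrying
                delayMs = Math.min(delayMs * 2, 30000);   // double the delay, cap at 30s
            }
        }
    }
}
```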

There's more to do, and this is not forgotten. I'll keep this open until it is sorted, guessing by March depending on if anyone helps.

spragues-trulia commented 11 years ago

Awesome Adrian. Thank you for the update!

spragues-trulia commented 11 years ago

Hi Adrian, just checking in to see if all is okay! Vitamins are being taken, pizza is still being delivered on time, crime is low in the neighborhood, and possibly, maybe, this work in 1.6 is proving doable? :)

cheers!

spragues-trulia commented 11 years ago

Hey Adrian. I imagine you are super busy, but is there any way you can throw me a bone on this one? Just looking for any kind of update.

demobox commented 11 years ago

Hi Stephen

Just looking for any kind of update.

Adrian will certainly be better placed to give a definitive update - just wondering whether you've had a chance to test recently using the latest 1.6.0-rc.1 release? That should include some of the changes referenced in this issue, and perhaps is already helping improve the situation a little.

spragues-trulia commented 11 years ago

Hi, thanks for replying. I'm actually using Apache Whirr (which uses the jclouds libs), and after giving it my best shot I've found that the current release of that doesn't work with the newest release of this. So now I've got to regroup and figure out where to go from here.

Thanks!

demobox commented 11 years ago

current release of that doesn't work with the newest release of this

Ack. Sorry to hear, Stephen. @abayer: I see a patch to upgrade to 1.5.8. Is there a chance of looking at 1.6.0-rc.1?

tralfamadude commented 11 years ago

I've been wanting this one for a long time since I have experience generating 20 4-node clusters. I avoid HTTP 503 (Request limit exceeded) by ad hoc timing at the top level (time between consecutive cluster creates) and by only doing one cluster create at a time. This is not ideal since it requires that I leave plenty of slack to avoid an avalanche and I want the clusters ready for a deadline.

Worst case scenario: everyone and everything tries harder to contact the AWS EC2 services. In 2011, my client was locked out of the EC2 API and the AWS Console for 6 hours. I have not seen that again, so I hope AWS realized their user-experience error and fixed it.

Note that getting a 503 means command-line Whirr will quit and the allocated nodes will be completely inaccessible in the short term. I call these "orphaned nodes", and human attention is needed to clean them up. (I am adding auto-destruct timers to my custom AMIs to clean these up.)

AWS EC2 will fulfill API requests, but slow them down (increase average latency), before it kills the conversation with HTTP 503. The duration of a list-instances or tag-instances request increases dramatically (10x to 50x) before the HTTP 503, so detecting a 3x increase in latency should be sufficient to trigger "extra patience".
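A rough sketch of that latency-based detection (illustrative only; the class, method names, and smoothing below are not jclouds code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: time each describe/tag call against a running baseline
// and flag likely throttling once a call takes roughly 3x longer than normal.
public class ThrottleDetector {
    private volatile long baselineNanos = -1;

    public <T> T timed(Callable<T> call) throws Exception {
        long start = System.nanoTime();
        T result = call.call();
        long elapsed = System.nanoTime() - start;

        if (baselineNanos < 0) {
            baselineNanos = elapsed;                 // first call establishes the baseline
        } else if (elapsed > 3 * baselineNanos) {
            onThrottlingSuspected(elapsed);          // 3x slowdown: apply "extra patience"
        } else {
            baselineNanos = (baselineNanos * 9 + elapsed) / 10;  // smooth normal variation
        }
        return result;
    }

    protected void onThrottlingSuspected(long elapsedNanos) {
        System.err.printf("call took %d ms; suspect EC2 throttling%n",
                TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}
```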

I don't know if OpenStack does the same, but the source code is available. What I would like to see is REST services adding HTTP headers to responses that provide "hold off for N secs" meta-information. I've done this in REST services I've implemented, adding detailed error messages since HTTP codes can have multiple meanings.
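As a generic illustration of that "hold off for N secs" idea on the service side, using plain JAX-RS (not jclouds or EC2 code; the class and method names are made up):

```java
import javax.ws.rs.core.Response;

// Hypothetical sketch: answer 503 with a Retry-After header and a detailed
// message, so well-behaved clients know how long to hold off.
public class ThrottledResponses {
    public static Response requestLimitExceeded(int holdOffSeconds) {
        return Response.status(503)
                .header("Retry-After", String.valueOf(holdOffSeconds))
                .entity("Request limit exceeded; retry after " + holdOffSeconds + "s")
                .build();
    }
}
```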

Implementation Comments:

  1. API operation-specific retry timing helps (especially if the number of open sockets counts against you), but does not address the base request rate.
  2. Eliminating unnecessary requests is essential, of course.
  3. Being able to recover from a 503 would be wonderful, but might be difficult since assessing remote state requires making more requests.
  4. It should be possible to detect throttling before the HTTP 503, at least for AWS EC2. "Extra patience" could be implemented as a globally enforced hold on AWS EC2 requests for N seconds (settable property), perhaps with a multiplier like 1.3 for each incident.

Obviously, dynamic anti-throttling requires a centralized flow control mechanism that can sleep before making a request when the "Extra patience" countdown timer is > 0. Also, there should be no adverse overhead if this mechanism is disabled.
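A minimal sketch of such a gate (illustrative only; none of these names exist in jclouds):

```java
// Hypothetical sketch: one gate that all EC2 requests pass through. When an
// incident is reported, every caller sleeps until the hold expires, and the
// hold length grows by a multiplier (e.g. 1.3) on each new incident.
public class ExtraPatienceGate {
    private final Object lock = new Object();
    private long holdUntilMillis = 0;       // requests sleep until this time
    private double holdSeconds;             // current hold length
    private final double multiplier;

    public ExtraPatienceGate(double initialHoldSeconds, double multiplier) {
        this.holdSeconds = initialHoldSeconds;
        this.multiplier = multiplier;
    }

    // Call before every EC2 request; costs nothing while no hold is active.
    public void awaitClearance() throws InterruptedException {
        long wait;
        synchronized (lock) {
            wait = holdUntilMillis - System.currentTimeMillis();
        }
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    // Call when throttling is detected (slow responses or an actual 503).
    public void reportIncident() {
        synchronized (lock) {
            holdUntilMillis = System.currentTimeMillis() + (long) (holdSeconds * 1000);
            holdSeconds *= multiplier;
        }
    }
}
```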


charlesmunger commented 11 years ago

I just tested using the 1.6.0-rc.1 release, and I'm getting this error when jclouds attempts to customize my nodes.

I requested 32 cc2.8xlarge instances, placed in a placement group. I can split up the deployment, but that's problematic since occasionally Amazon doesn't have the capacity to put them all in one group, so I end up with half the nodes I need already reserved.

EDIT: Further investigation reveals that the problem doesn't manifest when I set: properties.setProperty(AWSEC2Constants.PROPERTY_EC2_GENERATE_INSTANCE_NAMES, "false");
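For anyone building a compute context directly (rather than through Whirr), a rough sketch of applying that override; the credentials are placeholders, and the exact package of AWSEC2Constants may vary between jclouds versions:

```java
import java.util.Properties;

import org.jclouds.ContextBuilder;
import org.jclouds.aws.ec2.reference.AWSEC2Constants;
import org.jclouds.compute.ComputeService;
import org.jclouds.compute.ComputeServiceContext;

public class Ec2WorkaroundExample {
    public static void main(String[] args) {
        // Disable automatic instance-name generation, per the workaround above.
        Properties overrides = new Properties();
        overrides.setProperty(AWSEC2Constants.PROPERTY_EC2_GENERATE_INSTANCE_NAMES, "false");

        ComputeServiceContext context = ContextBuilder.newBuilder("aws-ec2")
                .credentials("accessKeyId", "secretAccessKey")   // placeholders
                .overrides(overrides)
                .buildView(ComputeServiceContext.class);
        try {
            ComputeService compute = context.getComputeService();
            // ... create nodes / placement groups as usual ...
        } finally {
            context.close();
        }
    }
}
```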

spragues-trulia commented 11 years ago

Well... I've pretty much given up on this, but I am kinda curious if it ever got resolved. Something tells me no.

codefromthecrypt commented 11 years ago

Well, to add context, many of us have been busy getting jclouds ready for its transition into Apache, something that displaces time for issues like this, for a long-term greater good. Please follow https://github.com/jclouds/jclouds/issues/1576 and open a JIRA on the Apache Incubator jclouds project as soon as it is up.

charlesmunger commented 11 years ago

Did you try it with the workaround I posted above?

spragues-trulia commented 11 years ago

@adriancole - that's good to hear. cool.

@charlesmunger - yeah, that looks like Java code. I'm a layer or two above that, as I access the jclouds libs via Apache Whirr, and it's not clear to me how to influence that setting from there. I will nose around, though. Thanks.