Closed: mkhan037 closed this issue 3 years ago
Are you able to reproduce this issue consistently, or is it intermittent?
Is there something special about requesting 25 instances, or do you also see the issue with smaller clusters?
I tried to reproduce this issue by running the same command 4 more times. However, it did not happen again. I did encounter issue #340 one of those times.
As for the number 25, there is nothing special about it other than that I was trying to reproduce issue #340 with more than 20 nodes, since it was mentioned there that launching more than 20 nodes can trigger that issue. I did not encounter any issues with small clusters.
Strange. If this happens to you again and you get some more detail on what the cause is, please post it here.
From your original report, I can see that the initial problem is this:
botocore.exceptions.ClientError: An error occurred (DependencyViolation) when calling the
DeleteSecurityGroup operation: resource sg-0abb9850b14714211 has a dependent object
I'm guessing that happened only once because the cause is related to AWS resource consistency. Maybe Flintrock deleted something that sg-0abb9850b14714211 depends on, but there was a brief moment when that change didn't propagate sufficiently throughout AWS. So when Flintrock went to delete sg-0abb9850b14714211 itself, we got this dependency violation.
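To make that guess concrete, here is a minimal, hypothetical sketch (not Flintrock's actual code) of one shape the race could take: a rule in a second, made-up security group references the cluster group, Flintrock revokes that rule, and then deletes the cluster group before the revocation has propagated.

```python
# Hypothetical sketch of the suspected race; other_sg is a made-up group whose
# ingress rule references the cluster group (i.e. it is a "dependent object").
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

cluster_sg = "sg-0abb9850b14714211"   # group from the traceback above
other_sg = "sg-0123456789abcdef0"     # hypothetical dependent group

# Step 1: remove the rule in other_sg that references cluster_sg.
ec2.revoke_security_group_ingress(
    GroupId=other_sg,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": cluster_sg}],
    }],
)

# Step 2: delete cluster_sg. If the revocation from step 1 has not yet
# propagated throughout AWS, this call can still fail with DependencyViolation.
try:
    ec2.delete_security_group(GroupId=cluster_sg)
except ClientError as e:
    print(e.response["Error"]["Code"])  # e.g. "DependencyViolation"
```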
That's just a guess. But if that was the issue, then I suppose a general solution would be to adjust some boto settings to retry operations more times before failing (though I don't think those settings apply to DependencyViolation errors), or perhaps to add waiters on specific operations that are known to take time to propagate out, like deleting a security group dependency (e.g. another security group, or a security group rule).
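As a rough illustration of those two ideas (the helper below is hypothetical, not part of Flintrock or boto3): boto's retry count can be raised via botocore's Config, and a small waiter-style loop can keep retrying DeleteSecurityGroup while the DependencyViolation persists.

```python
# Sketch of both ideas; delete_security_group_with_retries is a hypothetical
# helper, not an existing Flintrock or boto3 function.
import time

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Idea 1: ask botocore to retry transient API errors more times.
# (As noted above, this likely does not cover DependencyViolation.)
ec2 = boto3.client(
    "ec2",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)

# Idea 2: a waiter-style loop that keeps retrying the delete while AWS still
# reports a dependent object, assuming the dependency is just slow to
# propagate and will clear on its own.
def delete_security_group_with_retries(group_id, attempts=6, delay=5):
    for _ in range(attempts):
        try:
            ec2.delete_security_group(GroupId=group_id)
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "DependencyViolation":
                raise  # a different error; don't mask it
            time.sleep(delay)  # wait for the dependency change to propagate
    raise RuntimeError(f"Dependency on {group_id} never cleared")

# Example with the group from the original traceback:
# delete_security_group_with_retries("sg-0abb9850b14714211")
```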
If I encounter this issue again, I will surely let you know. Thanks again for this awesome tool.
Thanks for the report!
When launching a 25-node t2.micro Spark cluster to check whether issue #340 also happens in our case, I ran into a different problem. Even though I specified the number of slaves as 24, Flintrock launched a total of 23 nodes at first. When I tried to destroy the cluster immediately, it hit an error, and subsequent calls to destroy showed that the cluster had 2 running nodes and kept failing to destroy them. I terminated the instances from the AWS console, after which flintrock destroy did not throw an error.
Note that my current running-instance vCPU limit is 32; however, that should not be an issue, as the launched cluster used 25 vCPUs in total.
(see attached log)