nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock hangs when querying checkip.amazonaws.com #244

Closed reisfe closed 4 years ago

reisfe commented 6 years ago

I'm trying to launch a cluster, but flintrock "hangs" (does not return to the terminal) and does not return any output or error (even with --debug). From what I see in the AWS Console the instances are not being created.

Is there a way to get more info from flintrock to debug this problem?

I tried both an installation via pip and the standalone version from GitHub, with the same result. The options --version and --help give the output one would expect. The command configure also works ok.

Changing some config parameters:
Giving the wrong ec2-identity-file -> returns error
Giving the wrong key-name -> flintrock hangs
Giving non-existing region -> returns error
Giving wrong ami -> flintrock hangs
Giving non-existing instance-profile-name -> flintrock hangs

Accessing AWS with AWS-CLI works fine, so credentials should be properly configured.

The botocore version is not the one required in setup.py (1.7.36), but I suppose the standalone version comes with its own version of botocore.
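On the question of getting more information out of a silent hang: since Flintrock is a Python program, pressing Ctrl-C while it is stuck will usually print a traceback showing the call it was blocked in. A more systematic option is the standard library's faulthandler module, which can dump every thread's stack after a delay. The sketch below is a generic illustration, not part of Flintrock; the helper names watch_for_hang and stop_watching are made up for this example, and it assumes you can add the call near the start of the program you are debugging.

```python
import faulthandler
import sys


def watch_for_hang(seconds, stream=sys.stderr):
    """Dump all thread stacks to `stream` every `seconds` while the program runs.

    `stream` must be a real file (it needs a file descriptor), for example
    sys.stderr or an open log file. Hypothetical helper, not a Flintrock API.
    """
    faulthandler.dump_traceback_later(seconds, repeat=True, file=stream)


def stop_watching():
    """Cancel the pending dump once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling watch_for_hang(30) before the slow operation and stop_watching() after it prints a stack dump every 30 seconds for as long as the operation is stuck, which shows the blocked call without waiting for it to finish.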

nchammas commented 6 years ago

> Python version: 2.7.14

Flintrock requires Python 3.4+. I'm surprised that Flintrock just hangs though.

What version of setuptools do you have? We enforce the version check in setup.py.

And do you still have issues if you run Flintrock with Python 3.4+?

reisfe commented 6 years ago

I have both Python 2 and Python 3 installed, but forgot that flintrock uses Python 3, so I gave the wrong version numbers above. The Python 3 version is 3.6.5 and the setuptools version is 39.0.1. See the output of pip3 below.

$ pip3 install --upgrade flintrock
Requirement already up-to-date: flintrock in /usr/local/lib/python3.6/site-packages
Requirement already up-to-date: paramiko==2.1.1 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: boto3==1.4.4 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: cryptography>=1.7.2 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: botocore==1.5.10 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: python-dateutil>=2.5.3 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: click==6.7 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: PyYAML==3.12 in /usr/local/lib/python3.6/site-packages (from flintrock)
Requirement already up-to-date: pyasn1>=0.1.7 in /usr/local/lib/python3.6/site-packages (from paramiko==2.1.1->flintrock)
Requirement already up-to-date: s3transfer<0.2.0,>=0.1.10 in /usr/local/lib/python3.6/site-packages (from boto3==1.4.4->flintrock)
Requirement already up-to-date: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3==1.4.4->flintrock)
Requirement already up-to-date: idna>=2.1 in /usr/local/lib/python3.6/site-packages (from cryptography>=1.7.2->flintrock)
Requirement already up-to-date: asn1crypto>=0.21.0 in /usr/local/lib/python3.6/site-packages (from cryptography>=1.7.2->flintrock)
Requirement already up-to-date: six>=1.4.1 in /usr/local/lib/python3.6/site-packages (from cryptography>=1.7.2->flintrock)
Requirement already up-to-date: cffi>=1.7; platform_python_implementation != "PyPy" in /usr/local/lib/python3.6/site-packages (from cryptography>=1.7.2->flintrock)
Requirement already up-to-date: docutils>=0.10 in /usr/local/lib/python3.6/site-packages (from botocore==1.5.10->flintrock)
Requirement already up-to-date: pycparser in /usr/local/lib/python3.6/site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=1.7.2->flintrock)

The versions of boto3, botocore and paramiko are lower than those currently required in the repository, but I suppose the requirements were changed after the release.

reisfe commented 6 years ago

After installing flintrock on my Mac, and waiting for a long time (5m5.931s), I got the following error message.

$ flintrock launch my_cluster
Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    load_entry_point('Flintrock==0.9.0', 'console_scripts', 'flintrock')()
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/flintrock/flintrock.py", line 1132, in main
    cli(obj={})
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/flintrock/flintrock.py", line 403, in launch
    tags=ec2_tags)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/flintrock/ec2.py", line 883, in launch
    instance_profile_arn = iam.InstanceProfile(instance_profile_name).arn
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/boto3/resources/factory.py", line 339, in property_loader
    self.load()
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/boto3/resources/factory.py", line 505, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/Cellar/flintrock/0.9.0/libexec/lib/python3.6/site-packages/botocore/client.py", line 543, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:iam::861291133300:user/Flintrock is not authorized to perform: iam:GetInstanceProfile on resource: instance profile EC2AccessToS3b

So it's a problem with the permissions of the IAM user. I wonder if it's normal that it takes so much time.

reisfe commented 6 years ago

I think it would be good if flintrock captures this exception and returns to the user gently, because it should be normal that one would restrict the permissions of the AWS IAM user launching resources. Adding to the README what permissions flintrock requires would also be great.

reisfe commented 6 years ago

I got the same error in my linux box after 8m34.404s. I guess I just have not been patient enough...

reisfe commented 6 years ago

Success!!! I managed to launch my first spark cluster in the cloud. Thank you very much for flintrock. It's exactly what I was looking for and I was ready to program such tool myself. You saved me countless hours.

Even if successful, it takes 5+ minutes for the message "Launching x instances..." to appear. It's not clear where the problem is, in flintrock or in AWS.

nchammas commented 6 years ago

> So it's a problem with the permissions of the IAM user. I wonder if it's normal that it takes so much time.

Glad you found the issue. It's definitely not normal to take so much time. In fact, I just tried on my side to revoke my own permissions and work with Flintrock, and it fails instantly. No idea why it takes several minutes to do the same for you. Maybe it's something about the region you're in?

> I think it would be good if flintrock captures this exception and returns to the user gently

Since there are so many ways something could fail on AWS, even as far as permissions are concerned, it's good to forward along to the user the full error so that they can debug what's going on. Perhaps we could suppress the lengthy traceback, but it's not obvious how to do that without losing potentially useful information about what call specifically is failing.
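One way to get a shorter message without throwing the details away is to print a one-line summary that still names the failing call, and keep the full traceback behind a debug switch. The following is only a sketch under that assumption, not how Flintrock handles errors; run_with_short_errors and its parameters are hypothetical names.

```python
import sys
import traceback


def run_with_short_errors(func, debug=False, stream=sys.stderr):
    """Run func() and return an exit status: 0 on success, 1 on failure.

    On failure, print either the full traceback (debug=True) or a one-line
    summary that includes the function and line where the error was raised.
    Hypothetical helper for illustration only.
    """
    try:
        func()
    except Exception as e:
        if debug:
            traceback.print_exc(file=stream)
        else:
            # The last traceback entry is the innermost frame, i.e. the
            # call that actually raised the exception.
            frame = traceback.extract_tb(e.__traceback__)[-1]
            print(
                "Error: {}: {} (raised in {}, line {})".format(
                    type(e).__name__, e, frame.name, frame.lineno),
                file=stream)
            print("Re-run with debugging enabled for the full traceback.",
                  file=stream)
        return 1
    return 0
```

The summary keeps the exception type, the message and the location of the failing call, so a permissions error like the one above would still show which operation was denied.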

> Success!!! I managed to launch my first spark cluster in the cloud. Thank you very much for flintrock. It's exactly what I was looking for and I was ready to program such tool myself. You saved me countless hours.

Glad to hear Flintrock is helping you.

> Even if successful, it takes 5+ minutes for the message "Launching x instances..." to appear. It's not clear where the problem is, in flintrock or in AWS.

Seems like an AWS issue that may be specific to your region or network. I wonder if you see the same behavior if you try running Flintrock a) from a different host, b) on a different network, or c) against a different AWS region. That would help narrow down where the issue is.

nchammas commented 6 years ago

@reisfe - Would you like to continue debugging the long pauses when attempting to launch clusters, or shall we close this issue?

reisfe commented 6 years ago

Let me try to debug the long pauses for a few more days. Even if the problem is not in flintrock, other people might run into the same issue.

reisfe commented 6 years ago

I pinpointed the problem to the call to 'http://checkip.amazonaws.com' used to obtain the user's external IP address in the function get_or_create_flintrock_security_groups.

Apparently there's some compatibility problem between Python's urllib library and the AWS web service that returns the user's IP address (I hit it with both the urllib and requests packages). I couldn't figure out what the exact problem is. urllib works with other similar services, and the AWS checkip service works via other channels (e.g. browser, curl). I found the problem on two of my machines (one GNU/Linux, one macOS), and it does not seem to be my network, as I got the same result using a different Wi-Fi hotspot.

SOLUTION: I patched the file ec2.py in the installed flintrock (using python3 -m site to find the path to the site-packages) and replaced the AWS service with another one, https://api.ipify.org/

SUGGESTIONS: Flintrock could have a time-out (although in the case above it did get the IP address after 5 minutes) combined with the use of other alternative external services to get the user's IP address.
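The timeout-plus-fallback idea could look roughly like the sketch below. It is not Flintrock's code: get_client_ip, fetch_ip, IP_SERVICES, the 5-second default and the order of the services are all assumptions made for illustration. Only the two URLs come from this thread.

```python
import urllib.request

# checkip.amazonaws.com is the service discussed above; api.ipify.org is the
# replacement used in the local patch. The order here is an assumption.
IP_SERVICES = (
    'http://checkip.amazonaws.com/',
    'https://api.ipify.org/',
)


def fetch_ip(url, timeout):
    """Return the body of `url` as stripped text, giving up after `timeout` seconds.

    Note: urlopen's timeout applies to each blocking socket operation,
    not to the request as a whole.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8').strip()


def get_client_ip(services=IP_SERVICES, timeout=5, fetch=fetch_ip):
    """Try each service in order and return the first answer.

    `fetch` is injectable so the fallback logic can be exercised offline.
    """
    errors = []
    for url in services:
        try:
            return fetch(url, timeout)
        except OSError as e:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            errors.append('{}: {}'.format(url, e))
    raise RuntimeError(
        'Could not determine client IP. Tried: ' + '; '.join(errors))
```

With this shape, a stalled first service costs at most roughly the timeout before the next one is tried, and the final error lists every service that failed.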

nchammas commented 6 years ago

Thanks for this analysis @reisfe. I'm glad we've pinpointed where the hang is coming from.

I'm wary of adding another IP service without understanding the cause of the hangs, because another user may come along later and report that they are experiencing the same hangs but with this other service. We already know between me and you that not everyone experiences the same behavior when hitting checkip.amazonaws.com.

I'd like to dig a bit further to see if we can figure out why you are experiencing hangs with urllib.request. That would be my preferred solution. Failing that, I suppose the next best thing is to add the timeout and alternate IP service as you suggested.

This answer, and several answers to similar questions on Stack Overflow, suggest that the hang may be caused by a missing User-Agent string. Perhaps the User-Agent is set differently across our machines.

Do you still experience the hang if you change the lines you pointed to, so that they read as follows?

request = urllib.request.Request('http://checkip.amazonaws.com/', headers={'User-Agent': 'Flintrock/0.9.0'})
flintrock_client_ip = urllib.request.urlopen(request).read().decode('utf-8').strip()

I would also try the following User-Agent strings for good measure:

nchammas commented 4 years ago

Closing due to inactivity. Happy to reopen this and continue investigating if anyone is still experiencing this issue and the User-Agent suggestion doesn't help.