nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

VPC in subnet without internet gateway #375

Open a-cesari opened 2 months ago

a-cesari commented 2 months ago

Hi, I'm trying to launch a cluster in a private VPC without direct internet access (the internet is reachable only through a proxy). I set up the proxy and no_proxy variables in the user data script. However, the Hadoop health check fails during installation because of the proxy. The ec2 private ip address (which is called by the health check) should go inside the no_proxy. I have added the full VPC CIDR block to the no_proxy environment variable, but urllib does not support CIDR blocks in no_proxy. I have two ideas, but I'd like your suggestions: 1) Add every VPC IP address to no_proxy. This could work, but I'm having trouble doing it in the user data because the environment variable somehow gets truncated beyond a certain length. 2) Add a CIDR block parser to the health check (I can try to propose a solution if you agree).

Can this be handled in other ways?

Thanks, Andrea

nchammas commented 2 months ago

> The ec2 private ip address (which is called by the health check) should go inside the no_proxy

Could you elaborate on this, please? What exactly is no_proxy and where is that defined?

a-cesari commented 2 months ago

no_proxy is an environment variable that defines which destinations should not go through the proxy. I'm defining it in the user-data script of the EC2 machine. This is typically not needed if you are not using a proxy, but I am, since I'm inside a fully private VPC and the only way to get internet access is through a proxy. However, the IP addresses of the EC2 machines in the VPC should not be proxied, so you need to add them to the no_proxy variable. To be comprehensive, I'm adding the full VPC CIDR block to no_proxy. The problem is that urllib (which Flintrock uses to perform the Hadoop health check) does not support CIDR block definitions in the no_proxy variable. So when Flintrock makes a request to the Hadoop API (at the IP address of an EC2 machine in the cluster), the request goes through the proxy and is then blocked, as expected. I hope this explains it a bit better; otherwise, let me know. This may give a bit more context on the no_proxy variable: https://about.gitlab.com/blog/2021/01/27/we-need-to-talk-no-proxy/
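The behavior described above can be checked in isolation with a minimal sketch. It assumes urllib's `proxy_bypass_environment` helper, which exists in the standard library but is not formally documented. The wrapper name `bypasses_proxy` and the addresses are made up for illustration and are not taken from the actual cluster.

```python
import urllib.request


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if urllib would skip the proxy for `host` under this no_proxy value."""
    # urllib stores the no_proxy setting under the 'no' key of its proxies mapping.
    return bool(urllib.request.proxy_bypass_environment(host, {"no": no_proxy}))


# A CIDR entry is compared as a plain string, so an address inside the block
# is not recognized as a bypass target.
print(bypasses_proxy("10.0.1.5", "10.0.0.0/16"))

# An exact IP entry is matched.
print(bypasses_proxy("10.0.1.5", "localhost,10.0.1.5"))
```

The matching is based on string equality and suffix comparison, which is why only literal hosts or domain suffixes take effect.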

Thanks

nchammas commented 1 month ago

Can you share the error that Flintrock spits out during the health check?

Can you also share an example config, including for NO_PROXY, so I can try to recreate this problem and better understand it?

a-cesari commented 1 month ago

> Can you share the error that Flintrock spits out during the health check?

> Can you also share an example config, including for NO_PROXY, so I can try to recreate this problem and better understand it?

Previously, I encountered a connection timeout error, but I've since resolved it (refer to the details below). If it's necessary for your troubleshooting, I can revert to the previous code version and execute it again to capture the error output. Please inform me if that is required.

Here’s how I addressed the issue:

I modified the health_check function within the services.py file to include the master host in the no_proxy environment variable. You can view the change at this GitHub link:

https://github.com/a-cesari/flintrock/blob/dab1955f9f12b2e50f7f5aa90826e60435f0fb52/flintrock/services.py#L285

I appended the master host to the no_proxy environment variable within the def health_check(self, master_host: str): function by adding the following line:

os.environ['no_proxy']=os.environ['no_proxy']+','+master_host
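Note that the line above raises a KeyError when no_proxy is not set at all, and it appends the host again on every call. A slightly more defensive variant might look like the sketch below. The helper name `add_to_no_proxy` is hypothetical and is not part of Flintrock or of the linked fork.

```python
import os


def add_to_no_proxy(host: str) -> str:
    """Append `host` to the no_proxy environment variable if it is not already listed.

    Hypothetical helper for illustration only.
    """
    # Fall back to an empty string when the variable is not set.
    current = os.environ.get("no_proxy", "")
    # Split into entries, dropping blanks left by stray commas or whitespace.
    entries = [entry.strip() for entry in current.split(",") if entry.strip()]
    if host not in entries:
        entries.append(host)
    os.environ["no_proxy"] = ",".join(entries)
    return os.environ["no_proxy"]
```

Inside the health check, this would be called as `add_to_no_proxy(master_host)` before the request is made.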

The rationale for this modification is as follows:

In enterprise settings with a private VPC and subnets, internet access from EC2 instances is typically routed through a proxy server. However, internal VPC traffic, such as pinging another instance or accessing AWS internal DNS, should bypass the proxy to avoid being blocked. The no_proxy environment variable is used to specify destinations that should not be routed through the proxy.

For the health check, the master host's IP address could be any within the VPC. My solution was to add the entire VPC CIDR block to the no_proxy variable. Python's urllib, however, does not support CIDR blocks in no_proxy.
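One way a CIDR-aware check could look is sketched below, using the standard `ipaddress` module. The function name `host_matches_no_proxy_cidr` is an assumption made for illustration. It is not existing Flintrock code, and it only handles hosts given as literal IP addresses.

```python
import ipaddress


def host_matches_no_proxy_cidr(host: str, no_proxy: str) -> bool:
    """Return True if `host` is an IP address inside any CIDR block listed in `no_proxy`.

    Hypothetical helper for illustration only.
    """
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a hostname); leave it to urllib's own matching.
        return False

    for entry in no_proxy.split(","):
        entry = entry.strip()
        if "/" not in entry:
            continue  # Plain hosts and domain suffixes are not CIDR blocks.
        try:
            # strict=False tolerates entries with host bits set, such as 10.0.1.5/16.
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            continue  # Skip entries that are not valid networks.
        if address in network:
            return True
    return False
```

A health check could consult such a function first and, on a match, either add the host to no_proxy or make the request without any proxy.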

In my opinion, there are several possible options, should you decide to handle these cases:

Let me know if I can help or if you need more info. Thanks,