AWS security rules are incorrect

a-nldisr commented 6 years ago

Hi all,

I have been using this Terraform script to see how DC/OS deployments work on AWS. I have found two issues with it:

1. A security group rule is missing for the bootstrap node, so agents and masters cannot fetch the scripts and requirements from it. After I manually attached the existing cluster-security-group to the bootstrap node, everything continued and the nodes could pull the requirements.
2. A security group rule is missing for the masters. It looks like the run.sh script tries to curl the local IP of the master. Since there is no security group rule, the master cannot check its state and hangs forever in: null_resource.master[0] (remote-exec): loading DC/OS...
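For readers following along, the manual workaround from the first point (attaching the cluster security group to the bootstrap node) could be expressed in Terraform roughly as below. This is only a sketch: every resource and variable name is hypothetical, and the self-referencing rule is an assumption about how a cluster group of this kind is typically defined, not a copy of the repository's code.

```hcl
# Hypothetical names throughout; not taken from terraform-dcos.
variable "bootstrap_ami" {}
variable "subnet_id" {}
variable "vpc_id" {}

# Stand-in for the existing cluster security group.
# Assumption: members of the group may talk to each other on all ports.
resource "aws_security_group" "cluster" {
  name   = "cluster-security-group"
  vpc_id = "${var.vpc_id}"

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }
}

resource "aws_instance" "bootstrap" {
  ami           = "${var.bootstrap_ami}"
  instance_type = "t2.micro"
  subnet_id     = "${var.subnet_id}"

  # Attaching the cluster group puts the bootstrap node inside the
  # self-referencing rule, so masters and agents can reach it.
  vpc_security_group_ids = ["${aws_security_group.cluster.id}"]
}
```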

The desired_cluster_profile:

num_of_masters = "3"
num_of_private_agents = "1"
num_of_public_agents = "1"
aws_region = "eu-central-1"
aws_bootstrap_instance_type = "t2.micro"
aws_master_instance_type = "c4.large"
aws_agent_instance_type = "c4.large"
aws_public_agent_instance_type = "c4.large"
dcos_master_discovery = "master_http_loadbalancer"
dcos_exhibitor_storage_backend = "aws_s3"
dcos_exhibitor_explicit_keys = "false"
ssh_key_name = "my_key"
# Inbound Master Access
admin_cidr = "my_public_ip"
os="centos_7.4"

I have also tried the default instance types provided; it made no difference.

bernadinm commented 6 years ago

Hello @a-nldisr, I have just reviewed this and deployed successfully. At first it does look like the security groups may be the problem, because of how long the cluster is up and how long the public IP address takes to resolve. However, it seems that AWS simply takes a bit longer to populate the security groups.

I noticed this delay in my own deploy, and I did not have to modify any of the existing security groups for it to complete.

As you can see, the ELB security groups are in place here: https://github.com/dcos/terraform-dcos/blob/master/aws/master.tf#L90

The security group for the public IP address of the master is in place as well: https://github.com/dcos/terraform-dcos/blob/master/aws/main.tf#L139-L145

You can see my desired_cluster_profile here:

$ cat test.tfvars
num_of_masters = "3"
num_of_private_agents = "1"
num_of_public_agents = "1"
aws_region = "eu-central-1"
aws_bootstrap_instance_type = "t2.micro"
aws_master_instance_type = "c4.large"
aws_agent_instance_type = "c4.large"
aws_public_agent_instance_type = "c4.large"
dcos_master_discovery = "master_http_loadbalancer"
dcos_exhibitor_storage_backend = "aws_s3"
dcos_exhibitor_explicit_keys = "false"
ssh_key_name = "default"
# Inbound Master Access
admin_cidr = "0.0.0.0/0"
os="centos_7.4"

Completed Output

null_resource.master.0: Still creating... (9m30s elapsed)
null_resource.master.1: Still creating... (9m30s elapsed)
null_resource.master.2: Still creating... (9m30s elapsed)
null_resource.master[2] (remote-exec): loading DC/OS...
null_resource.master[1] (remote-exec): loading DC/OS...
null_resource.master[0] (remote-exec): loading DC/OS...
null_resource.master.1: Still creating... (9m40s elapsed)
null_resource.master.2: Still creating... (9m40s elapsed)
null_resource.master.0: Still creating... (9m40s elapsed)
null_resource.master[2] (remote-exec): loading DC/OS...
null_resource.master[1] (remote-exec): loading DC/OS...
null_resource.master[0] (remote-exec): loading DC/OS...
null_resource.master.1: Still creating... (9m50s elapsed)
null_resource.master.0: Still creating... (9m50s elapsed)
null_resource.master.2: Still creating... (9m50s elapsed)
null_resource.master[2]: Creation complete after 9m56s (ID: 8372252553155085616)
null_resource.master[1]: Creation complete after 9m56s (ID: 3737308134214863160)
null_resource.master[0]: Creation complete after 9m58s (ID: 1970291435718503002)

Apply complete! Resources: 22 added, 0 changed, 0 destroyed.

Outputs:

Bootstrap Public IP Address = 18.196.209.66
Master ELB Address = mbernadin-tfee0f-pub-mas-elb-1688980129.eu-central-1.elb.amazonaws.com
Mesos Master Public IP = [
    18.196.16.163,
    18.195.126.199,
    35.157.11.108
]
Private Agent Public IP Address = [
    18.196.207.85
]
Public Agent ELB Address = mbernadin-tfee0f-pub-agt-elb-1896123922.eu-central-1.elb.amazonaws.com
Public Agent Public IP Address = [
    18.196.227.109
]

We intentionally used the public IP for the query so that, when users wanted to consume it, they would know it was working because Terraform had already checked it.

Since I do not see a problem, please let me know if you have any objections based on these findings so far.

a-nldisr commented 6 years ago

Hi,

Reading the scripts, I found this description: description = "Used to allow HTTP and HTTPS access to DC/OS Adminrouter from the outside world specified by the user source range."

Based on this, I expected to be able to limit this to my private IP to secure access to the DC/OS cluster. I don't yet understand how adding the cluster-security-group to the bootstrap node fixed the connection issues, but I will give it a try without changing the admin_cidr.
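To make the expectation above concrete, a security group driven by admin_cidr might look something like the following. The resource name, the vpc_id variable, and the port list are assumptions made for illustration; only the description string and the admin_cidr variable name come from this thread.

```hcl
# Illustrative sketch; names other than admin_cidr are hypothetical.
variable "admin_cidr" {}
variable "vpc_id" {}

resource "aws_security_group" "admin" {
  name        = "admin-security-group"
  description = "Used to allow HTTP and HTTPS access to DC/OS Adminrouter from the outside world specified by the user source range."
  vpc_id      = "${var.vpc_id}"

  # HTTP, limited to the user-supplied source range
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["${var.admin_cidr}"]
  }

  # HTTPS, limited to the user-supplied source range
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["${var.admin_cidr}"]
  }
}
```

With a rule like this, the value has to be in CIDR notation, so a single address would be written with a /32 suffix, for example admin_cidr = "203.0.113.7/32" (203.0.113.7 is a documentation address standing in for a real public IP).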

bernadinm commented 6 years ago

@a-nldisr, I wanted to follow up with you on this. It turns out the issue you ran into is valid: once the admin_cidr block was changed, there was no internal route to the AWS public IP address. I submitted a change to query only the private IP address instead. This will resolve the problem if you ever decide to change admin_cidr to something stricter.

Fixed: https://github.com/dcos/terraform-dcos/commit/8a34ba35a495455453852249570b7a2c7a84cdf2
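As a rough illustration of the idea behind the fix (polling each master on its private IP rather than its public one), a readiness check could be sketched as follows. This is not the code from the linked commit: the resource names, the SSH user, the polled URL path, and the 10-second interval are all assumptions, and aws_instance.master is assumed to be defined elsewhere in the configuration.

```hcl
# Hypothetical sketch; not the contents of the linked commit.
variable "num_of_masters" {}

resource "null_resource" "master_ready" {
  count = "${var.num_of_masters}"

  # SSH still goes through the public IP, which admin_cidr permits
  # for the machine running Terraform. The user name is an assumption.
  connection {
    host = "${element(aws_instance.master.*.public_ip, count.index)}"
    user = "centos"
  }

  provisioner "remote-exec" {
    inline = [
      # The health check itself targets the private IP, so it does not
      # depend on the public address being reachable from inside the VPC.
      "until curl --silent --fail --output /dev/null http://${element(aws_instance.master.*.private_ip, count.index)}/; do echo 'loading DC/OS...'; sleep 10; done",
    ]
  }
}
```

Because the check stays on the private network, tightening admin_cidr only affects who can reach the cluster from outside, not whether the provisioner can confirm the master is up.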

a-nldisr commented 6 years ago

Super! 👍

mesosphere-backup / terraform-dcos

AWS security rules are incorrect #35