Closed a-nldisr closed 6 years ago
Hello @a-nldisr, I have just reviewed this. At first it does look like the security groups may have been a problem, given how long the cluster was up and how long the public IP address took to resolve. However, it seems that AWS simply takes a bit longer to populate the security groups.
I have run this deploy myself, and I did not have to modify any of the existing security groups for it to complete.
As you can see, the ELB security groups are in place here: https://github.com/dcos/terraform-dcos/blob/master/aws/master.tf#L90
You can also see that the security group for the public IP address of the master is in place here as well: https://github.com/dcos/terraform-dcos/blob/master/aws/main.tf#L139-L145
You can see my desired_cluster_profile here:
$ cat test.tfvars
num_of_masters = "3"
num_of_private_agents = "1"
num_of_public_agents = "1"
aws_region = "eu-central-1"
aws_bootstrap_instance_type = "t2.micro"
aws_master_instance_type = "c4.large"
aws_agent_instance_type = "c4.large"
aws_public_agent_instance_type = "c4.large"
dcos_master_discovery = "master_http_loadbalancer"
dcos_exhibitor_storage_backend = "aws_s3"
dcos_exhibitor_explicit_keys = "false"
ssh_key_name = "default"
# Inbound Master Access
admin_cidr = "0.0.0.0/0"
os="centos_7.4"
null_resource.master.0: Still creating... (9m30s elapsed)
null_resource.master.1: Still creating... (9m30s elapsed)
null_resource.master.2: Still creating... (9m30s elapsed)
null_resource.master[2] (remote-exec): loading DC/OS...
null_resource.master[1] (remote-exec): loading DC/OS...
null_resource.master[0] (remote-exec): loading DC/OS...
null_resource.master.1: Still creating... (9m40s elapsed)
null_resource.master.2: Still creating... (9m40s elapsed)
null_resource.master.0: Still creating... (9m40s elapsed)
null_resource.master[2] (remote-exec): loading DC/OS...
null_resource.master[1] (remote-exec): loading DC/OS...
null_resource.master[0] (remote-exec): loading DC/OS...
null_resource.master.1: Still creating... (9m50s elapsed)
null_resource.master.0: Still creating... (9m50s elapsed)
null_resource.master.2: Still creating... (9m50s elapsed)
null_resource.master[2]: Creation complete after 9m56s (ID: 8372252553155085616)
null_resource.master[1]: Creation complete after 9m56s (ID: 3737308134214863160)
null_resource.master[0]: Creation complete after 9m58s (ID: 1970291435718503002)
Apply complete! Resources: 22 added, 0 changed, 0 destroyed.
Outputs:
Bootstrap Public IP Address = 18.196.209.66
Master ELB Address = mbernadin-tfee0f-pub-mas-elb-1688980129.eu-central-1.elb.amazonaws.com
Mesos Master Public IP = [
18.196.16.163,
18.195.126.199,
35.157.11.108
]
Private Agent Public IP Address = [
18.196.207.85
]
Public Agent ELB Address = mbernadin-tfee0f-pub-agt-elb-1896123922.eu-central-1.elb.amazonaws.com
Public Agent Public IP Address = [
18.196.227.109
]
We intentionally queried the public IP so that, by the time the user wanted to consume it, they would know it was working because Terraform had already checked it.
Since I do not see a problem, please let me know if you have any objections based on these findings so far.
Hi,
Reading the scripts, I found this description:
description = "Used to allow HTTP and HTTPS access to DC/OS Adminrouter from the outside world specified by the user source range."
Based on this, I expected to be able to limit this to my private IP to secure access to the DC/OS cluster. I don't yet understand how adding the cluster-security-group to the bootstrap node fixed the connection issues, but I will give it a try without changing the admin_cidr.
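For illustration, limiting admin_cidr to a single address means giving it a /32 block. The sketch below is not part of the terraform-dcos repository: the helper name to_host_cidr is hypothetical, and 203.0.113.7 is a documentation-only placeholder address to be replaced with your own.

```shell
#!/bin/sh
# Hypothetical helper (not from terraform-dcos): turn a single IPv4 address
# into a /32 CIDR block; values that already contain a prefix pass through.
to_host_cidr() {
    case $1 in
        */*) echo "$1" ;;       # already in CIDR form, leave unchanged
        *)   echo "$1/32" ;;    # single host, append the /32 prefix
    esac
}

# 203.0.113.7 is a placeholder from the documentation range; use your own IP.
ADMIN_CIDR=$(to_host_cidr "203.0.113.7")

# Print the line in the same form as the tfvars file shown in this thread.
echo "admin_cidr = \"$ADMIN_CIDR\""
```

The printed line can replace the admin_cidr entry in a tfvars file such as test.tfvars, or the value can be passed on the command line with Terraform's -var option alongside -var-file.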
@a-nldisr, I wanted to follow up with you on this. It turns out that the issue you ran into occurred because, once the admin_cidr block was changed, there was no internal route to the AWS public IP address, which is a valid issue. I went ahead and submitted a change to query only the private IP address instead. This will resolve the problem if you ever decide to change the admin_cidr to something stricter.
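As a rough sketch of the kind of readiness check discussed here, the following loop polls a URL until it answers or a retry budget runs out. It is not the actual run.sh from the repository: the function name wait_for_http, the default retry count and delay, and the progress message format are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from terraform-dcos: poll a URL until it
# responds successfully or the retry budget is exhausted.
# Usage: wait_for_http URL [RETRIES] [DELAY_SECONDS]
wait_for_http() {
    url=$1
    retries=${2:-60}
    delay=${3:-10}
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        # --fail makes curl exit non-zero on HTTP errors (status >= 400);
        # --max-time bounds each attempt so a silently dropped connection
        # (for example, one blocked by a security group) cannot hang forever.
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "ready after $attempt attempt(s): $url"
            return 0
        fi
        echo "loading DC/OS... (attempt $attempt of $retries)"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "gave up waiting for $url" >&2
    return 1
}

# Example call (placeholder private address, assumed for illustration):
# wait_for_http "http://10.0.0.10/" 60 10
```

Pointing such a check at the private address keeps it independent of admin_cidr, and the bounded retry count turns an unreachable endpoint into an explicit failure instead of an endless "loading DC/OS..." loop.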
Fixed: https://github.com/dcos/terraform-dcos/commit/8a34ba35a495455453852249570b7a2c7a84cdf2
Super! 👍
Hi all,
I have been using this Terraform script to see how DC/OS deployments work on AWS. I have found two issues with it:
A security group rule is missing for the bootstrap node. This means that the agents and masters cannot fetch the scripts and requirements from the bootstrap node. After I manually attached the existing cluster-security-group to the bootstrap node, they all continued and could pull the requirements.
A security group rule is missing for the masters. From what it looks like, the run.sh script tries to curl the local IP of the master. Since there is no security group rule, the master cannot check its state and hangs forever at:
null_resource.master[0] (remote-exec): loading DC/OS...
The desired_cluster_profile:
I have tried with the default instance types provided, no difference.