retaildevcrews / ngsa

Next Generation Symmetric Apps

Default setup of aks-secure-baseline with NGSA #588

Closed AAkindele closed 3 years ago

AAkindele commented 3 years ago

Description

What:

Why:

Where:

Tasks

Acceptance Criteria

Constraints

References:

chenmliu commented 3 years ago

Suggestions:

  1. Change the deployment region to the one closest to you for better latency. Update the region1 variable in the following file to another supported region, e.g., eastus: ..\enterprise_scale\construction_sets\aks\online\aks_secure_baseline\configuration\global_settings.tfvars
  2. Prefix all resource group names with a unique string for easy filtering. Prefix all the name variables in the following file with a unique string, e.g., demo or your alias/username: ..\enterprise_scale\construction_sets\aks\online\aks_secure_baseline\configuration\global_settings.tfvars
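
The region change in suggestion 1 could be scripted roughly as follows. This is only a sketch: set_region1 is a hypothetical helper, and it assumes the file contains a line of the form region1 = "<region>", which should be checked against your copy of global_settings.tfvars.

```shell
# Sketch: set the quoted value of region1 in a tfvars file.
# Assumption: the file contains a line like   region1 = "southeastasia"
set_region1() {
  file=$1
  region=$2
  tmp=$(mktemp)
  # Rewrite only the quoted value on the region1 line; all other lines pass through unchanged.
  sed -E "s/^([[:space:]]*region1[[:space:]]*=[[:space:]]*)\"[^\"]*\"/\1\"${region}\"/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (path written relative to the aks_secure_baseline folder):
# set_region1 configuration/global_settings.tfvars eastus
```

Writing through a temporary file avoids relying on sed -i, whose syntax differs between GNU and BSD sed.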

Observations:

  1. When you run the script, a unique 4-character string is added to all resource group names for easy filtering, so the script can be run against the same subscription multiple times without causing name conflicts.

Troubleshooting the "cluster admin not set" issue:

If you run the script from the CAF (Cloud Adoption Framework) repo (https://github.com/Azure/caf-terraform-landingzones-starter/tree/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline) as is, the code that creates an AAD user group and adds it as cluster admin is commented out by default, so you will run into an issue at step 6 in this doc (https://github.com/Azure/caf-terraform-landingzones-starter/blob/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline/02-aks.md). When you run kubectl get pods -n a0008, you will see that neither of the two traefik ingress controllers is ready (i.e., both show 0/1). When you open the application URL in a browser, you will see a "502 Bad Gateway" error message.

If you try to connect to the workload directly by running kubectl run curl -n a0008 -i --tty --rm --image=mcr.microsoft.com/azure-cli --limits='cpu=200m,memory=128Mi', you will get the following error message:

Error from server (Forbidden): pods is forbidden: User "abc@xyz" cannot create resource "pods" in API group "" in the namespace "a0008"
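
Before launching the curl pod, the same permission can be checked up front. The sketch below wraps kubectl auth can-i in a hypothetical helper, can_create_pods, and assumes kubectl is already pointed at the cluster created by the script.

```shell
# Sketch: ask the API server whether the current user may create pods in a namespace.
# Returns success only when kubectl answers "yes".
can_create_pods() {
  ns=$1
  # `kubectl auth can-i` prints "yes" or "no" for the current user.
  answer=$(kubectl auth can-i create pods -n "$ns" 2>/dev/null)
  [ "$answer" = "yes" ]
}

# Usage:
# if can_create_pods a0008; then echo "allowed"; else echo "forbidden: check the cluster admin group"; fi
```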

A temporary workaround:

  1. On the Azure portal, open the Kubernetes service created by Terraform and click 'Cluster configuration' in the middle menu. You will see that "Admin Azure AD groups" is not set. Click the edit button, search for "PnP Deploy" in the list, and save the change.
  2. Repeat the steps in the following doc, starting from step 4. In step 6, verify that both traefik ingress controllers are up and running: https://github.com/Azure/caf-terraform-landingzones-starter/blob/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline/02-aks.md

Now if you re-run kubectl run curl -n a0008 -i --tty --rm --image=mcr.microsoft.com/azure-cli --limits='cpu=200m,memory=128Mi' to connect to the workload directly, you won't see any error message. In the open shell, type curl -kI https://bu0001a0008-00.aks-ingress.contoso.com -w '%{remote_ip}\n' and you will see an IP address in the output. It should match the value of the record in the Private DNS Zone created by the script. To verify, open the Private DNS Zone on the Azure portal; the IP address is listed in the "Value" column.

Solution 1:

  1. The script has the logic to create an AAD group with a pre-defined ID and name, but the file that contains it is currently ignored. To have it executed, change the extension of the following file from ignore to tfvars: ..\enterprise_scale\construction_sets\aks\online\aks_secure_baseline\configuration\iam\iam_aad.ignore
  2. The script also has the code to add the newly created AAD group as the cluster admin, but it is currently commented out. Uncomment the following lines in the role_based_access_control block in the following file: ..\enterprise_scale\construction_sets\aks\online\aks_secure_baseline\configuration\aks.tfvars
        admin_group_object_ids = ["7304e4e7-b148-4ada-a135-6049c702d21e"]
        azuread_groups = {
          keys = ["aks_cluster_re1_admins"]
        }
  3. If you try to hit the URL, the same "502 Bad Gateway" issue will still be there. The last step is to add yourself as an owner of the newly created AAD group on the Azure portal. You will already have been added as a member when the script finished. If you run kubectl get pods -n a0008 again, you will see both traefik ingress controllers up and running.
  4. Further improvement: update the iam_aad.tfvars file to automatically add yourself as the AAD group owner. Potentially parameterize your user info in a separate tfvars file.
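
Step 1 of Solution 1 (renaming iam_aad.ignore so that it is picked up as a tfvars file) could be scripted as below. This is a sketch only: enable_iam_aad is a hypothetical helper, and the directory argument is assumed to be the configuration/iam folder of your checkout.

```shell
# Sketch: rename iam_aad.ignore to iam_aad.tfvars inside the given directory.
# Does nothing (and succeeds) if the file has already been renamed.
enable_iam_aad() {
  iam_dir=$1
  if [ -f "$iam_dir/iam_aad.ignore" ]; then
    mv "$iam_dir/iam_aad.ignore" "$iam_dir/iam_aad.tfvars"
  fi
}

# Usage (path written relative to the aks_secure_baseline folder):
# enable_iam_aad configuration/iam
```

Uncommenting the lines in aks.tfvars is left as a manual edit here, since the exact comment layout of that file is not shown in this thread.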

Solution 2:

  1. Update the aks Terraform file to set the "PnP Deploy" AAD group as the cluster admin by including its ID and name in role_based_access_control.admin_group_object_ids and role_based_access_control.azuread_groups, respectively.
  2. Further improvement: parameterize these two values either in one of the existing tfvars files (e.g., global_settings.tfvars) or in a new AAD-specific tfvars file, so they are easy to customize and update in one place.

Other notes:

https://github.com/Azure/caf-terraform-landingzones-starter/blob/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline/02-aks.md

When running kubectl get pods -n a0008, make sure both traefik ingress controllers are running. Expected output:

NAME                                          READY   STATUS    RESTARTS   AGE
aspnetapp-deployment-7ccf7cb7f9-6ltsd         1/1     Running   0          61s
aspnetapp-deployment-7ccf7cb7f9-wh2lp         1/1     Running   0          61s
traefik-ingress-controller-844fcdd859-k7dgj   0/1     Running   0          58s
traefik-ingress-controller-844fcdd859-p6g8w   0/1     Running   0          58s
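
To pick out pods whose READY column is not fully satisfied (for example 0/1), the output of kubectl get pods can be filtered with a small awk script. The helper name not_ready_pods is hypothetical, and the sketch assumes the default column layout with NAME first and READY second.

```shell
# Sketch: read `kubectl get pods` output on stdin and print the names of pods
# whose READY column (ready/total) shows fewer ready containers than total.
not_ready_pods() {
  # NR > 1 skips the header row; $2 is the READY column, e.g. "0/1".
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage:
# kubectl get pods -n a0008 | not_ready_pods
```

An empty result means every listed pod reports all of its containers as ready.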

Expected output in the browser: [image]