terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

stub_domains test failed #68

Closed czka closed 5 years ago

czka commented 5 years ago

As of the current master (3f7527e583ffa07e6a06250844e07c38556a4488 at the time of writing), with the following test/fixtures/shared/terraform.tfvars:

project_id="redacted-project-name"
credentials_path_relative="../../../credentials.json"
region="europe-west1"
zones=["europe-west1-c"]
compute_engine_service_account="redacted@developer.gserviceaccount.com"

make docker_build_kitchen_terraform, make docker_run, kitchen create, and kitchen converge all passed fine.
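For reference, that is roughly the following sequence (the kitchen commands run inside the container started by make docker_run):

make docker_build_kitchen_terraform
make docker_run
kitchen create
kitchen converge
kitchen verify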

kitchen verify passed fine for deploy_service, node_pool, shared_vpc, simple_regional and simple_zonal. It failed at stub_domains as follows:

Verifying stub_domains

Profile: stub_domain
Version: (not specified)
Target:  local://

  ×  gcloud: Google Compute Engine GKE configuration (1 failed)
     ✔  Command: `gcloud --project=redacted-project-name container clusters --zone=europe-west1 describe stub-domains-cluster-zwwa --format=json` exit_status should eq 0
     ✔  Command: `gcloud --project=redacted-project-name container clusters --zone=europe-west1 describe stub-domains-cluster-zwwa --format=json` stderr should eq ""
     ✔  Command: `gcloud --project=redacted-project-name container clusters --zone=europe-west1 describe stub-domains-cluster-zwwa --format=json` cluster is running
     ×  Command: `gcloud --project=redacted-project-name container clusters --zone=europe-west1 describe stub-domains-cluster-zwwa --format=json` cluster has the expected addon settings

     expected: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{}}
          got: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{"disabled"=>true}}

     (compared using ==)

     Diff:
     @@ -1,5 +1,5 @@
      "horizontalPodAutoscaling" => {},
      "httpLoadBalancing" => {},
      "kubernetesDashboard" => {"disabled"=>true},
     -"networkPolicyConfig" => {},
     +"networkPolicyConfig" => {"disabled"=>true},

  ✔  kubectl: Kubernetes configuration
     ✔  kubernetes configmap kube-dns is created by Terraform
     ✔  kubernetes configmap kube-dns reflects the stub_domains configuration
     ✔  kubernetes configmap ipmasq is created by Terraform
     ✔  kubernetes configmap ipmasq is configured properly

Profile Summary: 1 successful control, 1 control failure, 0 controls skipped
Test Summary: 7 successful, 1 failure, 0 skipped
>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: 1 actions failed.
>>>>>>     Verify failed on instance <stub-domains-local>.  Please see .kitchen/logs/stub-domains-local.log for more details
>>>>>> ----------------------
>>>>>> Please see .kitchen/logs/kitchen.log for more details
>>>>>> Also try running `kitchen diagnose --all` for configuration

I have saved the .kitchen/logs/kitchen.log and the kitchen diagnose --all output, so let me know if you need them.

Jberlinsky commented 5 years ago

@czka Could you kindly run the tests again and see if you're able to consistently reproduce this error? I just ran the stub-domains test and was unable to reproduce this issue.

czka commented 5 years ago

I tried the kitchen destroy -> create -> converge -> verify cycle two more times (once for stub-domains-local alone, and once more for the whole set of tests). The same issue keeps cropping up:

expected: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{}}
     got: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{"disabled"=>true}}
czka commented 5 years ago

@Jberlinsky

I removed .kitchen/ and all the test/fixtures/*/.terraform/ dirs to start afresh. make test_integration_docker completed fine in around 90 minutes.

But then I was able to reproduce the error again by running make docker_run, then kitchen create stub-domains, kitchen converge stub-domains, and kitchen verify stub-domains.

Jberlinsky commented 5 years ago

@czka Thanks for the update; I'll continue to try to reproduce. Can you tell me if running kitchen converge stub-domains twice, instead of just once, resolves the problem?

czka commented 5 years ago

BTW, is it expected that root owns the following dirs created during the tests? I ran them as a regular user.

$ ls -ld .kitchen
drwxr-xr-x 3 root root 4096 Jan 14 16:41 .kitchen

$ find . -type d -name .terraform | xargs ls -ld
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/deploy_service/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/node_pool/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/shared_vpc/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/simple_regional/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/simple_zonal/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/stub_domains/.terraform

$ ls -ld test/fixtures/stub_domains/terraform.tfstate.d/
drwxr-xr-x 3 root root 4096 Jan 14 16:41 test/fixtures/stub_domains/terraform.tfstate.d/
czka commented 5 years ago

@Jberlinsky kitchen verify stub-domains passed without errors after running kitchen converge stub-domains a second time.
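For reference, the addon flag the failing control asserts on can be checked directly between the two converges; a sketch, with PROJECT, CLUSTER, and ZONE as placeholders and jq used only to narrow the output:

gcloud --project=PROJECT container clusters describe CLUSTER --zone=ZONE --format=json | jq '.addonsConfig.networkPolicyConfig'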

aaron-lane commented 5 years ago

This seems like an issue within the API rather than the Terraform configuration. We may simply need to emphasize that a configuration using the module must be applied twice to obtain the expected results. We should also consider raising a ticket against the provider.

Jberlinsky commented 5 years ago

Agreed -- I'll file a PR today/tomorrow to emphasize the need to run kitchen converge twice.

Thanks for reporting this, @czka!

morgante commented 5 years ago

I'd like to do some more digging into why the converge needs to happen twice; it's probably not an issue with the API so much as the provider and/or our config.

In particular, we should note what the plan actually shows for the second converge.

czka commented 5 years ago

@morgante Only now did I notice that a double kitchen converge is already hardcoded in the Makefile: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blame/master/Makefile#L79. A well-known issue with a well-known workaround ;).
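(Paraphrasing that Makefile line rather than quoting it: per suite, the integration target presumably runs something like the sequence below, where <suite> is a placeholder.)

kitchen create <suite>
kitchen converge <suite>
kitchen converge <suite>    # second converge picks up the addons_config drift seen above
kitchen verify <suite>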

morgante commented 5 years ago

@Jberlinsky Do you know why we need to converge twice though?

Jberlinsky commented 5 years ago

I don't recall the specific reason offhand, but the double converge has been present in this repository for quite some time (see https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/5cb2b8b31491db5521fdaa5a2ae105fc01e44bb6/test/integration/gcloud/run.sh#L333).

I'll dig into this a bit.

Jberlinsky commented 5 years ago

For the google_container_cluster.primary resource, the initial plan is as follows:

  + module.example.module.gke.google_container_cluster.primary
      id:                                                         <computed>
      additional_zones.#:                                         <computed>
      addons_config.#:                                            "1"
      addons_config.0.horizontal_pod_autoscaling.#:               "1"
      addons_config.0.horizontal_pod_autoscaling.0.disabled:      "false"
      addons_config.0.http_load_balancing.#:                      "1"
      addons_config.0.http_load_balancing.0.disabled:             "false"
      addons_config.0.kubernetes_dashboard.#:                     "1"
      addons_config.0.kubernetes_dashboard.0.disabled:            "true"
      addons_config.0.network_policy_config.#:                    "1"
      addons_config.0.network_policy_config.0.disabled:           "false"
      cluster_ipv4_cidr:                                          <computed>
      enable_binary_authorization:                                "false"
      enable_kubernetes_alpha:                                    "false"
      enable_legacy_abac:                                         "false"
      enable_tpu:                                                 "false"
      endpoint:                                                   <computed>
      instance_group_urls.#:                                      <computed>
      ip_allocation_policy.#:                                     "1"
      ip_allocation_policy.0.cluster_ipv4_cidr_block:             <computed>
      ip_allocation_policy.0.cluster_secondary_range_name:        "${var.ip_range_pods}"
      ip_allocation_policy.0.services_ipv4_cidr_block:            <computed>
      ip_allocation_policy.0.services_secondary_range_name:       "${var.ip_range_services}"
      logging_service:                                            "logging.googleapis.com"
      maintenance_policy.#:                                       "1"
      maintenance_policy.0.daily_maintenance_window.#:            "1"
      maintenance_policy.0.daily_maintenance_window.0.duration:   <computed>
      maintenance_policy.0.daily_maintenance_window.0.start_time: "05:00"
      master_auth.#:                                              <computed>
      master_version:                                             <computed>
      min_master_version:                                         "1.11.5-gke.5"
      monitoring_service:                                         "monitoring.googleapis.com"
      name:                                                       "${var.name}"
      network:                                                    "${replace(data.google_compute_network.gke_network.self_link, \"https://www.googleapis.com/compute/v1/\", \"\")}"
      network_policy.#:                                           <computed>
      node_config.#:                                              <computed>
      node_pool.#:                                                "1"
      node_pool.0.initial_node_count:                             <computed>
      node_pool.0.instance_group_urls.#:                          <computed>
      node_pool.0.management.#:                                   <computed>
      node_pool.0.max_pods_per_node:                              <computed>
      node_pool.0.name:                                           "default-pool"
      node_pool.0.name_prefix:                                    <computed>
      node_pool.0.node_config.#:                                  "1"
      node_pool.0.node_config.0.disk_size_gb:                     <computed>
      node_pool.0.node_config.0.disk_type:                        <computed>
      node_pool.0.node_config.0.guest_accelerator.#:              <computed>
      node_pool.0.node_config.0.image_type:                       <computed>
      node_pool.0.node_config.0.local_ssd_count:                  <computed>
      node_pool.0.node_config.0.machine_type:                     <computed>
      node_pool.0.node_config.0.oauth_scopes.#:                   <computed>
      node_pool.0.node_config.0.preemptible:                      "false"
      node_pool.0.node_config.0.service_account:                  "project-service-account@berlinsky-pf-gke-fixture-f466.iam.gserviceaccount.com"
      node_pool.0.node_count:                                     <computed>
      node_pool.0.version:                                        <computed>
      node_version:                                               <computed>
      private_cluster:                                            "false"
      project:                                                    "berlinsky-pf-gke-fixture-f466"
      region:                                                     "us-east4"
      remove_default_node_pool:                                   "false"
      subnetwork:                                                 "${replace(data.google_compute_subnetwork.gke_subnetwork.self_link, \"https://www.googleapis.com/compute/v1/\", \"\")}"
      zone:                                                       <computed>

After the first terraform apply, the relevant terraform plan is as follows:

  ~ module.example.module.gke.google_container_cluster.primary
      addons_config.#:                                       "1" => "1"
      addons_config.0.horizontal_pod_autoscaling.#:          "1" => "1"
      addons_config.0.horizontal_pod_autoscaling.0.disabled: "false" => "false"
      addons_config.0.http_load_balancing.#:                 "1" => "1"
      addons_config.0.http_load_balancing.0.disabled:        "false" => "false"
      addons_config.0.kubernetes_dashboard.#:                "1" => "1"
      addons_config.0.kubernetes_dashboard.0.disabled:       "true" => "true"
      addons_config.0.network_policy_config.#:               "1" => "1"
      addons_config.0.network_policy_config.0.disabled:      "true" => "false"

This change takes a fairly long time to apply (~13 min, just now), and does not result in a permadiff.

I'm continuing to dig in a bit, but it's looking like an API-level problem.

Jberlinsky commented 5 years ago

I've created a cluster via the API with the following payload:

{
  "cluster": {
    "addonsConfig": {
      "horizontalPodAutoscaling": {
        "disabled": false
      },
      "httpLoadBalancing": {
        "disabled": false
      },
      "kubernetesDashboard": {
        "disabled": true
      },
      "networkPolicyConfig": {
        "disabled": false
      }
    },
    "binaryAuthorization": {
      "enabled": false
    },
    "initialClusterVersion": "1.11.6-gke.2",
    "ipAllocationPolicy": {
      "clusterSecondaryRangeName": "cft-gke-test-pods-938k",
      "servicesSecondaryRangeName": "cft-gke-test-services-938k",
      "useIpAliases": true
    },
    "legacyAbac": {
      "enabled": false
    },
    "locations": [
      "us-east4-a",
      "us-east4-c",
      "us-east4-b"
    ],
    "loggingService": "logging.googleapis.com",
    "maintenancePolicy": {
      "window": {
        "dailyMaintenanceWindow": {
          "startTime": "05:00"
        }
      }
    },
    "monitoringService": "monitoring.googleapis.com",
    "name": "stub-domains-cluster-12s2",
    "network": "projects/berlinsky-pf-gke-fixture-f466/global/networks/cft-gke-test-938k",
    "nodePools": [
      {
        "config": {
          "serviceAccount": "project-service-account@berlinsky-pf-gke-fixture-f466.iam.gserviceaccount.com"
        },
        "name": "default-pool"
      }
    ],
    "subnetwork": "projects/berlinsky-pf-gke-fixture-f466/regions/us-east4/subnetworks/cft-gke-test-938k"
  }
}
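(For completeness, a hedged sketch of how a payload like the above can be posted to the v1 clusters.create endpoint; the project ID, location, and file name are placeholders and may differ from what was actually used here.)

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @cluster.json \
  "https://container.googleapis.com/v1/projects/PROJECT_ID/locations/us-east4/clusters"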

Once the cluster is created, I query it with gcloud, and find the same problem:

╰ gcloud --project=berlinsky-pf-gke-fixture-f466 container clusters --zone=us-east4 describe stub-domains-cluster-12s2 --format=json | jq '.addonsConfig.networkPolicyConfig'
{
  "disabled": true
}

Looks like an API-level problem to me, unfortunately.

morgante commented 5 years ago

@Jberlinsky Can you file an internal bug with the details and I'll route it appropriately?

Jberlinsky commented 5 years ago

I've filed an internal bug, and submitted #71 to make the README more explicit on this matter.

morgante commented 5 years ago

Closing in favor of #72.