@czka Could you kindly run the tests again and see if you're able to consistently reproduce this error? I just ran the `stub-domains` test and was unable to reproduce this issue.
Tried the `kitchen destroy` -> `create` -> `converge` -> `verify` cycle 2 more times (once for `stub-domains-local` alone, and once more for the whole set of tests). The same issue keeps cropping up:

```
expected: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{}}
got: {"horizontalPodAutoscaling"=>{}, "httpLoadBalancing"=>{}, "kubernetesDashboard"=>{"disabled"=>true}, "networkPolicyConfig"=>{"disabled"=>true}}
```
@Jberlinsky I have removed `.kitchen/` and all the `test/fixtures/*/.terraform/` dirs to start afresh. `make test_integration_docker` completed fine in around 90 minutes. But then I was able to reproduce the error again by running `make docker_run`, `kitchen create stub-domains`, `kitchen converge stub-domains`, `kitchen verify stub-domains`.
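(For anyone else trying to reproduce: the same sequence spelled out as shell commands, run from the repo root; the kitchen commands are presumably run inside the container that `make docker_run` starts.)

```sh
# Reproduction steps from the comment above.
make docker_run                   # start the kitchen-terraform test container

# Inside the container:
kitchen create stub-domains
kitchen converge stub-domains     # single converge
kitchen verify stub-domains       # fails on addonsConfig.networkPolicyConfig
```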
@czka Thanks for the update; I'll continue to try to reproduce. Can you tell me if running `kitchen converge stub-domains` twice, instead of just once, resolves the problem?
BTW, is it as expected that `root` owns the following dirs created during the tests? I ran them as a regular user.

```
$ ls -ld .kitchen
drwxr-xr-x 3 root root 4096 Jan 14 16:41 .kitchen
$ find . -type d -name .terraform | xargs ls -ld
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/deploy_service/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/node_pool/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/shared_vpc/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/simple_regional/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/simple_zonal/.terraform
drwxr-xr-x 4 root root 4096 Jan 14 13:41 ./test/fixtures/stub_domains/.terraform
$ ls -ld test/fixtures/stub_domains/terraform.tfstate.d/
drwxr-xr-x 3 root root 4096 Jan 14 16:41 test/fixtures/stub_domains/terraform.tfstate.d/
```
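(My guess is that the test container started by `make docker_run` runs as root and writes into the bind-mounted repo, hence the root-owned directories. If so, something like this on the host hands the files back; purely a hypothetical cleanup, not part of the test flow.)

```sh
# Hypothetical cleanup for root-owned artifacts left behind by the root-running test container.
sudo chown -R "$(id -u):$(id -g)" .kitchen test/fixtures/*/.terraform test/fixtures/*/terraform.tfstate.d
```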
@Jberlinsky `kitchen verify stub-domains` passed without errors after running `kitchen converge stub-domains` a 2nd time.
This seems like an issue within the API rather than the Terraform configuration. We may simply need to emphasize that a configuration using the module has to be applied twice to obtain the expected results. We should also consider raising a ticket against the provider.
Agreed -- I'll file a PR today/tomorrow to emphasize the need to run `kitchen converge` twice.

Thanks for reporting this, @czka!
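(Until that PR lands, the workaround discussed here boils down to converging/applying twice before verifying; roughly:)

```sh
# Workaround: run the converge (terraform apply) twice before verifying.
kitchen converge stub-domains
kitchen converge stub-domains
kitchen verify stub-domains

# Outside of Test Kitchen, the same idea for a plain configuration using the module:
terraform apply
terraform apply   # second apply settles addons_config.network_policy_config
```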
I'd like to do some more digging on why the converge needs to happen twice; it's probably not an issue with the API so much as the provider and/or our config.
In particular, we should note what the plan actually shows for the second converge.
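(A sketch of how to capture that second plan, assuming it is run from the converged fixture directory; the workspace name is whatever kitchen-terraform created, so check `terraform workspace list` first.)

```sh
# After the first converge, inspect what a second apply would change for the stub_domains fixture.
cd test/fixtures/stub_domains
terraform workspace list             # select the workspace kitchen created, if any
terraform plan -out=second.tfplan    # expected to show the networkPolicyConfig flip
terraform show second.tfplan
```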
@morgante Only now I noticed a double `kitchen converge` is already hardcoded in the `Makefile`: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blame/master/Makefile#L79. A well-known issue with a well-known workaround ;).
@Jberlinsky Do you know why we need to converge twice though?
I don't recall the specific reason offhand, but the double converge has been present in this repository for quite some time (see https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/5cb2b8b31491db5521fdaa5a2ae105fc01e44bb6/test/integration/gcloud/run.sh#L333).
I'll dig into this a bit.
For the `google_container_cluster.primary` resource, the initial `plan` is as follows:
```
+ module.example.module.gke.google_container_cluster.primary
id: <computed>
additional_zones.#: <computed>
addons_config.#: "1"
addons_config.0.horizontal_pod_autoscaling.#: "1"
addons_config.0.horizontal_pod_autoscaling.0.disabled: "false"
addons_config.0.http_load_balancing.#: "1"
addons_config.0.http_load_balancing.0.disabled: "false"
addons_config.0.kubernetes_dashboard.#: "1"
addons_config.0.kubernetes_dashboard.0.disabled: "true"
addons_config.0.network_policy_config.#: "1"
addons_config.0.network_policy_config.0.disabled: "false"
cluster_ipv4_cidr: <computed>
enable_binary_authorization: "false"
enable_kubernetes_alpha: "false"
enable_legacy_abac: "false"
enable_tpu: "false"
endpoint: <computed>
instance_group_urls.#: <computed>
ip_allocation_policy.#: "1"
ip_allocation_policy.0.cluster_ipv4_cidr_block: <computed>
ip_allocation_policy.0.cluster_secondary_range_name: "${var.ip_range_pods}"
ip_allocation_policy.0.services_ipv4_cidr_block: <computed>
ip_allocation_policy.0.services_secondary_range_name: "${var.ip_range_services}"
logging_service: "logging.googleapis.com"
maintenance_policy.#: "1"
maintenance_policy.0.daily_maintenance_window.#: "1"
maintenance_policy.0.daily_maintenance_window.0.duration: <computed>
maintenance_policy.0.daily_maintenance_window.0.start_time: "05:00"
master_auth.#: <computed>
master_version: <computed>
min_master_version: "1.11.5-gke.5"
monitoring_service: "monitoring.googleapis.com"
name: "${var.name}"
network: "${replace(data.google_compute_network.gke_network.self_link, \"https://www.googleapis.com/compute/v1/\", \"\")}"
network_policy.#: <computed>
node_config.#: <computed>
node_pool.#: "1"
node_pool.0.initial_node_count: <computed>
node_pool.0.instance_group_urls.#: <computed>
node_pool.0.management.#: <computed>
node_pool.0.max_pods_per_node: <computed>
node_pool.0.name: "default-pool"
node_pool.0.name_prefix: <computed>
node_pool.0.node_config.#: "1"
node_pool.0.node_config.0.disk_size_gb: <computed>
node_pool.0.node_config.0.disk_type: <computed>
node_pool.0.node_config.0.guest_accelerator.#: <computed>
node_pool.0.node_config.0.image_type: <computed>
node_pool.0.node_config.0.local_ssd_count: <computed>
node_pool.0.node_config.0.machine_type: <computed>
node_pool.0.node_config.0.oauth_scopes.#: <computed>
node_pool.0.node_config.0.preemptible: "false"
node_pool.0.node_config.0.service_account: "project-service-account@berlinsky-pf-gke-fixture-f466.iam.gserviceaccount.com"
node_pool.0.node_count: <computed>
node_pool.0.version: <computed>
node_version: <computed>
private_cluster: "false"
project: "berlinsky-pf-gke-fixture-f466"
region: "us-east4"
remove_default_node_pool: "false"
subnetwork: "${replace(data.google_compute_subnetwork.gke_subnetwork.self_link, \"https://www.googleapis.com/compute/v1/\", \"\")}"
zone: <computed>
```
After the first `terraform apply`, the relevant `terraform plan` is as follows:
```
~ module.example.module.gke.google_container_cluster.primary
addons_config.#: "1" => "1"
addons_config.0.horizontal_pod_autoscaling.#: "1" => "1"
addons_config.0.horizontal_pod_autoscaling.0.disabled: "false" => "false"
addons_config.0.http_load_balancing.#: "1" => "1"
addons_config.0.http_load_balancing.0.disabled: "false" => "false"
addons_config.0.kubernetes_dashboard.#: "1" => "1"
addons_config.0.kubernetes_dashboard.0.disabled: "true" => "true"
addons_config.0.network_policy_config.#: "1" => "1"
addons_config.0.network_policy_config.0.disabled: "true" => "false"
```
This change takes a fairly long time to apply (~13 min, just now), and does not result in a permadiff.
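(To double-check the "no permadiff" claim, a follow-up plan after the second apply should report no changes; `terraform plan -detailed-exitcode` makes that easy to assert.)

```sh
# Exit code 0 = no changes pending (no permadiff), 2 = changes pending, 1 = error.
terraform plan -detailed-exitcode
echo "plan exit code: $?"
```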
I'm continuing to dig in a bit, but it's looking like an API-level problem.
I've created a cluster via the API with the following payload:
```
{
"cluster": {
"addonsConfig": {
"horizontalPodAutoscaling": {
"disabled": false
},
"httpLoadBalancing": {
"disabled": false
},
"kubernetesDashboard": {
"disabled": true
},
"networkPolicyConfig": {
"disabled": false
}
},
"binaryAuthorization": {
"enabled": false
},
"initialClusterVersion": "1.11.6-gke.2",
"ipAllocationPolicy": {
"clusterSecondaryRangeName": "cft-gke-test-pods-938k",
"servicesSecondaryRangeName": "cft-gke-test-services-938k",
"useIpAliases": true
},
"legacyAbac": {
"enabled": false
},
"locations": [
"us-east4-a",
"us-east4-c",
"us-east4-b"
],
"loggingService": "logging.googleapis.com",
"maintenancePolicy": {
"window": {
"dailyMaintenanceWindow": {
"startTime": "05:00"
}
}
},
"monitoringService": "monitoring.googleapis.com",
"name": "stub-domains-cluster-12s2",
"network": "projects/berlinsky-pf-gke-fixture-f466/global/networks/cft-gke-test-938k",
"nodePools": [
{
"config": {
"serviceAccount": "project-service-account@berlinsky-pf-gke-fixture-f466.iam.gserviceaccount.com"
},
"name": "default-pool"
}
],
"subnetwork": "projects/berlinsky-pf-gke-fixture-f466/regions/us-east4/subnetworks/cft-gke-test-938k"
}
}
```
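(For reference, a payload like this can be POSTed straight to the `projects.locations.clusters.create` REST method. The sketch below assumes the v1beta1 endpoint and a local `payload.json` holding the body above; it's one way to do it, not necessarily how the cluster here was created.)

```sh
# Create a cluster by POSTing the JSON body above (saved as payload.json, a hypothetical file name).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @payload.json \
  "https://container.googleapis.com/v1beta1/projects/berlinsky-pf-gke-fixture-f466/locations/us-east4/clusters"
```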
Once the cluster is created, I query it with `gcloud`, and find the same problem:
```
╰ gcloud --project=berlinsky-pf-gke-fixture-f466 container clusters --zone=us-east4 describe stub-domains-cluster-12s2 --format=json | jq '.addonsConfig.networkPolicyConfig'
{
  "disabled": true
}
```
Looks like an API-level problem to me, unfortunately.
@Jberlinsky Can you file an internal bug with the details and I'll route it appropriately?
I've filed an internal bug, and submitted #71 to make the README more explicit on this matter.
Closing in favor of #72.
As of current master (3f7527e583ffa07e6a06250844e07c38556a4488 when writing this), with the following `test/fixtures/shared/terraform.tfvars`: `make docker_build_kitchen_terraform`, `make docker_run`, `kitchen create` and `kitchen converge` passed fine. `kitchen verify` passed fine for `deploy_service`, `node_pool`, `shared_vpc`, `simple_regional` and `simple_zonal`. It failed at `stub_domains` as follows:

I have the `.kitchen/logs/kitchen.log` and `kitchen diagnose --all` output copied, so let me know if you need that.