giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Transit gateway is not associated for private clusters in a different AWS account #1801

Closed alex-dabija closed 1 year ago

alex-dabija commented 1 year ago

Issue

The transit gateway is not associated with the workload cluster's VPC if the cluster is private and created in a different AWS account.

The cluster was created on golem with the following config:

---
apiVersion: v1
data:
  values: |
    aws:
      region: eu-west-2
      awsClusterRole: default2
    bastion:
      enabled: false
    proxy:
      enabled: true
      http_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
      https_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
      no_proxy: "test-domain.com"
    clusterName: alextest36
    controlPlane:
      replicas: 1
    machinePools:
    - instanceType: m5.xlarge
      maxSize: 10
      minSize: 3
      name: machine-pool0
      rootVolumeSizeGB: 300
      availabilityZones:
      - eu-west-2a
      - eu-west-2b
      - eu-west-2c
    network:
      vpcCIDR: 10.20.0.0/16
      topologyMode: GiantSwarmManaged
      availabilityZoneUsageLimit: 3
      vpcMode: private
      apiMode: private
      dnsMode: private
      subnets:
      - cidrBlock: 10.20.0.0/18
      - cidrBlock: 10.20.64.0/18
      - cidrBlock: 10.20.128.0/18
    organization: giantswarm
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    giantswarm.io/cluster: alextest36
  name: alextest36-userconfig
  namespace: org-giantswarm
---
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
  name: alextest36
  namespace: org-giantswarm
spec:
  catalog: cluster
  config:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: cluster-aws
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: alextest36-userconfig
      namespace: org-giantswarm
  version: 0.20.2
---
apiVersion: v1
data:
  values: |
    clusterName: alextest36
    organization: giantswarm
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    giantswarm.io/cluster: alextest36
  name: alextest36-default-apps-userconfig
  namespace: org-giantswarm
---
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    giantswarm.io/cluster: alextest36
    giantswarm.io/managed-by: cluster
  name: alextest36-default-apps
  namespace: org-giantswarm
spec:
  catalog: cluster
  config:
    configMap:
      name: alextest36-cluster-values
      namespace: org-giantswarm
    secret:
      name: ""
      namespace: ""
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: default-apps-aws
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: alextest36-default-apps-userconfig
      namespace: org-giantswarm
  version: 0.12.3

The aws-network-topology-operator reports that AWS cannot find the subnet when it tries to create the transit gateway attachment:

1.6710229132236247e+09  INFO    Reconciling     {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"alextest36","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest36", "reconcileID": "6686176c-118c-48b7-9ae2-19166a0f4e64"}
1.6710229133360448e+09  INFO    transitgateway-registrar        Got TransitGateway      {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"alextest36","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest36", "reconcileID": "6686176c-118c-48b7-9ae2-19166a0f4e64", "transitGatewayID": "tgw-034e681b2d0288423"}
1.6710229134554205e+09  ERROR   transitgateway-registrar        Failed to create transit gateway attachments    {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"alextest36","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest36", "reconcileID": "6686176c-118c-48b7-9ae2-19166a0f4e64", "transitGatewayID": "tgw-034e681b2d0288423", "vpcID": "vpc-0e9d440dc6831956a", "error": "operation error EC2: CreateTransitGatewayVpcAttachment, https response error StatusCode: 400, RequestID: 93f6dbfb-b2de-454c-9913-7c93aabeb13f, api error InvalidSubnetID.NotFound: The subnet ID 'subnet-0545a88f6cbcbef1d' does not exist"}
1.6710229134735894e+09  INFO    Done reconciling        {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"alextest36","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest36", "reconcileID": "6686176c-118c-48b7-9ae2-19166a0f4e64"}
1.6710229134736302e+09  ERROR   Reconciler error        {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"alextest36","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest36", "reconcileID": "6686176c-118c-48b7-9ae2-19166a0f4e64", "error": "operation error EC2: CreateTransitGatewayVpcAttachment, https response error StatusCode: 400, RequestID: 93f6dbfb-b2de-454c-9913-7c93aabeb13f, api error InvalidSubnetID.NotFound: The subnet ID 'subnet-0545a88f6cbcbef1d' does not exist"}
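When triaging logs like the ones above, it can help to pull the AWS error code and the affected subnet ID out of each line. The following is a minimal sketch, assuming the line layout seen here (epoch timestamp, level, free-text message, trailing JSON object). `parse_line` and the abbreviated `SAMPLE` line are illustrative only and are not part of the operator.

```python
import json
import re
from datetime import datetime, timezone

# Assumed layout: <epoch seconds> <LEVEL> <free-text message> <JSON object>
LINE_RE = re.compile(
    r"^(?P<ts>\d+(?:\.\d+)?(?:e[+-]?\d+)?)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<msg>.*?)\s+(?P<fields>\{.*\})\s*$"
)
# Pulls the AWS error code and the quoted subnet ID out of the "error" field.
ERR_RE = re.compile(r"api error (?P<code>[\w.]+): .*?'(?P<subnet>subnet-[0-9a-f]+)'")


def parse_line(line):
    """Return a dict for one log line, or None if it does not match the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = json.loads(m.group("fields"))
    record = {
        "time": datetime.fromtimestamp(float(m.group("ts")), tz=timezone.utc),
        "level": m.group("level"),
        "message": m.group("msg").strip(),
        "fields": fields,
    }
    err = ERR_RE.search(fields.get("error", ""))
    if err:
        record["aws_error_code"] = err.group("code")
        record["subnet_id"] = err.group("subnet")
    return record


# Abbreviated copy of the ERROR line above (some fields dropped for brevity).
SAMPLE = (
    '1.6710229134554205e+09 ERROR transitgateway-registrar '
    'Failed to create transit gateway attachments '
    '{"transitGatewayID": "tgw-034e681b2d0288423", "vpcID": "vpc-0e9d440dc6831956a", '
    '"error": "operation error EC2: CreateTransitGatewayVpcAttachment, '
    'https response error StatusCode: 400, api error InvalidSubnetID.NotFound: '
    "The subnet ID 'subnet-0545a88f6cbcbef1d' does not exist\"}"
)

if __name__ == "__main__":
    rec = parse_line(SAMPLE)
    print(rec["level"], rec["aws_error_code"], rec["subnet_id"])
```

Lines that do not follow the assumed layout return `None`, so the helper can be applied to a mixed log stream without raising.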

The subnet subnet-0545a88f6cbcbef1d does exist, however: [image]

All the operations in the operator are done from the perspective of the management cluster. There are only 2 places where the GetAWSClusterRoleIdentity function is called:

The transit gateway attachment needs to be executed from the workload cluster's perspective.
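The failure mode can be illustrated with a toy model: a client bound to one account only sees that account's subnets, so the same attachment call fails from the management cluster's account and succeeds from the workload cluster's account. This is only a sketch. `FakeEC2`, `client_for`, and both account IDs are made up for illustration, and no real AWS SDK call is shown.

```python
class NotFound(Exception):
    """Stands in for the InvalidSubnetID.NotFound API error."""


class FakeEC2:
    """Hypothetical stand-in for an EC2 client bound to one account's credentials."""

    def __init__(self, account_id, subnets_by_account):
        self.account_id = account_id
        self._subnets = subnets_by_account

    def create_tgw_vpc_attachment(self, tgw_id, vpc_id, subnet_ids):
        # A client can only resolve subnets that live in its own account.
        visible = self._subnets.get(self.account_id, set())
        for subnet_id in subnet_ids:
            if subnet_id not in visible:
                raise NotFound(f"The subnet ID '{subnet_id}' does not exist")
        return {"transitGatewayID": tgw_id, "vpcID": vpc_id, "account": self.account_id}


def client_for(account_id, subnets_by_account):
    """Hypothetical helper. In the operator this would correspond to assuming
    the role referenced by the cluster's AWSClusterRoleIdentity."""
    return FakeEC2(account_id, subnets_by_account)


MC_ACCOUNT = "111111111111"  # made-up management cluster account ID
WC_ACCOUNT = "222222222222"  # made-up workload cluster account ID
SUBNETS = {WC_ACCOUNT: {"subnet-0545a88f6cbcbef1d"}}

if __name__ == "__main__":
    args = ("tgw-034e681b2d0288423", "vpc-0e9d440dc6831956a", ["subnet-0545a88f6cbcbef1d"])
    try:
        client_for(MC_ACCOUNT, SUBNETS).create_tgw_vpc_attachment(*args)
    except NotFound as err:
        print("management cluster account:", err)
    print("workload cluster account:", client_for(WC_ACCOUNT, SUBNETS).create_tgw_vpc_attachment(*args))
```

The model deliberately ignores everything except subnet visibility. In real AWS, an attachment created from another account also requires the transit gateway to be shared with that account through AWS Resource Access Manager.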

Resources

AndiDog commented 1 year ago

Note to the implementer – fix the misleading wording here

identityRef:
  kind: AWSClusterRoleIdentity
  name: {{ .Values.aws.awsClusterRole }}

(role vs. identity)

AndiDog commented 1 year ago

I can see that opsctl open -i golem -a cloudprovider --workload-cluster andreas2 doesn't take me to the right AWS account, so I may need to fix that as well, if that is even possible.

AndiDog commented 1 year ago

opsctl open -i golem -a cloudprovider --workload-cluster andreas2 seems to work, but of course only once the AWSCluster exists; otherwise it shows a warning and uses the default cluster role identity. So no need to fix – works already.

AndiDog commented 1 year ago

Confirmed deployed in production clusters. AliYun image push fails, but we think this does not matter right now and I created a follow-up: https://github.com/giantswarm/roadmap/issues/1822.

mnitchev commented 1 year ago

Reopening, as the Transit Gateway needs to be shared with the workload cluster (WC) account by the network-topology-operator.

mnitchev commented 1 year ago

Released in aws-network-topology-operator@v1.5.0