inlets / inlets-operator

Get public TCP LoadBalancers for local Kubernetes clusters
https://docs.inlets.dev/reference/inlets-operator
MIT License
1.35k stars 98 forks source link

GCE provisioner does not show when the VM fails to be created #121

Closed maelvls closed 3 years ago

maelvls commented 3 years ago

Using the GCE provisioner, it seems like the tunnel VM sometimes never gets created, and the inlets-operator or events on the Tunnel objects do not give an indication as to what went wrong.

As I understand it, the provisioning is done in two steps:

After looking at how the provisioning is done, I noticed in gce.go that we don't check the status of the operation after creating the operation:

op, err := p.gceProvisioner.Instances.Insert(
    host.Additional["projectid"],
    host.Additional["zone"],
    instance).Do()

// This err is only meant for network or failure-related errors;
// VM provisioning errors are not returned in this error. Instead,
// we have to look at "op"
// and wait for the operation to complete.
if err != nil {
    return nil, fmt.Errorf("could not provision GCE instance: %s", err)
}

if op.HTTPStatusCode == http.StatusConflict {
    log.Println("Host already exists, status: conflict.")
}

The "actual" error is never displayed to the user, since we do not keep track of the provisioning operation op. If anything goes wrong, here is what the user sees in the inlets-operator controller logs (see full logs at the bottom of this issue):

error syncing kube-system/traefik-tunnel: could not get instance: googleapi: Error 404: The resource 'projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel' was not found, notFound, requeuing

Looking at the GCP audit logs, the reason for this seems to be that the instance name starts with a -:

invalid-parameter-gcp-creation-vm

I can think of two ways of remediating that:

Idea 2 could be done with some polling mechanism such as

err := wait.Poll(5*time.Second, 300*time.Second, func() (bool, error) {
    op, err := google.zoneOperationsService.Get(project, zone, name).Do()
    if err != nil {
        return false, err
    }
    klog.V(6).Infof("Waiting for operation to be completed... (status: %s)", op.Status)
    if op.Status == "DONE" {
        if op.Error == nil {
            return true, nil
        }
        var err []error
        for _, opErr := range op.Error.Errors {
            err = append(err, fmt.Errorf("%s", *opErr))
        }
        return false, fmt.Errorf("the following errors occurred: %+v", err)
    }
    return false, nil
})

The inlets-operator command arguments:

# kubectl describe pod -l app.kubernetes.io/name=inlets-operator
Containers:
  inlets-operator:
    Image:         ghcr.io/inlets/inlets-operator:0.12.1
    Command:
      ./inlets-operator
      -provider=gce
      -zone=europe-west2-b
      -region=lon1                     # ooops, reminder of when I used digitalocean
      -access-key-file=/var/secrets/inlets/inlets-access-key
      -license=REDACTED

The inlets-operator logs:

# kubectl logs -l app.kubernetes.io/name=inlets-operator --tail=-1
2021/05/19 11:50:24 Operator version: 0.12.1 SHA: b3a96cc192b97afc862087260e97ad3bc2f2491b
2021/05/19 11:50:24 Inlets client: ghcr.io/inlets/inlets-pro:0.8.3
2021/05/19 11:50:24 Using inlets PRO.
W0519 11:50:24.391900       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0519 11:50:24.392512       1 controller.go:121] Setting up event handlers
I0519 11:50:24.392542       1 controller.go:243] Starting Tunnel controller
I0519 11:50:24.392546       1 controller.go:246] Waiting for informer caches to sync
I0519 11:50:24.492658       1 controller.go:251] Starting workers
I0519 11:50:24.492679       1 controller.go:257] Started workers
2021/05/19 11:50:24 Creating tunnel for traefik-tunnel.kube-system
I0519 11:50:24.500790       1 controller.go:315] Successfully synced 'kube-system/traefik'
2021/05/19 11:50:24 Provisioning started with provider:gce host:traefik-tunnel
2021/05/19 11:50:25 Creating firewall exists, updating: inlets
2021/05/19 11:50:27 Provisioning call took: 2.532041s
2021/05/19 11:50:27 Status (traefik): provisioning, ID: traefik-tunnel|europe-west2-b|jetstack-mael-valais, IP: 
I0519 11:50:27.046555       1 controller.go:315] Successfully synced 'kube-system/traefik-tunnel'
I0519 11:50:27.349417       1 controller.go:315] Successfully synced 'kube-system/traefik-tunnel'
E0519 11:50:54.677483       1 controller.go:320] error syncing 'kube-system/traefik-tunnel': could not get instance: googleapi: Error 404: The resource 'projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel' was not found, notFound, requeuing
E0519 11:50:54.997725       1 controller.go:320] error syncing 'kube-system/traefik-tunnel': could not get instance: googleapi: Error 404: The resource 'projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel' was not found, notFound, requeuing
E0519 11:50:55.284851       1 controller.go:320] error syncing 'kube-system/traefik-tunnel': could not get instance: googleapi: Error 404: The resource 'projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel' was not found, notFound, requeuing

The traefik-tunnel object:

# kubectl describe tunnel -n kube-system traefik-tunnel
Name:         traefik-tunnel
Spec:
  auth_token:    foo
  Service Name:  traefik
Status:
  Host Id:      traefik-tunnel|europe-west2-b|jetstack-mael-valais
  Host Status:  provisioning
Events:
  Type    Reason  Age                From             Message
  ----    ------  ----               ----             -------
  Normal  Synced  33m (x2 over 33m)  inlets-operator  Tunnel synced successfully
alexellis commented 3 years ago

@utsavanand2 can you take a look please?

alexellis commented 3 years ago

Hi Mael,

Thanks for the feedback here. I've pinged Ustav who created and maintains the GCE code. Hopefully we can get you up and running again soon.

Using the GCE provisioner, it seems like the tunnel VM sometimes never gets created, and the inlets-operator or events on the Tunnel objects do not give an indication as to what went wrong.

The GCE provisioner has worked in the past, and I'm not aware of any regression, but it doesn't mean that it couldn't have issues.

One of the things that we saw with the Azure provider was a collision on naming, so we included an additional variable in the Host ID. It may be that you've stumbled on a way to create a clashing key or a configuration that we haven't tested yet.

Alex

alexellis commented 3 years ago

I think I prefer idea 1, we deliberately wanted the provisioning loop to be as fast as possible.

Idea 1 "asynchronously": remember what the operation name is e.g. in some annotation and to use that on each controller sync when the state is "provisioning";

1) Could we use the host ID field during provisioning and then overwrite it later with the status?

2) If we can prevent invalid names being used for the "op", would that also solve the issue?

3) Were you able to get past this issue?

4) Were you reliably able to reproduce the problem you faced here? Do you know what caused the insert ID to be invalid with a - prefix?

Alex

alexellis commented 3 years ago

@jsiebens have you run into this at all?

jsiebens commented 3 years ago

@jsiebens have you run into this at all?

Haven't seen that before, but I would be happy to jump in and have a closer look as soon as I can

jsiebens commented 3 years ago

@maelvls @alexellis

I've spent some time creating and destroying tunnels over and over again on GCE with the inlets-operator and inletsctl, and for the moment I'm not able to reproduce the issue (yet).

The highlighted insertId is not causing the issue I think, as it is not the instance name but rather an id generated by google for that operation. I've some successful logs with a similar id (starting with a -)

2021-05-20_12-08

@maelvls If possible, can you share some more information about your current setup?

maelvls commented 3 years ago

You are right, insertId: -gcyv1ed5hw4 was totally fine. The issue seemed to be lying somewhere else.

I recorded a screencast here. During this screencast, I discovered that I could see much more information on the operation by running

gcloud compute operations describe operation-1621584555501-5c2d2936766eb-545bd826-a4860591 --zone europe-west2-b

which would show:

endTime: '2021-05-21T01:09:22.107-07:00'
error:
  errors:
  - code: SERVICE_ACCOUNT_ACCESS_DENIED
    message: "The user does not have access to service account '1096313337776-compute@developer.gserviceaccount.com'.\
      \  User: 'give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com'.\
      \  Ask a project owner to grant you the iam.serviceAccountUser role on the service\
      \ account"
httpErrorMessage: BAD REQUEST
httpErrorStatusCode: 400
id: '4416278122182923331'
insertTime: '2021-05-21T01:09:16.463-07:00'
kind: compute#operation
name: operation-1621584555501-5c2d2936766eb-545bd826-a4860591
operationType: insert
progress: 100
selfLink: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b/operations/operation-1621584555501-5c2d2936766eb-545bd826-a4860591
startTime: '2021-05-21T01:09:16.464-07:00'
status: DONE
targetId: '4268845720410969156'
targetLink: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel
user: give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com
warnings:
- code: DEPRECATED_RESOURCE_USED
  data:
  - key: resource_name
    value: projects/debian-cloud/global/images/debian-9-stretch-v20191121
  - key: replacement_suggestion
    value: A suggested replacement is 'projects/debian-cloud/global/images/debian-9-stretch-v20191210'.
  message: The resource 'projects/debian-cloud/global/images/debian-9-stretch-v20191121'
    is deprecated. A suggested replacement is 'projects/debian-cloud/global/images/debian-9-stretch-v20191210'.
zone: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b

It seems like my service account give-me-my-cluster did not have the iam.serviceAccountUser role.

I added the role and recreated the tunnel and the error is now gone! 🎉🎉

maelvls commented 3 years ago

Back to the Idea 1 "synchronous" above: it would be nice to have those error messages shown in the tunnel events too, E.g.,

# k describe tunnel -n kube-system traefik-tunnel
Name:         traefik-tunnel
Events:
  Type    Reason      Age                   From             Message
  ----    ------      ----                  ----             -------
  Warning GCEErr      9m48s                 inlets-operator  The user does not have access to service account '1096313337776-compute@developer.gserviceaccount.com'. User: 'give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account.
  Normal  Synced      14s (x11 over 3m12s)  inlets-operator  Tunnel synced successfully
alexellis commented 3 years ago

Let's get the documentation updated so people create the SA in the correct way?

maelvls commented 3 years ago

I think the documentation is already correct, it is just that I forgot to RTFM 😅

From https://docs.inlets.dev/#/tools/inlets-operator?id=create-tunnel-servers-on-google-compute-engine-gce

gcloud projects add-iam-policy-binding --role roles/compute.admin ...
gcloud projects add-iam-policy-binding --role roles/iam.serviceAccountUser ...

Apologies for not having read the docs 😞