Closed maelvls closed 3 years ago
@utsavanand2 can you take a look please?
Hi Mael,
Thanks for the feedback here. I've pinged Ustav who created and maintains the GCE code. Hopefully we can get you up and running again soon.
Using the GCE provisioner, it seems like the tunnel VM sometimes never gets created, and the inlets-operator or events on the Tunnel objects do not give an indication as to what went wrong.
The GCE provisioner has worked in the past, and I'm not aware of any regression, but it doesn't mean that it couldn't have issues.
One of the things that we saw with the Azure provider was a collision on naming, so we included an additional variable in the Host ID. It may be that you've stumbled on a way to create a clashing key or a configuration that we haven't tested yet.
Alex
I think I prefer idea 1, we deliberately wanted the provisioning loop to be as fast as possible.
Idea 1 "asynchronously": remember what the operation name is e.g. in some annotation and to use that on each controller sync when the state is "provisioning";
1) Could we use the host ID field during provisioning and then overwrite it later with the status?
2) If we can prevent invalid names being used for the "op", would that also solve the issue?
3) Were you able to get past this issue?
4) Were you reliably able to reproduce the problem you faced here? Do you know what caused the insert ID to be invalid with a - prefix?
Alex
@jsiebens have you run into this at all?
@jsiebens have you run into this at all?
Haven't seen that before, but I would be happy to jump in and have a closer look as soon as I can
@maelvls @alexellis
I've spent some time creating and destroying tunnels over and over again on GCE with the inlets-operator and inletsctl, and for the moment I'm not able to reproduce the issue (yet).
The highlighted insertId
is not causing the issue I think, as it is not the instance name but rather an id generated by google for that operation. I've some successful logs with a similar id (starting with a -)
@maelvls If possible, can you share some more information about your current setup?
You are right, insertId: -gcyv1ed5hw4
was totally fine. The issue seemed to be lying somewhere else.
I recorded a screencast here. During this screencast, I discovered that I could see much more information on the operation by running
gcloud compute operations describe operation-1621584555501-5c2d2936766eb-545bd826-a4860591 --zone europe-west2-b
which would show:
endTime: '2021-05-21T01:09:22.107-07:00'
error:
errors:
- code: SERVICE_ACCOUNT_ACCESS_DENIED
message: "The user does not have access to service account '1096313337776-compute@developer.gserviceaccount.com'.\
\ User: 'give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com'.\
\ Ask a project owner to grant you the iam.serviceAccountUser role on the service\
\ account"
httpErrorMessage: BAD REQUEST
httpErrorStatusCode: 400
id: '4416278122182923331'
insertTime: '2021-05-21T01:09:16.463-07:00'
kind: compute#operation
name: operation-1621584555501-5c2d2936766eb-545bd826-a4860591
operationType: insert
progress: 100
selfLink: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b/operations/operation-1621584555501-5c2d2936766eb-545bd826-a4860591
startTime: '2021-05-21T01:09:16.464-07:00'
status: DONE
targetId: '4268845720410969156'
targetLink: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b/instances/traefik-tunnel
user: give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com
warnings:
- code: DEPRECATED_RESOURCE_USED
data:
- key: resource_name
value: projects/debian-cloud/global/images/debian-9-stretch-v20191121
- key: replacement_suggestion
value: A suggested replacement is 'projects/debian-cloud/global/images/debian-9-stretch-v20191210'.
message: The resource 'projects/debian-cloud/global/images/debian-9-stretch-v20191121'
is deprecated. A suggested replacement is 'projects/debian-cloud/global/images/debian-9-stretch-v20191210'.
zone: https://www.googleapis.com/compute/v1/projects/jetstack-mael-valais/zones/europe-west2-b
It seems like my service account give-me-my-cluster
did not have the iam.serviceAccountUser
role.
I added the role and recreated the tunnel and the error is now gone! 🎉🎉
Back to the Idea 1 "synchronous" above: it would be nice to have those error messages shown in the tunnel events too, E.g.,
# k describe tunnel -n kube-system traefik-tunnel
Name: traefik-tunnel
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning GCEErr 9m48s inlets-operator The user does not have access to service account '1096313337776-compute@developer.gserviceaccount.com'. User: 'give-me-my-cluster@jetstack-mael-valais.iam.gserviceaccount.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account.
Normal Synced 14s (x11 over 3m12s) inlets-operator Tunnel synced successfully
Let's get the documentation updated so people create the SA in the correct way?
I think the documentation is already correct, it is just that I forgot to RTFM 😅
gcloud projects add-iam-policy-binding --role roles/compute.admin ...
gcloud projects add-iam-policy-binding --role roles/iam.serviceAccountUser ...
Apologies for not having read the docs 😞
Using the GCE provisioner, it seems like the tunnel VM sometimes never gets created, and the inlets-operator or events on the Tunnel objects do not give an indication as to what went wrong.
As I understand it, the provisioning is done in two steps:
Instances.Insert
and check that there is no 409 Conflict in case the name clashes;After looking at how the provisioning is done, I noticed in gce.go that we don't check the status of the operation after creating the operation:
The "actual" error is never displayed to the user, since we do not keep track of the provisioning operation
op
. If anything goes wrong, here is what the user sees in the inlets-operator controller logs (see full logs at the bottom of this issue):Looking at the GCP audit logs, the reason for this seems to be that the instance name starts with a
-
:I can think of two ways of remediating that:
Idea 2 could be done with some polling mechanism such as
The
inlets-operator
command arguments:The
inlets-operator
logs:The
traefik-tunnel
object: