googleforgames / global-multiplayer-demo

This multiplayer demo is a cloud-first implementation of a global-scale, real-time multiplayer game that uses dedicated game servers, built on both Google Cloud products and open source gaming solutions.
Apache License 2.0

compute default service account does not have access to the global-game-images artifact registry repo #161

Open AlexBulankou opened 1 year ago

AlexBulankou commented 1 year ago

After following the demo steps, I noticed that many workloads initially fail to initialize because the compute default service account (project_number-compute@developer.gserviceaccount.com) cannot pull the images: it does not have permission to read from this registry. I fixed it manually, but the IAM binding might be worth including in the Terraform configuration.

abmarcum commented 1 year ago

I have created a PR that gives the compute service account the Artifact Registry Reader role.
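
For readers wondering what such a grant could look like, here is a minimal Terraform sketch. It is not the contents of the PR: the resource names, the `project` variable, and the `us` location are assumptions for illustration (the location is inferred from the `us-docker.pkg.dev` image path in the logs posted in this thread).

```hcl
# Sketch only: names, variables, and the location below are assumptions.
variable "project" {
  type        = string
  description = "Google Cloud project ID (hypothetical variable name)"
}

# Look up the project so its number can be used to build the
# compute default service account email.
data "google_project" "project" {
  project_id = var.project
}

# Grant read-only access on the global-game-images repository to the
# compute default service account
# (PROJECT_NUMBER-compute@developer.gserviceaccount.com).
resource "google_artifact_registry_repository_iam_member" "compute_reader" {
  project    = var.project
  location   = "us" # assumed multi-region, from the us-docker.pkg.dev image path
  repository = "global-game-images"
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:${data.google_project.project.number}-compute@developer.gserviceaccount.com"
}
```

Scoping the binding to the single repository, rather than the whole project, keeps the grant as narrow as the reported problem requires.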

zmerlynn commented 1 year ago

#162 to cross-reference ^

markmandel commented 1 year ago

Curious about something - is this a role that the compute instance would get by default when you enable GKE? I've never had to manually enable this on any project 🤔 so why did this happen here?

I'm wondering if #162 is actually just hiding a race condition on the GKE cluster, or am I off base?

markmandel commented 1 year ago

Actually, lemme rephrase -- should the GKE cluster have a depends_on on the K8s API being fully enabled?
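
As a rough illustration of that idea, an explicit dependency in Terraform might look like the sketch below. The resource names and cluster settings are hypothetical and are not taken from the demo's configuration.

```hcl
# Sketch only: resource names and cluster settings are hypothetical.

# Enable the Kubernetes Engine API for the project.
resource "google_project_service" "container" {
  service            = "container.googleapis.com"
  disable_on_destroy = false
}

# Create the cluster only after the API above has been enabled.
resource "google_container_cluster" "game_cluster" {
  name             = "example-game-cluster"
  location         = "us-central1"
  enable_autopilot = true

  depends_on = [google_project_service.container]
}
```

Here `depends_on` only tells Terraform to create the cluster after the API resource; whether that ordering would have avoided the pull failures reported here is exactly the open question.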

@AlexBulankou can you share the exact input and output you were getting please? Was it an error in the Terraform, a specific image? Something else?

AlexBulankou commented 1 year ago

I did not get any deployment errors, but the container could not pull the image before I added access explicitly. I'm not an expert, but intuitively I would be surprised if a newly created registry granted access to the compute default service account by default, because that would mean any cluster in the project has this access by default. I'm not sure this is the desired behavior for many organizations (vs. enabling a dedicated service account for a given registry).

markmandel commented 1 year ago

I think this is fixed now, but to confirm:

> I did not get any deployment errors, but the container could not pull the image

Sorry, not sure I'm following - containers don't pull images. Were you seeing Image Pull Backoffs in your GKE clusters? If so, which clusters? All of them? Some of them?

Which workloads, which Deployments, which clusters? Did some work and others not? Screenshots and details here would be very useful.

AlexBulankou commented 1 year ago

> Were you seeing Image Pull Backoffs in your GKE clusters? If so, which clusters? All of them? Some of them?

Yes. I was seeing it on game server workloads. I did not check whether it was on all of them or only some of them. Here's an example:

{
  "insertId": "wlovxtp97bip59w8",
  "jsonPayload": {
    "_GID": "0",
    "PRIORITY": "6",
    "_PID": "1790",
    "SYSLOG_IDENTIFIER": "kubelet",
    "_SYSTEMD_UNIT": "kubelet.service",
    "_MACHINE_ID": "c6aa1e71abcbcf4326b3fdcbf82684e1",
    "_SYSTEMD_INVOCATION_ID": "12fccd8e939940818873f98ba85e7ae0",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "_BOOT_ID": "0a3608b3b8544bf7b2f9fb860e66d631",
    "_UID": "0",
    "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
    "_SYSTEMD_SLICE": "system.slice",
    "_TRANSPORT": "stdout",
    "_COMM": "kubelet",
    "MESSAGE": "E0325 18:59:44.857798    1790 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"droidshooter\\\" with ImagePullBackOff: \\\"Back-off pulling image \\\\\\\"us-docker.pkg.dev/alexbu-gke-dev/global-game-images/droidshooter-server:b40b146a-8390-4569-abd7-abd5c509b1ec\\\\\\\"\\\"\" pod=\"default/droidshooter-bzlbw-qpjqv\" podUID=8d5da6d4-68d1-4c84-85d2-8407a9581739",
    "_HOSTNAME": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
    "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --kubeconfig=/var/lib/kubelet/kubeconfig --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --max-pods=32 --volume-plugin-dir=/home/kubernetes/flexvolume --node-status-max-images=25 --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --runtime-cgroups=/system.slice/containerd.service --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml \"--pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400\" --pod-infra-container-image=gke.gcr.io/pause:3.6@sha256:10008c36b4611b44db1229451675d5d7d86c7ddf4ef00f883d806a01547203f6",
    "_STREAM_ID": "1423d9289b624b53b7196a781694f575",
    "_EXE": "/home/kubernetes/bin/kubelet",
    "SYSLOG_FACILITY": "3"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
      "cluster_name": "global-game-us-central1-02",
      "location": "us-central1",
      "project_id": "alexbu-gke-dev"
    }
  },
  "timestamp": "2023-03-25T18:59:44.857881Z",
  "logName": "projects/alexbu-gke-dev/logs/kubelet",
  "receiveTimestamp": "2023-03-25T18:59:49.792357873Z"
}

{
  "insertId": "ezoa0uf99z2sz",
  "jsonPayload": {
    "kind": "Event",
    "apiVersion": "v1",
    "reportingInstance": "",
    "eventTime": null,
    "message": "Error: ImagePullBackOff",
    "reason": "Failed",
    "type": "Warning",
    "source": {
      "host": "gke-global-game-eu-west1-01-default-edbb1dd5-bdf8",
      "component": "kubelet"
    },
    "involvedObject": {
      "fieldPath": "spec.containers{droidshooter}",
      "uid": "8c645eaf-4f8f-4a9c-a467-e60a152aeb69",
      "name": "droidshooter-nmlfb-j9xwn",
      "kind": "Pod",
      "resourceVersion": "1774080",
      "apiVersion": "v1",
      "namespace": "default"
    },
    "lastTimestamp": "2023-03-25T18:59:45Z",
    "metadata": {
      "name": "droidshooter-nmlfb-j9xwn.174fbea1220c6344",
      "creationTimestamp": "2023-03-25T18:59:45Z",
      "namespace": "default",
      "resourceVersion": "38876",
      "managedFields": [
        {
          "fieldsV1": {
            "f:involvedObject": {},
            "f:type": {},
            "f:source": {
              "f:component": {},
              "f:host": {}
            },
            "f:lastTimestamp": {},
            "f:count": {},
            "f:reason": {},
            "f:firstTimestamp": {},
            "f:message": {}
          },
          "manager": "kubelet",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2023-03-25T18:59:45Z"
        }
      ],
      "uid": "5d3d7ffd-d898-47da-b451-a57097419750"
    },
    "reportingComponent": ""
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "project_id": "alexbu-gke-dev",
      "location": "europe-west1",
      "namespace_name": "default",
      "cluster_name": "global-game-eu-west1-01",
      "pod_name": "droidshooter-nmlfb-j9xwn"
    }
  },
  "timestamp": "2023-03-25T18:59:45Z",
  "severity": "WARNING",
  "logName": "projects/alexbu-gke-dev/logs/events",
  "receiveTimestamp": "2023-03-25T18:59:45.747550125Z"
}

markmandel commented 1 year ago

So if I look at my global-game-images registry, I see this permission on it: [image]

Looking at your project, I see the same permissions set on that registry, so the compute service account should be able to read from it.

Looking at the permissions on the compute service account in my project, I see:

[image]

Weirdly, when I look at your compute service account... it doesn't match this, it's missing the one highlighted here.

Since we've merged #162, is that fixed now?

I'm also wondering what extra org policies you may have in effect that are different from a "standard" GCP project.