solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

rolloutJob fails with "argument list too long" error #7060

Closed gektor0856 closed 1 year ago

gektor0856 commented 2 years ago

Gloo Edge Version

1.11.x

Kubernetes Version

1.22.x

Describe the bug

I'm trying to upgrade gloo edge 1.11.19 to the latest (1.11.32) and getting an error in the rolloutJob: `standard_init_linux.go:228: exec user process caused: argument list too long`

I'm using helm to upgrade gloo, with the gatewayProxies.gatewayProxy.gatewaySettings.customHttpsGateway option set to a proto descriptor binary, and it seems to me that this conflicts with the rollout job.

Rollout pod describe: kubectl describe po gloo-resource-rollout-wjq6g

Output:

```
Name:           gloo-resource-rollout-wjq6g
Namespace:      bai-infra
Priority:       0
Node:           minikube/192.168.49.2
Start Time:     Wed, 31 Aug 2022 12:10:52 +0400
Labels:         controller-uid=586960aa-bbc1-4393-91b3-3e8fda6e7862
                gloo=resource-rollout
                job-name=gloo-resource-rollout
Annotations:    <none>
Status:         Running
IP:             172.17.0.3
IPs:
  IP:  172.17.0.3
Controlled By:  Job/gloo-resource-rollout
Containers:
  kubectl:
    Container ID:  docker://11012b2f5038c402b7d7180c4362e9437df4ed16f9dcefcc042e00ce5281b7bc
    Image:         quay.io/solo-io/kubectl:1.11.32
    Image ID:      docker-pullable://quay.io/solo-io/kubectl@sha256:4adba949565599f611054ef50f6916e77b59e41bece05e8026fcaacbd76386da
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      # if validation webhook is enabled, wait for deployment rollout so validation service will be available
      kubectl rollout status deployment -n bai-infra gateway
      # apply Gloo Edge custom resources
      kubectl apply -f - <
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vprhd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-vprhd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:   <none>
Tolerations:      node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                  node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  17s                default-scheduler  Successfully assigned bai-infra/gloo-resource-rollout-wjq6g to minikube
  Normal   Pulled     16s (x2 over 17s)  kubelet            Container image "quay.io/solo-io/kubectl:1.11.32" already present on machine
  Normal   Created    16s (x2 over 17s)  kubelet            Created container kubectl
  Normal   Started    16s (x2 over 17s)  kubelet            Started container kubectl
  Warning  BackOff    14s (x2 over 15s)  kubelet            Back-off restarting failed container
```

Steps to reproduce the bug

  1. Set the helm option gloo.gatewayProxies.gatewayProxy.gatewaySettings.customHttpGateway.options.grpcJsonTranscoder.protoDescriptorBin with a large proto descriptor
  2. Try to upgrade gloo using helm (see the sketch below)
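
For context, here is a minimal reproduction sketch. The release name, chart reference, and proto file names are placeholders; only the value path is taken from step 1:

```
# Build a descriptor set large enough to matter (file names are hypothetical)
protoc --include_imports --descriptor_set_out=api.pb api/*.proto

# Inline the base64-encoded descriptor into the helm values. With a multi-MB
# descriptor, the rendered Gateway ends up inside the rollout job's
# `/bin/sh -c` command line, which is what exceeds the kernel's ARG_MAX.
helm upgrade gloo gloo/gloo -n bai-infra \
  --set gloo.gatewayProxies.gatewayProxy.gatewaySettings.customHttpGateway.options.grpcJsonTranscoder.protoDescriptorBin="$(base64 -w0 api.pb)"
```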

Expected Behavior

The rolloutJob should complete and the gateway should be upgraded.

Additional Context

No response

jenshu commented 1 year ago

this can be fixed by implementing https://github.com/solo-io/gloo/issues/7495 and then referencing the configMap (instead of the full proto descriptor string) in the Gateway

jenshu commented 1 year ago

An initial fix, by way of implementing https://github.com/solo-io/gloo/issues/7495, is available in:

To use this in helm, old helm values in the shape of:

gloo:
  gatewayProxies:
    gatewayProxy:
      gatewaySettings:
        customHttpGateway:
          options:
            grpcJsonTranscoder:
              protoDescriptorBin: <proto-desc-content...>

should be changed to:

gloo:
  gatewayProxies:
    gatewayProxy:
      gatewaySettings:
        customHttpGateway:
          options:
            grpcJsonTranscoder:
              protoDescriptorConfigMap:
                configMapRef:
                  name: my-config-map
                  namespace: gloo-system  # optional, defaults to install namespace
                key: my-key  # optional if there is only one key in the configmap
global:
  configMaps:
    - name: my-config-map
      namespace: gloo-system
      data:
        my-key: <proto-desc-content...>
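
If you would rather not ship the descriptor through helm at all, the referenced ConfigMap can presumably also be created out of band, leaving only the configMapRef in the values. A sketch, assuming the data value uses the same base64 encoding as protoDescriptorBin:

```
# Create the ConfigMap directly; names match the example above, and the
# base64 encoding of the value is an assumption carried over from
# protoDescriptorBin.
kubectl create configmap my-config-map -n gloo-system \
  --from-literal=my-key="$(base64 -w0 api.pb)"
```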

Will leave this issue open so we can explore a more generic fix that will avoid any future issues with the size of the gateway yaml.

jenshu commented 1 year ago

A more generic fix, which does not require the use of protoDescriptorConfigMap (i.e. you can keep using protoDescriptorBin), is available in:

Gloo OSS 1.14.0-beta7, 1.13.4, 1.12.42, and 1.11.50
Gloo EE 1.14.0-beta4, 1.13.5, 1.12.46, and 1.11.51

edit: per the comment below, there may still be issues with protoDescriptorBin, so we suggest using the protoDescriptorConfigMap as outlined above

gektor0856 commented 1 year ago

@jenshu Great! Thanks! But I hit a new problem after upgrading gloo edge oss 1.11.19 to 1.11.50: the rollout job failed with this error in the logs:

kubectl logs gloo-resource-rollout-fmflx
deployment "gateway" successfully rolled out
Warning: resource gateways/gateway-proxy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
The Gateway "gateway-proxy" is invalid: metadata.annotations: Too long: must have at most 262144 bytes
Describe log:

```
kubectl describe po gloo-resource-rollout-fmflx
Name:           gloo-resource-rollout-fmflx
Namespace:      bai-infra
Priority:       0
Node:           minikube/192.168.49.2
Start Time:     Tue, 07 Feb 2023 11:46:30 +0400
Labels:         controller-uid=eb80d582-a611-4aa5-97e7-ad15441e6c45
                gloo=resource-rollout
                job-name=gloo-resource-rollout
                sidecar.istio.io/inject=false
Annotations:    <none>
Status:         Running
IP:             172.17.0.2
IPs:
  IP:  172.17.0.2
Controlled By:  Job/gloo-resource-rollout
Containers:
  kubectl:
    Container ID:  docker://df3c448e74eb261cf8e497b69f30a7597522be79757476254503bde92b9d5f9f
    Image:         quay.io/solo-io/kubectl:1.11.50
    Image ID:      docker-pullable://quay.io/solo-io/kubectl@sha256:bb1e3aff17e6f562b581a6dcd0703ca7498d9f589267022738ff47eaa44f03b2
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      # if validation webhook is enabled, wait for deployment rollout so validation service will be available
      kubectl rollout status deployment -n bai-infra gateway
      # apply Gloo Edge custom resources
      if [ $HAS_CUSTOM_RESOURCES == "true" ]
      then
        kubectl apply -f /etc/gloo-custom-resources/custom-resources || exit $?
      else
        echo "no custom resources to apply"
      fi
      # remove the resource-policy annotations that were added temporarily by the gloo-resource-migration job during upgrade
      kubectl annotate upstreams.gloo.solo.io -n bai-infra -l app=gloo helm.sh/resource-policy- || exit $?
      kubectl annotate gateways.gateway.solo.io -n bai-infra -l app=gloo helm.sh/resource-policy- || exit $?
    State:          Running
      Started:      Tue, 07 Feb 2023 11:46:53 +0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 07 Feb 2023 11:46:35 +0400
      Finished:     Tue, 07 Feb 2023 11:46:39 +0400
    Ready:          True
    Restart Count:  2
    Environment:
      HAS_CUSTOM_RESOURCES:  Optional: false
    Mounts:
      /etc/gloo-custom-resources from custom-resource-config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cz9sp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  custom-resource-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      gloo-custom-resource-config
    Optional:  false
  kube-api-access-cz9sp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:   <none>
Tolerations:      node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                  node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  28s               default-scheduler  Successfully assigned bai-infra/gloo-resource-rollout-fmflx to minikube
  Warning  BackOff    19s               kubelet            Back-off restarting failed container
  Normal   Pulled     5s (x3 over 28s)  kubelet            Container image "quay.io/solo-io/kubectl:1.11.50" already present on machine
  Normal   Created    5s (x3 over 28s)  kubelet            Created container kubectl
  Normal   Started    5s (x3 over 27s)  kubelet            Started container kubectl
```

Can this be solved? What about server-side apply of the resources?
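
For reference, client-side kubectl apply stores the full object in the kubectl.kubernetes.io/last-applied-configuration annotation, which is what trips the 262144-byte limit above; server-side apply records field ownership in managedFields on the server and writes no such annotation. A sketch of the rollout job's apply step using it (whether the job can simply switch is the open question here):

```
# Server-side apply: ownership is tracked in managedFields server-side, so the
# oversized last-applied-configuration annotation is never created.
kubectl apply --server-side -f /etc/gloo-custom-resources/custom-resources
```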

jenshu commented 1 year ago

@gektor0856 we will have to look into this further. For now, could you try using the protoDescriptorConfigMap as outlined in this comment? That will make the Gateway smaller, so you shouldn't hit the annotation size limit.