Open leokondrashov opened 1 year ago
@cvetkovic, can you please take a look and tell me if you have any input on these error messages?
These errors were caused by Knative terminating pods that were still handling requests. This was in the queue-proxy container logs of one of the pods:
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.872021599Z","message":"Attached drain handler from user-container"}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878217861Z","message":"Received TERM signal, attempting to gracefully shutdown servers."}
{"severity":"INFO","timestamp":"2023-03-02T16:33:32.878243962Z","message":"Sleeping 30s to allow K8s propagation of non-ready state"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.872890824Z","message":"Shutting down server: admin"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.874405272Z","message":"Shutting down server: metrics"}
{"severity":"INFO","timestamp":"2023-03-02T16:34:02.875615551Z","message":"Shutdown complete, exiting..."}
Four other pods were terminated at the same time (scaling down after the initial burst of pod deployments), and five gRPC failures were registered at 16:34:02.910 UTC.
Such errors might surface when we have a bursty invocation pattern: scaling down after the burst has ended might terminate a pod that is still handling a request.
The question is: how should the autoscaler detect pods that are still working and avoid terminating them before they finish? @anshalshukla, do you have anything on that matter from a theoretical standpoint?
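Within the ~30 s drain window shown in the queue-proxy log above, an in-flight request can only finish if the user container itself stops accepting new RPCs while letting the current ones complete. A minimal sketch of that for a plain grpc-go server (hypothetical: the port and service registration are placeholders, and the actual function image may handle termination differently):

```go
// Hypothetical function server that cooperates with queue-proxy's drain window.
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051") // placeholder port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	srv := grpc.NewServer()
	// pb.RegisterGreeterServer(srv, &server{}) // register the actual function handler here

	// On SIGTERM, stop accepting new RPCs but let in-flight ones finish.
	// queue-proxy keeps the pod non-ready for ~30 s before shutting down,
	// so requests shorter than the grace period should be able to complete.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		<-sigCh
		srv.GracefulStop()
	}()

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```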
What do you use for sandboxing? containerd or Firecracker?
Containerd. The cluster was set up by running ./scripts/setup/create_multinode.sh <nodes> with the default config in scripts/setup/setup.cfg.
I have personally never had any issues like what you posted with containerd, and I set up a cluster last week that still works without any issues. With Firecracker, I did have a lot of gRPC timeouts as the uVMs got killed for a reason I never managed to explain, probably a buggy implementation of the kubelet <-> Firecracker-containerd interface.
Another thing I am wondering about is whether your cluster might be too small to fit everything. I see that the PR that updates Knative to v1.9 plays with tainting, and that may have implications.
As for your question about how running pods are detected, I think the activator takes care of that, i.e., it has the list of all sandbox endpoints provided by the Kubernetes endpoints controller. The autoscaler just issues a decision to scale up or down to some number of instances.
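Concretely, what the activator consumes is roughly the Endpoints object of the revision's private service, which the endpoints controller keeps in sync with pod readiness. A rough sketch of inspecting it with client-go (the namespace and service name are placeholders; this is not Knative's actual activator code):

```go
// Illustrative only: dump the ready/not-ready addresses of a revision's
// Endpoints object, i.e. the data the routing layer relies on.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder namespace and service name for one revision of the function.
	ep, err := cs.CoreV1().Endpoints("default").Get(context.Background(),
		"trace-func-0-00001-private", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			fmt.Println("ready:", addr.IP) // only these should receive new requests
		}
		for _, addr := range subset.NotReadyAddresses {
			fmt.Println("not ready:", addr.IP)
		}
	}
}
```

A pod that has been told to terminate drops out of the ready set, so new requests should stop being routed to it while its drain window runs out.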
I don't think this has anything to do with the size of the cluster. For the attached trace, it used at most 25 pods with an 83m CPU limit each, so ~2 cores in total.
Version & Branch
main branch, default configuration in scripts/setup/setup.cfg
Expected Behaviour
No failures in calls
Actual Behaviour
Sometimes, ~10% of invocations fail with the error:
gRPC timeout exceeded for function trace-func-0-18129329275763004641 - rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/plain; charset=utf-8"
Steps to Reproduce the Problem
Additional Info
The error was observed on a number of clusters and traces. This issue seems to be connected to #129: before failing with "Invalid argument provided...", there are messages reporting many failures, which turn out to be due to this error. Sometimes the error changes to 504 (Gateway Timeout) or unexpected EOF. Most of the failures happen in the first minutes, but invocations keep failing throughout the experiment. Logs from the loader are attached as well: loader_trace.txt
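For reference, a minimal sketch of how these error variants can be distinguished on the client side, assuming a plain grpc-go client (illustrative only, not the loader's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"io"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classify maps an invocation error to a coarse failure category.
// Gateway responses such as 502/504 typically surface as codes.Unavailable
// with the HTTP status text in the message; "unexpected EOF" may show up
// either inside such a status or as a bare transport error.
func classify(err error) string {
	if err == nil {
		return "success"
	}
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unavailable:
			return fmt.Sprintf("backend unavailable (possibly terminated mid-request): %s", s.Message())
		case codes.DeadlineExceeded:
			return "client-side timeout"
		default:
			return fmt.Sprintf("grpc error %s: %s", s.Code(), s.Message())
		}
	}
	if errors.Is(err, io.ErrUnexpectedEOF) {
		return "connection closed mid-stream"
	}
	return "non-grpc error: " + err.Error()
}

func main() {
	// Example with a synthetic status error resembling the one in the logs.
	err := status.Error(codes.Unavailable,
		"unexpected HTTP status code received from server: 502 (Bad Gateway)")
	fmt.Println(classify(err))
}
```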