owncloud / ocis

ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

Scaling oCIS in kubernetes causes requests to fail #8589

Open butonic opened 4 months ago

butonic commented 4 months ago

During loadtests we seem to be losing requests. We have identified several possible causes:

1. when a new pod is added it does not seem to receive traffic

This might be caused by clients not picking up the new service. One reason would be that the same grpc connection is reused. We need to make sure that every service uses a selector.Next() call to get a fresh client from the registry.
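A minimal sketch of the difference, using only the standard library. The `registry` type, `roundRobin`, `distinct` and the addresses are hypothetical stand-ins, not the go micro or reva API. A client that keeps the first address never sees a pod that registers later, while a client that asks the selector before every call does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// registry is a hypothetical in-memory stand-in for a service registry.
type registry struct {
	mu    sync.Mutex
	nodes map[string][]string
}

// register adds one address for a service, like a new pod coming up.
func (r *registry) register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.nodes == nil {
		r.nodes = map[string][]string{}
	}
	r.nodes[service] = append(r.nodes[service], addr)
}

// lookup returns a copy of the addresses currently registered.
func (r *registry) lookup(service string) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.nodes[service]...)
}

// roundRobin returns a next function that re-reads the registry on every
// call. The counter is not protected, so this sketch is single-goroutine only.
func roundRobin(r *registry, service string) func() (string, error) {
	i := 0
	return func() (string, error) {
		nodes := r.lookup(service)
		if len(nodes) == 0 {
			return "", errors.New("no nodes registered for " + service)
		}
		addr := nodes[i%len(nodes)]
		i++
		return addr, nil
	}
}

// distinct calls next n times and counts how many different addresses it saw.
func distinct(next func() (string, error), n int) int {
	seen := map[string]bool{}
	for j := 0; j < n; j++ {
		if addr, err := next(); err == nil {
			seen[addr] = true
		}
	}
	return len(seen)
}

func main() {
	reg := &registry{}
	reg.register("gateway", "10.0.0.1:9142")
	next := roundRobin(reg, "gateway")

	cached, _ := next()                      // a client that selects once and keeps the result
	reg.register("gateway", "10.0.0.2:9142") // a new pod is added

	fmt.Println("cached client still uses:", cached)
	fmt.Println("distinct pods reached via next():", distinct(next, 4))
}
```

The cached address stays on the first pod, while repeated `next()` calls reach both pods.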

2. when a pod is shut down because kubernetes moves it to a different node or it is descheduled it still receives traffic

This might be caused by latency. The client got a grpc client with selector.Next() but then the pod was killed before the request reached it. We should retry requests, but the grpc built in retry mechanism would need to know all possible services. That is not how the reva pool works.
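One way to picture a retry that does not depend on the grpc built in mechanism is to select a new node before every attempt. The following is a sketch under stated assumptions: `callWithRetry`, `rotate`, `demo` and `errUnavailable` are hypothetical names, the selector is simulated, and only calls that are safe to repeat should be retried this way.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable stands in for a gRPC UNAVAILABLE status in this sketch.
var errUnavailable = errors.New("unavailable")

// callWithRetry asks next for a node before every attempt, so a retry can
// land on a different pod than the one that just failed.
func callWithRetry(next func() (string, error), call func(addr string) error, attempts int, backoff time.Duration) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		addr, err := next()
		if err != nil {
			return "", err
		}
		lastErr = call(addr)
		if lastErr == nil {
			return addr, nil
		}
		if !errors.Is(lastErr, errUnavailable) {
			return addr, lastErr // not retryable
		}
		time.Sleep(backoff)
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// rotate is a hypothetical stand-in for a registry-backed selector.
func rotate(nodes ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		if len(nodes) == 0 {
			return "", errors.New("no nodes")
		}
		addr := nodes[i%len(nodes)]
		i++
		return addr, nil
	}
}

// demo simulates one pod that was killed after it had been selected.
func demo() (string, error) {
	dead := "10.0.0.1:9142"
	call := func(addr string) error {
		if addr == dead {
			return errUnavailable
		}
		return nil
	}
	return callWithRetry(rotate(dead, "10.0.0.2:9142"), call, 4, 10*time.Millisecond)
}

func main() {
	addr, err := demo()
	fmt.Println("served by:", addr, "error:", err)
}
```

In this simulation the first attempt hits the terminated pod and the second attempt succeeds on the remaining one.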

We could configure the grpc connection to retry requests:


    // The service config is parsed as plain JSON, so comments must stay outside the string.
    // "name" selects a single method or all methods under a service.
    // "RetryableStatusCodes" lists grpc status codes.
    var retryPolicy = `{
        "methodConfig": [{
            "name": [{"service": "grpc.examples.echo.Echo"}],
            "waitForReady": true,

            "retryPolicy": {
                "MaxAttempts": 4,
                "InitialBackoff": ".01s",
                "MaxBackoff": ".01s",
                "BackoffMultiplier": 1.0,
                "RetryableStatusCodes": [ "UNAVAILABLE" ]
            }
        }]
    }`

    conn, err := grpc.Dial(
        address,
        grpc.WithTransportCredentials(cred),
        grpc.WithDefaultServiceConfig(retryPolicy),
        grpc.WithDefaultCallOptions(
            grpc.MaxCallRecvMsgSize(maxRcvMsgSize),
        ),
        grpc.WithStatsHandler(otelgrpc.NewClientHandler(
            otelgrpc.WithTracerProvider(
                options.tracerProvider,
            ),
            otelgrpc.WithPropagators(
                rtrace.Propagator,
            ),
        )),
    )

but the retries would all go to the same IP. To actually send requests to different servers (client-side load balancing), we would have to add something like:

    // Make another ClientConn with round_robin policy.
    roundrobinConn, err := grpc.Dial(
        fmt.Sprintf("%s:///%s", exampleScheme, exampleServiceName),
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`), // This sets the initial balancing policy.
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )

The load balancing is based on name resolution.

We could add all this to the reva pool ... or we use a go micro grpc client that already implements a pool, integrates with the service registry and can do retry, backoff and whatnot. But this requires generating micro clients for the cs3 api using github.com/go-micro/generator/cmd/protoc-gen-micro

3. pod readiness and health endpoints do not reflect the actual state of the pod

Currently, the /healthz and /readyz endpoints are independent from the actual service implementation. But some services need some time to be ready or flush all requests on shutdown. This also needs to be investigated. For ready we could use a channel to communicate between the actual handler and the debug handler. And AFAIR @rhafer mentioned we need to take care of shutdown functions ... everywhere.
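The channel idea for readiness could look roughly like this. It is a sketch with the standard library only. `readyzHandler` and `probe` are hypothetical names, not the existing ocis debug handler. The service closes the channel once it has finished starting up, and the debug endpoint answers 503 until then.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// readyzHandler reports 503 until the ready channel is closed by the
// actual service, and 200 afterwards.
func readyzHandler(ready <-chan struct{}) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		select {
		case <-ready:
			w.WriteHeader(http.StatusOK)
		default:
			w.WriteHeader(http.StatusServiceUnavailable)
		}
	}
}

// probe sends one GET /readyz to the handler and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	ready := make(chan struct{})
	h := readyzHandler(ready)

	fmt.Println("before startup finished:", probe(h))
	close(ready) // the service handler would do this once it can serve requests
	fmt.Println("after startup finished:", probe(h))
}
```

A closed channel cannot be reopened, so reporting "not ready" again during shutdown would need a different primitive, for example an atomic flag.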

4. the services are needlessly split into separate pods

Instead of starting a pod for every service we should aggregate all processes that are involved in translating a request until it reaches a storage provider:

The services should use localhost or even unix sockets to talk to each other. Go can very efficiently use the resources in a pod and handle requests concurrently. We really only create a ton of overhead that stresses the kubernetes APIs and can be reduced.
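To illustrate the unix socket option, here is a small sketch in which two components in the same process tree talk over a socket file. It uses plain HTTP from the standard library instead of gRPC, and the function name, socket path and handler are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// roundTripOverUnixSocket serves one handler on a unix socket and calls it
// with a client that dials the socket file instead of a TCP address.
func roundTripOverUnixSocket() (string, error) {
	dir, err := os.MkdirTemp("", "ocis-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "svc.sock")

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "pong")
	})}
	go srv.Serve(ln)
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{
		// The host in the URL is ignored; every connection goes to the socket.
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", sock)
		},
	}}
	resp, err := client.Get("http://service.local/ping")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := roundTripOverUnixSocket()
	fmt.Println(body, err)
}
```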

rhafer commented 4 months ago

> And AFAIR @rhafer mentioned we need to take care of shutdown functions ... everywhere.

Hm, I don't remember what exactly I mentioned, but the biggest issues with shutdown were IIRC related to running ocis in single binary mode, because reva just does an os.Exit() from the first service finishing the SIGTERM/SIGQUIT/SIGINT signal handler, causing all other services to go away before finishing their shutdown, obviously.

When running as separate services there is already the possibility to do a more graceful shutdown for the reva services. By default reva does this only when shut down via SIGQUIT. When setting graceful_shutdown_timeout to something != 0 (in the reva config), the graceful shutdown can also be triggered by sending the default SIGTERM signal. (AFAIK we currently only expose graceful_shutdown_timeout in ocis for the storage-users service. For details: https://github.com/cs3org/reva/pull/4072, https://github.com/owncloud/ocis/pull/6840)
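As an illustration of what a graceful shutdown with a timeout means, the sketch below drains an in-flight request before the server stops. It uses `net/http` from the standard library, not reva, and `drain` is a hypothetical helper. In a real service the `Shutdown` call would be triggered from the signal handler.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// drain starts a server with a slow handler, sends one request and shuts the
// server down while that request is still in flight. It returns the response
// body the client received.
func drain(handlerDelayMs, gracefulTimeoutMs int) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	started := make(chan struct{})
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		close(started) // only one request is sent in this sketch
		time.Sleep(time.Duration(handlerDelayMs) * time.Millisecond)
		io.WriteString(w, "done")
	})}
	go srv.Serve(ln)

	type result struct {
		body string
		err  error
	}
	res := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err != nil {
			res <- result{"", err}
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		res <- result{string(b), err}
	}()

	<-started // the request has reached the handler

	// Shutdown stops accepting new connections and waits for active requests,
	// but gives up when the timeout expires.
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(gracefulTimeoutMs)*time.Millisecond)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return "", err
	}
	r := <-res
	return r.body, r.err
}

func main() {
	body, err := drain(50, 2000)
	fmt.Println(body, err)
}
```

With a timeout shorter than the handler delay, `Shutdown` returns a context error instead, which corresponds to the case where a pod is killed before it has flushed its requests.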

wkloucek commented 3 months ago

Please also see https://github.com/owncloud/enterprise/issues/6441:

oCIS doesn't benefit from the Kubernetes readiness probes behavior since it's not using Kubernetes Services to talk to each other. It uses the go micro service registry instead that knows / doesn't know about service readiness!??

For a specific Kubernetes environment with Cilium: If we could just configure hostnames / DNS names and not use the micro registry, we probably could leverage Cilium for load balancing: https://docs.cilium.io/en/stable/network/servicemesh/envoy-load-balancing/ (but it's in beta state)

Please also be aware of the "retry" concept: https://github.com/grpc/grpc-go/blob/master/examples/features/retry/README.md

micbar commented 2 months ago

@butonic @kobergj @dragonchaser

I think we should start working on 2)

butonic commented 1 month ago

What is the current state of this? We found a few bugs that explain why the search service was not scaling.

AFAICT we need to reevaluate this with a load test.

butonic commented 1 month ago

There are two options. 1. use native GRPC mechanisms to retry. 2. generate go micro clients for the CS3 API.

I'd vote the latter, because go micro already retries requests that time out and we want to move some services into ocis anyway.

A first step could be to generate go micro clients for the gateway so our graph service can use them to make CS3 calls against the gateway.

Another step would be to bring ocdav to ocis ... and then replace all grpc clients with go micro generated clients.

This is a ton of work. 😞

Note that using the native GRPC client and teaching it to retry services also requires configuring which calls should be retried.

Maybe we can just tell the grpc client in the reva pool to retry all requests?

Then we would still have two ways of making requests ... I'm not sure if we can use the native grpc retry mechanism, because we are using a single IP address that has been resolved with a micro selector. AFAICT the grpc client cannot use DNS to find the next IP.

Two worlds are colliding here ...

💥

butonic commented 1 month ago

Furthermore, I still want to be able to use an in memory transport, which we could use when embracing go micro further.

dragonchaser commented 1 month ago

> Furthermore, I still want to be able to use an in memory transport, which we could use when embracing go micro further.

https://github.com/owncloud/ocis/issues/9321 we have to discuss about this

dj4oC commented 1 week ago

Priority is increasing because multiple customers are affected. cc @dragotin

micbar commented 1 week ago

@dj4oC Can you please provide more info from the other customers too?

dj4oC commented 1 week ago

The customer (@grischdian & @blicknix) is reporting that after kubectl patch oCIS does not work because new requests still try to reach old pods. kubectl deploy on the other hand does work, because the registration is done from scratch (new pods all over). Unfortunately we cannot export logs due to security constraints. Deployment is done with OpenShift and ArgoCD.

butonic commented 1 week ago

um

    # kubectl deploy
    error: unknown command "deploy" for "kubectl"

Did you mean apply?

What MICRO_REGISTRY is configured?

butonic commented 1 week ago

@dj4oC @grischdian @blicknix the built in nats in the ocis helm chart cannot be scaled. you have to keep the replica at 1. if you need a redundant deployment use a dedicated nats cluster.

running multiple nats instances from the ocis chart causes a split brain situation where service lookups might return stale data. this is related to kubernetes scale up / down, but we tackled scale up and should pick up new pods properly.

This issue is tracking scale down problems, which we can address by retrying calls. Unfortunately, that is a longer path because we need to touch a lot of code.

kubectl apply vs kubectl patch vs argocd are a different issue.

blicknix commented 1 week ago

We only have one nats pod in the environment as it is only a dev environment. So no split brain. MICRO_REGISTRY is nats-js-kv

butonic commented 1 week ago

I think I have found a way to allow using the native grpc-go Thick Client round robin load balancing using the dns:/// transport and kubernetes headless services by taking into account the transport in the service metadata. It requires reading the transport from the service and registering services with a configurable transport.

This works without ripping out the go micro service registry but we need to test these changes with helm charts that use headless services and configure the grpc protocol to be dns.

🤔

hm and we may have to register the service with its domain name ... not the external ip ... urgh ... needs more work.
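A rough sketch of how a dial target could be derived from the transport stored in the service metadata. Everything here is an assumption for illustration: the `node` type, the `transport` metadata key and the example hostname are hypothetical and do not describe the actual reva or go micro data structures. The only fixed part is the `dns:///host:port` target syntax that grpc-go understands.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a registered service node.
type node struct {
	Address  string            // host:port, ideally a DNS name of a headless service
	Metadata map[string]string // assumed to carry a "transport" key
}

// grpcTarget returns the target string a grpc client would dial. For the
// "dns" transport the client resolves all pod IPs behind the name itself and
// can balance between them. Any other transport keeps the plain address.
func grpcTarget(n node) string {
	if n.Metadata["transport"] == "dns" {
		return "dns:///" + n.Address
	}
	return n.Address
}

func main() {
	headless := node{
		Address:  "gateway.ocis.svc.cluster.local:9142",
		Metadata: map[string]string{"transport": "dns"},
	}
	direct := node{Address: "10.0.0.1:9142"}

	fmt.Println(grpcTarget(headless))
	fmt.Println(grpcTarget(direct))
}
```

The `dns:///` target would be combined with the round_robin load balancing config shown in the first comment, and it only helps if the registered address is a name that resolves to all pods.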

butonic commented 1 week ago

@wkloucek @d7oc what were the problems when ocis was using the kubernetes service registry? AFAIK etcd was under heavy load.

@dragonchaser mentioned that it is possible to set up an etcd per namespace to shard the load.

when every school uses ~40 pods and every pod registers a watcher on the kubernetes api (provided by etcd) and reregisters itself every 30 sec that does create some load. I don't know if the go micro kubernetes registry subscribes to ALL pod events or if it is even possible to only receive events for a single namespace. I can imagine when every pod change needs to be propagated to every watcher that that might cause load problems.

So if you can shed some light on why the kubernetes registry was 'bad' I'd be delighted.

butonic commented 1 week ago

Our current guess is that the go micro kubernetes registry was registering services in the default namespace because of a bug. When testing on a single instance in a cluster things would be fine .... deploying more than one should break the deployment because services from multiple instances would 'see' each other. Which would explain the high load on the kubernetes API where every ocis pod is watching every ocis pod in every school ... 😞

wkloucek commented 1 week ago

I'd honestly refuse to use the "Kubernetes go-micro registry" in production even if you address some points that you described above.

I would not use it, since it introduces a tight coupling between the Kubernetes Control Plane and the workload (in this case oCIS). During Kubernetes Cluster operations (eg. updating Kubernetes itself or the infra below, especially with a setup like Gardener https://gardener.cloud), you may have situations where the Control Plane is "down" / the Kubernetes API is unreachable for some minutes. The workers / kubelets / containers in the CRI will keep running unchanged.

If you're using the "Kubernetes go-micro service registry" in this case, your workload will also be down after it reached the cache TTL since no more communication to the Kubernetes API is possible.

If you use eg. NATS as a go-micro service registry, it'll continue running and a control plane / Kubernetes API downtime will have zero impact (as long as there are no node failures, load changes, ...)

EDIT, just as an addition: the cluster DNS will also keep working while the Kubernetes API is down. So using DNS for service discovery is a valid way to go from my point of view.

wkloucek commented 1 week ago

Maybe @grischdian & @blicknix you could share your Kubernetes API availability / downtimes, too?

I guess you don't have 99.999% (26s downtime in a month) Kubernetes API availability, right?
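The 26s figure can be checked with a few lines, assuming a 30 day month. `monthlyDowntimeSeconds` is a helper made up for this check.

```go
package main

import "fmt"

// monthlyDowntimeSeconds returns the downtime in seconds that a given
// availability (for example 0.99999) allows within a 30 day month.
func monthlyDowntimeSeconds(availability float64) float64 {
	const secondsPerMonth = 30 * 24 * 3600
	return (1 - availability) * secondsPerMonth
}

func main() {
	for _, a := range []float64{0.999, 0.9999, 0.99999} {
		fmt.Printf("%.3f%% -> %.1f s per month\n", a*100, monthlyDowntimeSeconds(a))
	}
}
```

Five nines leave roughly 26 seconds per month, so a control plane outage of a few minutes already exceeds that budget many times over.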

wkloucek commented 4 days ago

I don't know how https://github.com/owncloud/ocis/issues/9535 may be related here

grischdian commented 12 hours ago

Well, since we are only the "user" of the OpenShift cluster, we have no numbers on the availability. I am working on this issue in parallel to figure out if argo is the reason. But what I can confirm: we have no nats scaled in the environment. I will come back with an update later today.

micbar commented 12 hours ago

@butonic is still on vacation.

We will have no progress on this within this week.