kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

Scaling from 0->N doesn't get initiated on time/at all #1062

Closed. hila3000 closed this issue 4 years ago.

hila3000 commented 4 years ago

We have a ScaledObject based on GCP Pub/Sub. The Pub/Sub subscription reached the number of messages at which the deployment should be scaled up, but the scale-up doesn't happen on time. Also, yesterday it didn't happen at all on the same ScaledObject until we restarted the KEDA operator: by then the subscription had reached 27,000 messages (the ScaledObject is defined to trigger at a value of 50).

Expected Behavior

Once the Pub/Sub subscription reaches the number of messages defined in the ScaledObject, the deployment should get scaled up from 0->N, subject to the polling interval of course.

Actual Behavior

First issue (doesn't scale on time): the deployment got scaled to 0 at timestamp 1598771940.466385 (07:19:00.466 AM GMT). Pub/Sub messages show an increase past 50 messages (the value defined in the ScaledObject) at 07:21 AM GMT; however, by 07:38 GMT the subscription had reached 972 messages, and only then did the deployment get scaled up, i.e. after 17 minutes of a high message count (relative to the ScaledObject configuration).

I would expect it to scale back up to N replicas after 7 minutes at the latest: 2 minutes of polling interval + 5 minutes of cooldown period.

Second issue (didn't scale at all): yesterday it didn't scale back up at all (after almost 4 hours of a high message count) until we restarted the KEDA operator. The message count on the subscription started increasing at 13:56 GMT.

The deployment got scaled to 0 at timestamp 1598709286.193081 (13:54:46.193 GMT). Pub/Sub messages show an increase past 50 messages (the value defined in the ScaledObject) at 13:56 GMT; however, by 17:55 GMT the subscription had reached 27,063 messages, and only after we restarted the KEDA operator at 17:53 GMT did it scale the pods up to 1 replica.

I would expect it to scale back up to N replicas by 14:03 GMT at the latest: 2 minutes of polling interval + 5 minutes of cooldown period. BUT it didn't scale at all.

LOGS - First issue (doesn't scale on time), from the KEDA logs. Scaling to 0 at 07:19 AM GMT:

```json
{"level":"info","ts":1598771940.466385,"logger":"scalehandler","msg":"Successfully scaled deployment to 0 replicas","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920"}
```

Any other references to this ScaledObject after that:

```json
{"level":"info","ts":1598771965.0714204,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"}
{"level":"info","ts":1598771965.071469,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"}
```

We have more of those "Reconciling ScaledObject" / "Detected ScaleType" entries, and then at 07:38:50.164 AM:

```json
{"level":"info","ts":1598773130.1646843,"logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920","Original Replicas Count":0,"New Replicas Count":1}
```

Nothing in between.

Second issue (didn't scale at all), from the KEDA logs: for yesterday, when it didn't scale at all, the only relevant errors I see in the logs are the two below. Worth mentioning that this error also shows up in the logs right before, i.e. right before it got successfully scaled to 0.

13:55:13 GMT:

```
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp: i/o timeout"
level: "error"
logger: "gcp_pub_sub_scaler"
msg: "error getting Active Status"
stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error
    /Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128
github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive
    keda/pkg/scalers/gcp_pub_sub_scaler.go:88
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment
    keda/pkg/handler/scale_loop.go:119
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale
    keda/pkg/handler/scale_loop.go:45
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop
    keda/pkg/handler/scale_loop.go:28"
ts: 1598720729.596498
```

14:10:04 GMT:

```
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": unexpected EOF"
level: "error"
logger: "gcp_pub_sub_scaler"
msg: "error getting Active Status"
stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error
    /Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128
github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive
    keda/pkg/scalers/gcp_pub_sub_scaler.go:88
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment
    keda/pkg/handler/scale_loop.go:119
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale
    keda/pkg/handler/scale_loop.go:45
github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop
    keda/pkg/handler/scale_loop.go:28"
ts: 1598710204.4026008
```

These two error types get reported quite often.

Steps to Reproduce the Problem

I don't know how this can be reproduced, but I'll share the configuration I have.

ScaledObject:

```yaml
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  creationTimestamp: "2020-08-23T10:26:34Z"
  finalizers:
    - finalizer.keda.k8s.io
  generation: 2
  labels:
    deploymentName: log-forwarding-535176920
  name: lf-scaledobject-535176920
  namespace: prod-us-log-forwarding
  resourceVersion: "85343895"
  selfLink: /apis/keda.k8s.io/v1alpha1/namespaces/prod-us-log-forwarding/scaledobjects/lf-scaledobject-535176920
  uid: 1e802b1a-e52b-11ea-8145-42010ab50004
spec:
  cooldownPeriod: 300
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 120
  scaleTargetRef:
    deploymentName: log-forwarding-535176920
  triggers:
    - type: gcp-pubsub
      metadata:
        credentials: GENERIC_GOOGLE_CREDENTIALS
        subscriptionName: lf-data-535176920-sub
        subscriptionSize: "50"
```

Specifications

zroubalik commented 4 years ago

I'm curious to see what DEBUG logs would tell us: https://github.com/kedacore/keda/tree/master#setting-log-levels

balchua commented 4 years ago

Do you think there is a need to close the client? I don't see it being closed.

hila3000 commented 4 years ago

@balchua Did you ask me that? I'm not sure whether closing the client is something that can be done via configuration; let me know. @zroubalik I've enabled debug now and am waiting for a new occurrence of the errors in the logs, to see if it prints more information about them.

balchua commented 4 years ago

Sorry, it's more for @zroubalik 😊

hila3000 commented 4 years ago

@zroubalik What do you say about @balchua's comment? We are running KEDA in production and these issues happen often; the only "fix" is to restart the keda-operator pod, to the point that we considered setting low memory limits just to have the pod restart on its own... Please advise.

As for debug: I found an occurrence in dev of a message that got ingested only after 47 minutes. At 07:00 GMT the message arrived on the sub (the ScaledObject's name is lf-scaledobject-1546020352). At 07:47 GMT the message got ingested, i.e. only then did keda-operator scale the pod back up from 0->1.

Relevant logs: there were more in between, but these are the relevant ones, i.e. entries about this specific sub or general errors.

{"level":"debug","ts":"2020-09-03T06:54:14.985Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"} {"level":"debug","ts":"2020-09-03T07:09:20.484Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"}

At this point, 07:09, it should have scaled the pod up, but it says "Deployment no change".

{"level":"error","ts":"2020-09-03T07:12:01.685Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 199.36.153.8:443: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:12:02.085Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": EOF","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:12:24.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

BTW, I'm expecting to see an error like the following about the relevant sub, lf-scaledobject-1546020352, but there isn't one (this example is from another sub):

```json
{"level":"debug","ts":"2020-09-03T07:12:21.684Z","logger":"scalehandler","msg":"Error getting scale decision","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-884715244","ScaledObject.ScaleType":"deployment","Error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout"}
```

{"level":"error","ts":"2020-09-03T07:12:24.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:13:12.483Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:41702->199.36.153.8:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:14:11.186Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:14:11.186Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:14:12.485Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated 
desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:14:43.087Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:17:50.687Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:18:01.884Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:18:22.885Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o 
timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:18:44.481Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:57870->199.36.153.8:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"} {"level":"error","ts":"2020-09-03T07:19:19.682Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

I have 10 more occurrences of:

```json
{"level":"error","ts":"2020-09-03T07:21:55.590Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
```

and then one of:

```json
{"level":"error","ts":"2020-09-03T07:22:15.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:58202->199.36.153.9:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
```

We defined Cloud DNS so we can bypass the firewall when communicating with Google APIs. The 199.x address belongs to that Google API; this is the first time we see it in the logs since we defined the firewall bypass, but it doesn't show up when we run dig/nslookup. Only via the error do we see that the bypass seems to be working. Or not, given the broken pipe.

And now an unexpected EOF as well; until now we had only encountered timeouts:

```json
{"level":"error","ts":"2020-09-03T07:22:37.382Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": unexpected EOF","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
```

Back to timeouts:

```json
{"level":"error","ts":"2020-09-03T07:24:03.181Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:24:03.384Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:25:01.087Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
```

Eventually it logs the error I expected to see earlier:

```json
{"level":"debug","ts":"2020-09-03T07:25:01.087Z","logger":"scalehandler","msg":"Error getting scale decision","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout"}
```

then- {"level":"debug","ts":"2020-09-03T07:25:01.087Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"}

Errors about timeouts or getting Active Status stopped being reported, and then eventually the scaling started working:

```json
{"level":"debug","ts":"2020-09-03T07:46:31.683Z","logger":"scalehandler","msg":"Scaler for scaledObject is active","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Scaler":{}}
{"level":"info","ts":"2020-09-03T07:46:32.182Z","logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352","Original Replicas Count":0,"New Replicas Count":1}
```

I hope this helps.

balchua commented 4 years ago

@hila3000 if you have a chance to build the master branch of KEDA, you can have a look at the Stackdriver client code here: https://github.com/kedacore/keda/blob/master/pkg/scalers/stackdriver_client.go

In the GetMetrics method, perhaps add a `defer s.metricsClient.Close()`. This is more of a hunch, as I don't know much about GCP Pub/Sub.
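To make the suggestion concrete, here is a minimal, self-contained sketch of the pattern using the cloud.google.com/go/monitoring/apiv3 client (the function name, filter construction, and return conventions are illustrative assumptions, not KEDA's actual code):

```go
package example

import (
	"context"
	"fmt"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3"
	"github.com/golang/protobuf/ptypes/timestamp"
	"google.golang.org/api/iterator"
	monitoringpb "google.golang.org/genproto/googleapis/monitoring/v3"
)

// numUndeliveredMessages queries Stackdriver for the newest value of a
// subscription's num_undelivered_messages metric over the last two minutes.
func numUndeliveredMessages(ctx context.Context, projectID, subscriptionID string) (int64, error) {
	client, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		return -1, err
	}
	// The suggested fix: without this Close(), every polling cycle leaks the
	// client's underlying gRPC connection.
	defer client.Close()

	filter := fmt.Sprintf("metric.type=%q AND resource.labels.subscription_id=%q",
		"pubsub.googleapis.com/subscription/num_undelivered_messages", subscriptionID)

	now := time.Now()
	it := client.ListTimeSeries(ctx, &monitoringpb.ListTimeSeriesRequest{
		Name:   "projects/" + projectID,
		Filter: filter,
		Interval: &monitoringpb.TimeInterval{
			StartTime: &timestamp.Timestamp{Seconds: now.Add(-2 * time.Minute).Unix()},
			EndTime:   &timestamp.Timestamp{Seconds: now.Unix()},
		},
	})

	series, err := it.Next()
	if err == iterator.Done {
		// No time series matched the filter in the queried window.
		return -1, fmt.Errorf("no time series matched filter %s", filter)
	}
	if err != nil {
		return -1, err
	}
	points := series.GetPoints()
	if len(points) == 0 {
		return -1, fmt.Errorf("time series contained no points")
	}
	return points[0].GetValue().GetInt64Value(), nil
}
```

If the client is created once per polling cycle and never closed, connections (and the token fetches behind them) pile up, which would fit the oauth2 i/o timeouts in the logs above.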

Instructions to build KEDA are in the README. This may help speed up troubleshooting your issue.

Just my 2 cents. 😊

zroubalik commented 4 years ago

I can see that there are timeouts when communicating with GCP Pub/Sub:

```
rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp: i/o timeout
```

This is the reason for the failure to scale.

@hila3000 btw, you can use ``` for multiline code blocks; it's better for the bigger chunks of code or logs (you can specify syntax highlighting as well: https://docs.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks).

And @balchua is probably right. I am no expert on GCP Pub/Sub, but the client should probably be closed; not doing so might be causing the problems.

zroubalik commented 4 years ago

@balchua haha, we commented at the same time.

I think Close should definitely be called, as per the client doc: https://github.com/googleapis/google-cloud-go/blob/4de835178d583bdbbe5c8ec10befda4f59373d32/monitoring/apiv3/metric_client.go#L196-L200

zroubalik commented 4 years ago

I'd add the close call in here: https://github.com/kedacore/keda/blob/v2/pkg/scalers/gcp_pub_sub_scaler.go#L94

hila3000 commented 4 years ago

Would this eliminate these timeouts?

```
rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp: i/o timeout
```

balchua commented 4 years ago

@zroubalik yeah, the Close() method; I totally forgot it. We need to make the Stackdriver client part of the scaler. Right now it's not stored.

@hila3000 if you have a chance to build KEDA master, you may want to try closing the client just to test it out.

hila3000 commented 4 years ago

@balchua I tried: I added the Close in the file and ran docker build, but when I set the new image for the keda-operator deployment, it says:

```
Error: failed to start container "keda-operator": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"keda\": executable file not found in $PATH": unknown
```

Any idea why?

balchua commented 4 years ago

@hila3000 did you run `make build`?

hila3000 commented 4 years ago

I did now... anything else before I hit the `docker build` command?

zroubalik commented 4 years ago

@hila3000 if you are building against the master branch, all you need to do is run `make publish` to build the images locally and push them to Docker Hub. Or you can run `make build` and run `docker push` manually.
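For the manual route, it could look roughly like this (the image name and tag are illustrative placeholders, not the Makefile's actual output; check the Makefile for the tag that `make build` produces):

```sh
$ make build
# retag the locally built operator image to a registry your cluster can pull from, then push it
$ docker tag kedacore/keda:master <your-dockerhub-user>/keda:gcpclose-test
$ docker push <your-dockerhub-user>/keda:gcpclose-test
```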

hila3000 commented 4 years ago

@zroubalik @balchua So let's recap: I cloned this repo, checked out master, made the change in pkg/scalers/stackdriver_client.go (added the `defer s.metricsClient.Close()` on row 48), and then tried to run `make publish`, but it fails:

```
hfish:keda (master)$ make publish
WARNING: GOROOT is not defined
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 operator-sdk generate k8s
/bin/sh: operator-sdk: command not found
make: *** [generate-api] Error 127
```

Any thoughts? How do I define GOROOT, and where? Why is the command not found?

Thanks a LOT for your help :)

hila3000 commented 4 years ago

Also, `make build` doesn't work:

```
hfish:keda (master)$ make build
WARNING: GOROOT is not defined
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 operator-sdk generate k8s
/bin/sh: operator-sdk: command not found
make: *** [generate-api] Error 127
```

It did work when I ran it on the v2 branch.

zroubalik commented 4 years ago

OK, I see.

You will need to install operator-sdk v0.11.0 https://github.com/operator-framework/operator-sdk/releases/tag/v0.11.0

If you have trouble with the installation etc., and if I can spare some time, I'll give it a shot and produce an image with the change for you.

hila3000 commented 4 years ago

@zroubalik The files in https://github.com/operator-framework/operator-sdk/releases/tag/v0.11.0 have no file extensions, so I'm not sure how they should be installed. For example, I've downloaded "operator-sdk-v0.11.0-x86_64-apple-darwin" because it looks suitable for Mac, but it's not a .dmg, so I can't really install it.

If you can create an image for me with the relevant fix, it'll be GREAT and MUCH APPRECIATED (if it can be before Saturday, that'll be super helpful too).

balchua commented 4 years ago

Oh, darwin is for macOS. For Linux, try this:

```sh
# Set the release version variable
$ RELEASE_VERSION=v0.11.0
# Linux
$ curl -LO https://github.com/operator-framework/operator-sdk/releases/download/${RELEASE_VERSION}/operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu
$ chmod +x operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu && sudo mkdir -p /usr/local/bin/ && sudo cp operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu /usr/local/bin/operator-sdk && rm operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu
```

I copied it from the operator-sdk installation guide.
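The same should work on macOS with the darwin binary: these releases are plain executables you put on your PATH, not installers. An untested sketch, with the filename taken from the release page above:

```sh
$ RELEASE_VERSION=v0.11.0
# macOS
$ curl -LO https://github.com/operator-framework/operator-sdk/releases/download/${RELEASE_VERSION}/operator-sdk-${RELEASE_VERSION}-x86_64-apple-darwin
$ chmod +x operator-sdk-${RELEASE_VERSION}-x86_64-apple-darwin && sudo mkdir -p /usr/local/bin/ && sudo cp operator-sdk-${RELEASE_VERSION}-x86_64-apple-darwin /usr/local/bin/operator-sdk && rm operator-sdk-${RELEASE_VERSION}-x86_64-apple-darwin
```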

zroubalik commented 4 years ago

@hila3000 you can try these images:

- KEDA Operator: docker.io/zroubalik/keda:gcpclose
- KEDA Metrics Adapter: docker.io/zroubalik/keda-metrics-adapter:gcpclose

It is a custom v1 build, which most likely won't be released; I plan to include this change in the upcoming v2. I am not sure whether it is actually going to help with your issue. I'd recommend switching to v2 anyway, as there were some internal performance improvements that could help with your issue as well.

KEDA v2 Beta is scheduled for the beginning of the next week.

hila3000 commented 4 years ago

@balchua Thanks for the explanation. I do need it for Mac though, so it's the same concept and commands you wrote, just with the darwin files? @zroubalik Thanks! I'll deploy these versions in dev for now and see if they solve our issues. I'll deploy the v2 beta on dev as well next week when it's released, and update with the results. Any estimate of when the v2 GA version will be released (for production deployment)?

hila3000 commented 4 years ago

@zroubalik The version seems to be OK. I'll give it more time to verify, but it also seems to stabilize the memory usage, which is great. A question though... what does this error mean?

{"level":"error","ts":"2020-09-05T19:13:01.738Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"Could not find stackdriver metric with filter metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\" AND resource.labels.subscription_id=\"lf-data-1272148120-sub\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:89\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

The sub does exist; its message count is 0, but still, it exists.

balchua commented 4 years ago

@hila3000 great news!

Regarding that error, it seems to be benign.

The error simply suggests that the stackdriver monitoring client did not return any timeseries data within that period and for that particular metric.

hila3000 commented 4 years ago

@balchua @zroubalik I think it's safe to say that the fix worked. Thanks a lot! We face a different issue now: now that scaling works, we've set the pollingInterval back to 5 seconds, and that introduced a quota issue.

From the keda-operator pod:

```json
{"level":"error","ts":"2020-09-06T07:57:58.138Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Time series queries' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:xxxxxxxxxxx'.","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:89\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
```

From the keda-metrics-apiserver pod:

```
E0906 09:13:37.646708 1 provider.go:94] keda_metrics_adapter/provider "msg"="error getting metric for scaler" "error"="rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Time series queries' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:xxxxxxxxxxx'." "ScaledObject.Name"="lf-scaledobject-1016483367" "ScaledObject.Namespace"="prod-us-log-forwarding" "Scaler"={}
```

I opened a ticket with GCP to see if we can increase the quota, but I still thought to ask: is there any way to handle this from KEDA's side?
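For a rough sense of the arithmetic (assuming, as the IsActive frames in the stack traces suggest, that each trigger's activity check issues at least one time-series query per polling cycle): pollingInterval: 5 means 60 / 5 = 12 Stackdriver queries per minute per ScaledObject from the operator alone, so N subscriptions cost roughly 12 x N queries per minute, on top of the queries the metrics adapter makes on behalf of the HPA. Raising pollingInterval lowers the rate proportionally.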

hila3000 commented 4 years ago

@balchua Regarding what you wrote about "within that period": I checked the metric on the GCP side and it consistently reports 0 messages, so I'm not sure why it says the Stackdriver monitoring client did not return any timeseries data.

balchua commented 4 years ago

The KEDA GCP Pub/Sub scaler performs a metrics query against Stackdriver within a time range (now minus 2 minutes). So if Stackdriver doesn't return any records, the scaler logs that "error". But I believe it is benign, from what I can tell. Does it not behave as you expect?

hila3000 commented 4 years ago

For now, it seems ok, no apparent issues.

balchua commented 4 years ago

Cool. Memory utilization also seems ok?

hila3000 commented 4 years ago

Yes :)

zroubalik commented 4 years ago

Glad it is solved now, closing.

hila3000 commented 3 years ago

@zroubalik are the fixes you made in:

- KEDA Operator: docker.io/zroubalik/keda:gcpclose
- KEDA Metrics Adapter: docker.io/zroubalik/keda-metrics-adapter:gcpclose

also present in KEDA 2.0.0 and onward?

zroubalik commented 3 years ago

@hila3000 yeah, the fix is present in 2.0.0 and forward.

hila3000 commented 3 years ago

Great, thanks!