teamhephy / workflow

Hephy Workflow - An open source fork of Deis Workflow - The open source PaaS for Kubernetes.

deis autoscale issue with k8s 1.9.6 (created with kops 1.9.0) #57

Closed: dmcnaught closed this issue 5 years ago

dmcnaught commented 6 years ago

When I create an HPA with deis autoscale, I get this error:

deis autoscale:set cmd --min=1 --max=4 --cpu-percent=10 -a test
kubectl -n test describe hpa
Name:                                                  test-cmd
Namespace:                                             test
Labels:                                                app=test
                                                       heritage=deis
                                                       type=cmd
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 18 Apr 2018 14:36:33 -0600
Reference:                                             Deployment/test-cmd
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 10%
Min replicas:                                          1
Max replicas:                                          4
Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: no matches for /, Kind=Deployment
Events:
  Type     Reason          Age   From                       Message
  ----     ------          ----  ----                       -------
  Warning  FailedGetScale  2s    horizontal-pod-autoscaler  no matches for /, Kind=Deployment

When I set up autoscaling with kubectl directly, I don't get that error:

kubectl -n test autoscale deployment test-cmd --cpu-percent=10 --min=1 --max=4
kubectl -n test describe hpa
Name:                                                  test-cmd
Namespace:                                             test
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 18 Apr 2018 14:38:37 -0600
Reference:                                             Deployment/test-cmd
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (0) / 10%
Min replicas:                                          1
Max replicas:                                          4
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is more than the maximum replica count

When I get the hpa with -oyaml, the difference seems to be that the kubectl-created one has apiVersion: extensions/v1beta1 under spec.scaleTargetRef, and when I add that line to the deis-created hpa, it doesn't fix the problem...

Cryptophobia commented 6 years ago

Sounds like the API must have changed. We will need to look at it and figure out the fix.

dmcnaught commented 6 years ago

K8s 1.9 made autoscaling/v2beta1 the default version (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.9.md/#other-notable-changes-11), but kops didn't seem to do this: https://github.com/kubernetes/kops/blob/master/docs/horizontal_pod_autoscaling.md#support-for-multiple-metrics. Since deis is creating an autoscaling/v1 definition, I'm confused as to why it would be failing. K8s shouldn't be making breaking changes to an existing API...

dmcnaught commented 6 years ago

@itskingori I saw you wrote the kops horizontal pod autoscaler docs, do you have time to review and comment on this?

itskingori commented 6 years ago

@dmcnaught If you're using kops 1.9.0, then the relevant flags should be set already. Make sure you have this though:

spec:
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubeAPIServer:
    runtimeConfig:
      autoscaling/v2beta1: "true"

That said, what version of metrics server do you have installed? I hope it's this guy.

dmcnaught commented 6 years ago

@itskingori - thanks for taking a look.

Since we don't want to use multiple metrics, I thought we could just continue using the autoscaling/v1 and not set that section on our cluster spec. Is that correct?

Yes - we are adding kubectl apply -f https://raw.githubusercontent.com/kubernetes/kops/master/addons/metrics-server/v1.8.x.yaml - although the docs seem to indicate that we shouldn't need it. I opened a ticket for that: https://github.com/kubernetes/kops/issues/5033

itskingori commented 6 years ago

Since we don't want to use multiple metrics, I thought we could just continue using the autoscaling/v1 and not set that section on our cluster spec. Is that correct?

Right. Correct.

the HPA controller was unable to get the target's current scale: no matches for /, Kind=Deployment

This does seem like a strange error to me. 🤔

itskingori commented 6 years ago

@dmcnaught Could you share the apiVersion of your deployment, i.e. test-cmd? This part ...

apiVersion: apps/v1beta2
kind: Deployment

And the HPA, this part ...

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment

And the api version enabled on your cluster? In my case I have this ...

$ kubectl api-versions
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1beta1
apps/v1beta1
apps/v1beta2
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
batch/v1
batch/v1beta1
batch/v2alpha1
certificates.k8s.io/v1beta1
extensions/v1beta1
metrics.k8s.io/v1beta1
networking.k8s.io/v1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1alpha1
rbac.authorization.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1

Make sure your deployment's apiVersion is enabled. 🤔

dmcnaught commented 6 years ago

Sure.

kubectl -n test get deployment -oyaml
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
kubectl -n test get hpa -oyaml
apiVersion: v1
items:
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    annotations:
      autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2018-04-20T14:34:45Z","reason":"FailedGetScale","message":"the
        HPA controller was unable to get the target''s current scale: no matches for
        /, Kind=Deployment"}]'
    creationTimestamp: 2018-04-20T14:34:15Z
    labels:
      app: test
      heritage: deis
      type: cmd
    name: test-cmd
    namespace: test
    resourceVersion: "322851"
    selfLink: /apis/autoscaling/v1/namespaces/test/horizontalpodautoscalers/test-cmd
    uid: e684fe85-44a7-11e8-bdc0-0eede6d2047e
  spec:
    maxReplicas: 4
    minReplicas: 1
    scaleTargetRef:
      kind: Deployment
      name: test-cmd
    targetCPUUtilizationPercentage: 10
kubectl api-versions
admissionregistration.k8s.io/v1beta1
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1beta1
apps/v1
apps/v1beta1
apps/v1beta2
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
batch/v1
batch/v1beta1
certificates.k8s.io/v1beta1
events.k8s.io/v1beta1
extensions/v1beta1
metrics.k8s.io/v1beta1
monitoring.coreos.com/v1
networking.k8s.io/v1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1

itskingori commented 6 years ago

@dmcnaught Your HPA doesn't have an apiVersion in the scaleTargetRef, see:

- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  spec:
    scaleTargetRef:
      kind: Deployment
      name: test-cmd

I'm thinking that since you're using 1.9, the default API group/version for Deployment is not extensions/v1beta1 (which is what you have on your deployment). Maybe try one of these:

Also check these out:

dmcnaught commented 6 years ago

Hi @itskingori, in my original post I mentioned that I added apiVersion: extensions/v1beta1 to the scaleTargetRef and it didn't help. I tried it again, though, and this time it worked (maybe I didn't wait long enough last time), so that works. For the first recommendation, do you mean change the deployment from apiVersion: extensions/v1beta1 to apiVersion: apps/v1beta2, or should I still keep it as extensions rather than apps?

dmcnaught commented 6 years ago

Thanks for figuring it out!

itskingori commented 6 years ago

@dmcnaught ...

For the first recommendation - do you mean change the deployment from apiVersion: extensions/v1beta1 to apiVersion: apps/v1beta2 ...

Either; theoretically both should work. What we're after here is that the HPA's scaleTargetRef uses the same API group/version that the Deployment is defined in. The HPA cannot look for the Deployment in an API group/version that it is not in.
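
To illustrate with the names used in this thread (a sketch, not the exact manifests): if the Deployment is served as extensions/v1beta1, the HPA's scaleTargetRef should name that same group/version explicitly, roughly like this:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: test-cmd
  namespace: test
spec:
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 10
  scaleTargetRef:
    apiVersion: extensions/v1beta1  # same group/version the Deployment is defined in
    kind: Deployment
    name: test-cmd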

... - or should I still keep it as extensions rather than apps...?

I recommend changing your Deployment to the apiVersion that your Kubernetes version uses by default. For 1.9, it looks like that's actually apps/v1 (see the sketch at the end of this comment).


In 1.8, Deployments graduated to apps/v1beta2:

In the 1.8 release, we introduce the apps/v1beta2 API group and version. This beta version of the core Workloads API contains the Deployment, DaemonSet, ReplicaSet, and StatefulSet kinds, and it is the version we plan to promote to GA in the 1.9 release provided the feedback is positive.

In 1.9, Deployments graduated to the apps/v1 group version, but apps/v1beta2 is still supported:

In the 1.9 release, we plan to introduce the apps/v1 group version. We intend to promote the apps/v1beta2 group version in its entirety to apps/v1 and to deprecate apps/v1beta2 at that time.

Your extensions/v1beta1 Deployment is working because each version maintains backwards compatibility, for a time:

We realize that even after the release of apps/v1, users will need time to migrate their code from extensions/v1beta1, apps/v1beta1, and apps/v1beta2. It is important to remember that the minimum support durations listed in the deprecations guidelines are minimums. We will continue to support conversion between groups and versions until users have had sufficient time to migrate.

See https://v1-10.docs.kubernetes.io/docs/reference/workloads-18-19/ for details.
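
For reference, a minimal sketch of what a migrated Deployment header could look like, assuming the pod labels match the app/type labels shown on the deis-created HPA earlier in this thread (note that apps/v1 makes spec.selector required, and it must match the pod template labels):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-cmd
  namespace: test
spec:
  selector:            # required in apps/v1 (and immutable once set)
    matchLabels:
      app: test
      type: cmd
  template:
    metadata:
      labels:
        app: test
        type: cmd
    spec:
      # ... containers and the rest of the pod spec unchanged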

dmcnaught commented 6 years ago

Thanks! Since we've been using deis to deploy most of our apps, I haven't kept up with API versions very well. The interaction here between the HPA and Deployments means that backwards compatibility is a little more complex in this case.

dmcnaught commented 6 years ago

@Cryptophobia I think this issue opens a can of worms that will need addressing sooner rather than later: many of the apiVersions used by deis are probably deprecated now (like the one in this issue's workaround), and hephy should be updated to use the latest k8s apiVersions. Probably one of the highest priority tasks. @kingdonb

kingdonb commented 6 years ago

In the latest kubectl versions, we've heard that kubectl api-versions returns just

v1

...as in literally api: map[v1:{}], with no values in the map at all, just a single v1 key. This is what you should expect to find on k8s clusters at >v1.10.

This PR was an approach that didn't always work, because GKE does not reliably report KubeVersion in a semver-compliant way: https://github.com/teamhephy/controller/pull/72/files

Check out this one: https://github.com/teamhephy/controller/pull/73/files

This is what's in Hephy v2.19.4 now. It reads the contents of .Capabilities.APIVersions directly and uses the same approach we found in Prometheus Operator. This has worked on every cluster we've tested against.

https://github.com/coreos/prometheus-operator/issues/1714

This still seems to be the same approach they're using today: https://github.com/coreos/prometheus-operator/blob/master/helm/alertmanager/templates/psp-clusterrole.yaml
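
In other words, the chart templates choose the apiVersion based on what the cluster actually reports, along these general lines (a rough sketch of the .Capabilities.APIVersions pattern, not the exact contents of those PRs; the resource name is a placeholder):

{{- if .Capabilities.APIVersions.Has "apps/v1" }}
apiVersion: apps/v1
{{- else }}
apiVersion: extensions/v1beta1
{{- end }}
kind: Deployment
metadata:
  name: example-component  # placeholder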

Does that help @dmcnaught ? 🥇

dmcnaught commented 6 years ago

I'm still seeing the problem with deis autoscale not working in hephy 2.19.4 on K8s 1.10 - both fresh installs @kingdonb

kingdonb commented 6 years ago

So there may be more places that opt into or check for particular API versions than we've identified. I'm thinking they will be in the code, rather than inside of the chart templates.

Cryptophobia commented 6 years ago

@dmcnaught :+1:

Looks like the issue here was not that Hephy was using the wrong HPA apiVersion, but that the default API group/version for Deployments has changed.

In the controller's HPA code we have this:

    def api_version(self):
        # API location changes between versions
        # http://kubernetes.io/docs/user-guide/horizontal-pod-autoscaling/#api-object
        if self.version() >= parse("1.3.0"):
            return 'autoscaling/v1'

        # 1.2 and older
        return 'extensions/v1beta1'

The HPA code is already assuming autoscaling/v1.

Looks like in the controller Deployment code we have this:

class Deployment(Resource):
    api_prefix = 'apis'
    api_version = 'extensions/v1beta1'

Sounds like this will need to be migrated to apps/v1. What are the implications in terms of backwards compatibility and such? I am really not sure.

kingdonb commented 6 years ago

Sounds like a feature for v2.20 (it's a breaking change; I'm comfortable breaking compatibility with anything older than k8s v1.3 at this point, if that's what this means... unless anyone has strong objections).

Alternatively, if there is someone who maintains a much older k8s cluster with horizontal pod autoscaling and wants to help us confirm that we haven't broken it with the new release, then we could do that. My oldest cluster is at v1.5, and actually it looks like we just need a cluster <v1.9, so that might work, since if I'm reading this right, that's when Deployment became apps/v1.

I don't have strong feelings about maintaining backwards compatibility forever. I'd much prefer that somebody forces me to upgrade my v1.5 cluster. The only reason I kept it at v1.5 was because I feared the implications of turning on RBAC... and at this time, even standard basic minikube installs come with RBAC enabled. (That v1.5 cluster wasn't running workflow either, so I wouldn't be harmed or feeling "put out" by such a change in the slightest.)

I think we would be better served by making the small breaking change and adding a prominent note about the minimum supported version being v1.9, pushing our users toward the future. I grant that it is a relatively recent version, but on the other hand, the oldest version that you can even request to install on GKE today is v1.9.6.

Cryptophobia commented 6 years ago

I think we would be better served by making the small breaking change and adding a prominent note about the minimum supported version being v1.9, pushing our users toward the future. I grant that it is a relatively recent version, but on the other hand, the oldest version that you can even request to install on GKE today is v1.9.6.

I think this is good. If we go forward and break backwards compatibility, we should definitely document the minimum version we support.

Cryptophobia commented 5 years ago

This is fixed in https://github.com/teamhephy/controller/pull/106