rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

monitoring v2 fails to deploy after disabling monitoring v1 on a cluster #32499

Closed sowmyav27 closed 3 years ago

sowmyav27 commented 3 years ago

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

tion":"Resources represents the minimum resources the volume should have. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#resources","properties":{"limits":{"additionalProperties":{"anyOf":[{"type":"integer"},{"type":"string"}],"pattern":"^(\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))))?$","x-kubernetes-int-or-string":true},"description":"Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/","type":"object"},"requests":{"additionalProperties":{"anyOf":[{"type":"integer"},{"type":"string"}],"pattern":"^(\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))))?$","x-kubernetes-int-or-string":true},"description":"Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/","type":"object"}},"type":"object"},"selector":{"description":"A label query over volumes to consider for binding.","properties":{"matchExpressions":{"description":"matchExpressions is a list of label selector requirements. The requirements are ANDed.","items":{"description":"A label selector requirement is a selector that contains values, a key, and an operator that relates the key and values.","properties":{"key":{"description":"key is the label key that the selector applies to.","type":"string"},"operator":{"description":"operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.","type":"string"},"values":{"description":"values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. 
If the operator is Exists or Doe
sNotExist, the values array must be empty. This array is replaced during a strategic merge patch.","items":{"type":"string"},"type":"array"}},"required":["key","operator"],"type":"object"},"type":"array"},"matchLabels":{"additionalProperties":{"type":"string"},"description":"matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is \"key\", the operator is \"In\", and the values array contains only \"value\". The requirements are ANDed.","type":"object"}},"type":"object"},"storageClassName":{"description":"Name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1","type":"string"},"volumeMode":{"description":"volumeMode defines what type of volume is required by the claim. Value of Filesystem is implied when not included in claim spec.","type":"string"},"volumeName":{"description":"VolumeName is the binding reference to the PersistentVolume backing this claim.","type":"string"}},"type":"object"},"status":{"description":"Status represents the current information/status of a persistent volume claim. Read-only. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims","properties":{"accessModes":{"description":"AccessModes contains the actual access modes the volume backing the PVC has. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#access-modes-1","items":{"type":"string"},"type":"array"},"capacity":{"additionalProperties":{"anyOf":[{"type":"integer"},{"type":"string"}],"pattern":"^(\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))))?$","x-kubernetes-int-or-string":true},"description":"Represents the actual resources of the underlying volume.","type":"object"},"conditions":{"description":"Current Condition of persistent volume claim. 
If underlying persistent volume is being resized then the Condition will be set to 'Resi
zeStarted'.","items":{"description":"PersistentVolumeClaimCondition contails details about state of pvc","properties":{"lastProbeTime":{"description":"Last time we probed the condition.","format":"date-time","type":"string"},"lastTransitionTime":{"description":"Last time the condition transitioned from one status to another.","format":"date-time","type":"string"},"message":{"description":"Human-readable message indicating details about last transition.","type":"string"},"reason":{"description":"Unique, this should be a short, machine understandable string that gives the reason for condition's last transition. If it reports \"ResizeStarted\" that means the underlying persistent volume is being resized.","type":"string"},"status":{"type":"string"},"type":{"description":"PersistentVolumeClaimConditionType is a valid value of PersistentVolumeClaimCondition.Type","type":"string"}},"required":["status","type"],"type":"object"},"type":"array"},"phase":{"description":"Phase represents the current phase of PersistentVolumeClaim.","type":"string"}},"type":"object"}},"type":"object"}},"type":"object"},"tag":{"description":"Tag of Prometheus container image to be deployed. Defaults to the value of `version`. Version is ignored if Tag is set. Deprecated: use 'image' instead.  The image tag can be specified as part of the image URL.","type":"string"},"thanos":{"description":"Thanos configuration allows configuring various aspects of a Prometheus server in a Thanos environment. \n This section is experimental, it may change significantly without deprecation notice in any release. \n This is experimental and may change significantly without backward compatibility in any release.","properties":{"baseImage":{"description":"Thanos base image if other than default. Deprecated: use 'image' instead","type":"string"},"grpcServerTlsConfig":{"description":"GRPCServerTLSConfig configures the gRPC server from which Thanos Querier reads recorded rule data. 
Note: Currently only the CAFile, CertFile, and KeyFile fields are supported. Maps to
the '--grpc-server-tls-*' CLI args.","properties":{"ca":{"description":"Struct containing the CA cert to use for the targets.","properties":{"configMap":{"description":"ConfigMap containing data to use for the targets.","properties":{"key":{"description":"The key to select.","type":"string"},"name":{"description":"Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names TODO: Add other useful fields. apiVersion, kind, uid?","type":"string"},"optional":{"description":"Specify whether the ConfigMap or its key must be defined","type":"boolean"}},"required":["key"],"type":"object"},"secret":{"description":"Secret containing data to use for the targets.","properties":{"key":{"description":"The key of the secret to select from.  Must be a valid secret key.","type":"string"},"name":{"description":"Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names TODO: Add other useful fields. apiVersion, kind, uid?","type":"string"},"optional":{"description":"Specify whether the Secret or its key must be defined","type":"boolean"}},"required":["key"],"type":"object"}},"type":"object"},"caFile":{"description":"Path to the CA cert in the Prometheus container to use for the targets.","type":"string"},"cert":{"description":"Struct containing the client cert file for the targets.","properties":{"configMap":{"description":"ConfigMap containing data to use for the targets.","properties":{"key":{"description":"The key to select.","type":"string"},"name":{"description":"Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names TODO: Add other useful fields. apiVersion, kind, uid?","type":"string"},"optional":{"description":"Specify whether the ConfigMap or its key must be defined","type":"boolean"}},"required":["key"],"type":"object"},"secret":{"description":"Secret containing data to use for the targets."

Workaround: Delete the monitoring v2 CRD and reinstall it; it then deploys.

Environment information

Jono-SUSE-Rancher commented 3 years ago

@sowmyav27 - Is this a new failure (a regression from 2.5.6 to 2.5.8), or has this always been happening?

sowmyav27 commented 3 years ago

I don't see this issue on 2.5.7 using monitoring v1 version 0.2.1 and monitoring v2 version 9.4.203.

Jono-SUSE-Rancher commented 3 years ago

This feels like a regression if it's not happening in 2.5.7. Thanks for giving me the details Sowmya, I'll do some research and figure out where we want to be with it.

aiyengar2 commented 3 years ago

@sowmyav27 seems like the logs posted above are a partial dump. Do you have access to the full logs? Specifically, at the very end of the logs there should be an error that indicates exactly which property within the CRD upgrade was rejected iirc.

Also, I don't think this is a regression but rather an artifact of trying to install v0.39.0 Prometheus Operator CRDs for Monitoring V1 (which still use non-structural schemas based on apiextensions.k8s.io/v1beta1) and then trying to upgrade to v0.45.0 CRDs for Monitoring V2 (which use structural schemas based on apiextensions.k8s.io/v1).

This is similar to the issue we found when trying to upgrade Monitoring V2 from 9.4.203 to 14.5.100, which was resolved by https://github.com/rancher/charts/pull/1131. This was due to an issue with upgrading from Prometheus Operator CRDs of v0.38.1 to v0.45.0.

aiyengar2 commented 3 years ago

> @sowmyav27 seems like the logs posted above are a partial dump. Do you have access to the full logs? Specifically, at the very end of the logs there should be an error that indicates exactly which property within the CRD upgrade was rejected iirc.

Spoke offline; we don't have those logs anymore. I'll attempt to reproduce and add the full logs to this ticket.

aiyengar2 commented 3 years ago

Reproduced this issue. The following logs were printed from the rancher-monitoring-crd-create pod and the pod keeps restarting.

set-preserve-unknown-fields-false:

Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "alertmanagerconfigs.monitoring.coreos.com" not found
The CustomResourceDefinition "alertmanagers.monitoring.coreos.com" is invalid: spec.versions[0].schema.openAPIV3Schema: Required value: schemas are required
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "probes.monitoring.coreos.com" not found
The CustomResourceDefinition "prometheuses.monitoring.coreos.com" is invalid: spec.versions[0].schema.openAPIV3Schema: Required value: schemas are required
The CustomResourceDefinition "prometheusrules.monitoring.coreos.com" is invalid: spec.versions[0].schema.openAPIV3Schema: Required value: schemas are required
The CustomResourceDefinition "servicemonitors.monitoring.coreos.com" is invalid: spec.versions[0].schema.openAPIV3Schema: Required value: schemas are required

create-crds:

create-crds.log

Resource: "apiextensions.k8s.io/v1, Resource=customresourcedefinitions", GroupVersionKind: "apiextensions.k8s.io/v1, Kind=CustomResourceDefinition"
Name: "prometheuses.monitoring.coreos.com", Namespace: ""
for: "/etc/config/crd-manifest.yaml": CustomResourceDefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" is invalid: spec.preserveUnknownFields: Invalid value: true: must be false in order to use defaults in the schema
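
The `schemas are required` errors from the set-preserve-unknown-fields-false container reflect a rule of `apiextensions.k8s.io/v1`: every entry in `spec.versions` has to carry a `schema.openAPIV3Schema`. A minimal sketch of that check follows, assuming the CRD manifest has already been loaded into a Python dict. The function name `versions_missing_schema` and the sample manifest are hypothetical and only illustrate the rule; this is not how the chart or the API server implements it.

```python
from typing import Any, Dict, List


def versions_missing_schema(crd: Dict[str, Any]) -> List[str]:
    """Return the names of spec.versions entries without schema.openAPIV3Schema."""
    missing = []
    for version in crd.get("spec", {}).get("versions", []):
        schema = (version.get("schema") or {}).get("openAPIV3Schema")
        if not schema:
            missing.append(version.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest for illustration only: a v1 CRD whose single
# version entry has no schema at all.
sample_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "prometheuses.monitoring.coreos.com"},
    "spec": {"versions": [{"name": "v1", "served": True, "storage": True}]},
}

print(versions_missing_schema(sample_crd))
```

An empty list means every version carries a schema; any name in the list corresponds to a `spec.versions[i]` entry that would trip the error shown in the logs.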

aiyengar2 commented 3 years ago

> Also, I don't think this is a regression but rather an artifact of trying to install v0.39.0 Prometheus Operator CRDs for Monitoring V1 (which still use non-structural schemas based on apiextensions.k8s.io/v1beta1) and then trying to upgrade to v0.45.0 CRDs for Monitoring V2 (which use structural schemas based on apiextensions.k8s.io/v1).

Correction: this issue happens because the patch step temporarily attempts to convert a structural schema (post upgrade and uninstall of V1) into a non-structural schema. As a result, preserveUnknownFields is never removed, which is why we see this issue in Monitoring V1 -> V2 migrations.

Testing out a fix, will get a PR out soon.
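
The create-crds error quoted earlier states the rule involved: `spec.preserveUnknownFields` must be `false` for a CRD whose schema uses `default` values. The sketch below checks a manifest for that combination, assuming the CRD is already loaded as a Python dict. The helper names `find_defaults` and `preserve_unknown_fields_conflicts`, and the field `exampleField`, are hypothetical and exist only for illustration; this is not the fix that was applied to the chart.

```python
from typing import Any, Dict, Iterator, List


def find_defaults(schema: Any, path: str = "openAPIV3Schema") -> Iterator[str]:
    """Yield the path of every schema node that declares a `default` keyword."""
    if not isinstance(schema, dict):
        return
    if "default" in schema:
        yield path
    # `properties` maps field names to sub-schemas.
    for name, sub in (schema.get("properties") or {}).items():
        yield from find_defaults(sub, f"{path}.properties.{name}")
    # Keywords holding a single sub-schema (booleans and lists are skipped).
    for key in ("items", "additionalProperties", "not"):
        yield from find_defaults(schema.get(key), f"{path}.{key}")
    # Keywords holding a list of sub-schemas.
    for key in ("allOf", "anyOf", "oneOf"):
        for index, sub in enumerate(schema.get(key) or []):
            yield from find_defaults(sub, f"{path}.{key}[{index}]")


def preserve_unknown_fields_conflicts(crd: Dict[str, Any]) -> List[str]:
    """List schema defaults that conflict with spec.preserveUnknownFields: true."""
    spec = crd.get("spec", {})
    if not spec.get("preserveUnknownFields", False):
        return []
    conflicts = []
    for version in spec.get("versions", []):
        schema = (version.get("schema") or {}).get("openAPIV3Schema")
        conflicts.extend(f"{version.get('name')}: {p}" for p in find_defaults(schema))
    return conflicts


# Hypothetical manifest: preserveUnknownFields left at true while the
# schema declares a default on an illustrative field.
sample_crd = {
    "spec": {
        "preserveUnknownFields": True,
        "versions": [{
            "name": "v1",
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {
                    "type": "object",
                    "properties": {"exampleField": {"type": "string", "default": "x"}},
                }},
            }},
        }],
    }
}

conflicts = preserve_unknown_fields_conflicts(sample_crd)
print(conflicts)
```

Running a check like this against the existing CRD before applying the new manifest would show whether `preserveUnknownFields` still needs to be set to `false` first.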

aiyengar2 commented 3 years ago

@sowmyav27 seems like not overriding the schema did the trick, but since this affects the CRD installation process we should double check that all test cases involving CRD installs work as expected.

To test this fix, I tried an upgrade from 9.4.203 to 14.5.100, a clean install, and a Monitoring V1 to Monitoring V2 upgrade; all of them seem to work as expected now.

sowmyav27 commented 3 years ago

On 2.5.8-rc16

digiserg commented 3 years ago

I just tried deploying 14.5.100 to a brand new cluster and am getting the same error as reported here:

rancher-monitoring-crd-create-xt4vj create-crds to:
rancher-monitoring-crd-create-xt4vj create-crds Resource: "apiextensions.k8s.io/v1, Resource=customresourcedefinitions", GroupVersionKind: "apiextensions.k8s.io/v1, Kind=CustomResourceDefinition"
rancher-monitoring-crd-create-xt4vj create-crds Name: "prometheuses.monitoring.coreos.com", Namespace: ""
rancher-monitoring-crd-create-xt4vj create-crds for: "/etc/config/crd-manifest.yaml": CustomResourceDefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" is invalid: spec.preserveUnknownFields: Invalid value: true: must be false in order to use defaults in the schema

aiyengar2 commented 3 years ago

@digiserg we have an issue for this at https://github.com/rancher/rancher/issues/32827, but the workaround is to just try redeploying the CRD chart / original chart. It seems to be a race condition.
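
Since the suggested workaround is simply to try the deployment again, it can be automated with a small retry loop. The following is a minimal sketch under stated assumptions: `retry` and `flaky_deploy` are hypothetical names, and `flaky_deploy` only simulates a deployment that fails once. A real script would replace it with whatever actually redeploys the chart.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_seconds: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action` until it succeeds or `attempts` tries have failed."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Hypothetical stand-in for redeploying the CRD chart: fails on the first
# call and succeeds on the second, mimicking a transient race.
calls = {"count": 0}


def flaky_deploy() -> str:
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("simulated CRD install failure")
    return "deployed"


result = retry(flaky_deploy, attempts=3, delay_seconds=0.0)
print(result)
```

A fixed delay keeps the sketch short; if the race turns out to be timing-sensitive, a growing delay between attempts may be the better choice.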