miracle2k / k8s-snapshots

Automatic Volume Snapshots on Kubernetes.
BSD 2-Clause "Simplified" License
350 stars 66 forks

docs: CustomResourceDefinition is mandatory? + A tip for kops users #34

Open mrtyler opened 6 years ago

mrtyler commented 6 years ago

Hello,

I wanted to check my understanding on a couple things before I offer a PR.

CustomResourceDefinition is mandatory?

I have a k8s 1.8.0 cluster. Following the README I deployed k8s-snapshots (both v2.0 and dev), annotated a Persistent Volume, and hit this error:

2017-10-05T04:02:34.136420193Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://100.64.0.1:443/apis/k8s-snapshots.elsdoerfer.com/v1/snapshotrules?watch=true

k8s-snapshots.elsdoerfer.com was not in the output of curl https://100.64.0.1:443/apis/. I deployed the CustomResourceDefinition from the "Manual snapshot rules" section later in the README and the error went away.
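
For anyone else who hits this, the CRD from that README section looks roughly like the sketch below. The group, version, and plural are taken from the 404 URL above and the `SnapshotRule` kind from the logs; the `apiVersion` and `scope` are my assumptions for a 1.7/1.8-era cluster, so treat the README as authoritative:

```yaml
# Sketch of the SnapshotRule CRD (names inferred from the 404 URL;
# apiVersion/scope are assumptions -- see the README for the real manifest)
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: snapshotrules.k8s-snapshots.elsdoerfer.com
spec:
  group: k8s-snapshots.elsdoerfer.com
  version: v1
  scope: Namespaced
  names:
    plural: snapshotrules
    singular: snapshotrule
    kind: SnapshotRule
```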

Is it expected that the CRD is mandatory, at least in k8s 1.7+? If so, I'll update the docs. If not, I can provide more detail about my setup if this is a bug worth investigating.

A tip for kops users

We use kops to manage our k8s cluster. k8s-snapshots didn't work out of the box due to a permissions issue. If you agree, I'd like to add a tip about this to the README for fellow kops users:

k8s-snapshots needs EBS and S3 permissions to take and save snapshots. Under the kops IAM Role scheme, only Masters have these permissions, so the easiest solution is to run k8s-snapshots on a Master.

To run on a Master, we need to tolerate the master taint and pin the pod to master nodes. To do this, add the following to the README's manifest for the k8s-snapshots Deployment:

spec:
  ...
  template:
  ...
    spec:
      ...
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Equal"
        value: ""
        effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/role: master

Thanks

k8s-snapshots is cool! :)

miracle2k commented 6 years ago

The note about kops is great; I would definitely merge that PR.

CustomResourceDefinition is not mandatory. You should be able to ignore the error in the logs just fine. Backup disks then need to be annotated as described in the docs.
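
For instance, a sketch of an annotated PVC (the `backup.kubernetes.io/deltas` key is the one from the docs; the name, storage class, and deltas here are just illustrative):

```yaml
# Illustrative PVC with a snapshot-deltas annotation; k8s-snapshots
# picks this up without any CRD being installed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: couchdb-pvc
  annotations:
    backup.kubernetes.io/deltas: PT5M PT15M PT45M
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```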

mrtyler commented 6 years ago

PR for kops tip in README: https://github.com/miracle2k/k8s-snapshots/pull/35

On the CRD thing: You are correct that scheduled snapshots work in spite of the error (I must have failed to wait long enough during my testing :)).

However, the stack trace is kind of ominous. Here it is after starting k8s-snapshots:dev on my cluster without the CRD.

2017-10-08T15:04:13.047876109Z 2017-10-08T15:04:13.046730Z rule.heartbeat                 [k8s_snapshots.core] message=rule.heartbeat rules=None severity=INFO
2017-10-08T15:04:13.049515787Z 2017-10-08T15:04:13.048665Z kube-config.from-service-account [k8s_snapshots.context] message=kube-config.from-service-account severity=INFO
2017-10-08T15:04:13.07426379Z 2017-10-08T15:04:13.072241Z volume-event.received          [k8s_snapshots.core] event_object={'kind': 'PersistentVolume', 'apiVersion': 'v1', 'metadata': {'name': 'couchdb-pv', 'selfLink': '/api/v1/persistentvolumes/couchdb-pv', 'uid': 'd9c3b5a7-ac39-11e7-85b1-061d4acbdfa0', 'resourceVersion': '6386', 'creationTimestamp': '2017-10-08T15:03:32Z', 'labels': {'failure-domain.beta.kubernetes.io/region': 'us-east-2', 'failure-domain.beta.kubernetes.io/zone': 'us-east-2a'}, 'annotations': {'kubectl.kubernetes.io/last-applied-configuration': '{"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{},"name":"couchdb-pv","namespace":""},"spec":{"accessModes":["ReadWriteOnce"],"awsElasticBlockStore":{"volumeID":"vol-026057a3270dc432b"},"capacity":{"storage":"5Gi"},"storageClassName":"gpii-default"}}\n', 'pv.kubernetes.io/bound-by-controller': 'yes'}}, 'spec': {'capacity': {'storage': '5Gi'}, 'awsElasticBlockStore': {'volumeID': 'vol-026057a3270dc432b'}, 'accessModes': ['ReadWriteOnce'], 'claimRef': {'kind': 'PersistentVolumeClaim', 'namespace': 'default', 'name': 'couchdb-pvc', 'uid': 'da4cdb5d-ac39-11e7-9d66-0adf15f46cfa', 'apiVersion': 'v1', 'resourceVersion': '6384'}, 'persistentVolumeReclaimPolicy': 'Retain', 'storageClassName': 'gpii-default'}, 'status': {'phase': 'Bound'}} event_type=ADDED message=volume-event.received: event_type='ADDED', event_object.metadata.name='couchdb-pv' severity=INFO
2017-10-08T15:04:13.23814782Z 2017-10-08T15:04:13.236536Z rule.added                     [k8s_snapshots.core] event_object={'kind': 'PersistentVolume', 'apiVersion': 'v1', 'metadata': {'name': 'couchdb-pv', 'selfLink': '/api/v1/persistentvolumes/couchdb-pv', 'uid': 'd9c3b5a7-ac39-11e7-85b1-061d4acbdfa0', 'resourceVersion': '6386', 'creationTimestamp': '2017-10-08T15:03:32Z', 'labels': {'failure-domain.beta.kubernetes.io/region': 'us-east-2', 'failure-domain.beta.kubernetes.io/zone': 'us-east-2a'}, 'annotations': {'kubectl.kubernetes.io/last-applied-configuration': '{"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{},"name":"couchdb-pv","namespace":""},"spec":{"accessModes":["ReadWriteOnce"],"awsElasticBlockStore":{"volumeID":"vol-026057a3270dc432b"},"capacity":{"storage":"5Gi"},"storageClassName":"gpii-default"}}\n', 'pv.kubernetes.io/bound-by-controller': 'yes'}}, 'spec': {'capacity': {'storage': '5Gi'}, 'awsElasticBlockStore': {'volumeID': 'vol-026057a3270dc432b'}, 'accessModes': ['ReadWriteOnce'], 'claimRef': {'kind': 'PersistentVolumeClaim', 'namespace': 'default', 'name': 'couchdb-pvc', 'uid': 'da4cdb5d-ac39-11e7-9d66-0adf15f46cfa', 'apiVersion': 'v1', 'resourceVersion': '6384'}, 'persistentVolumeReclaimPolicy': 'Retain', 'storageClassName': 'gpii-default'}, 'status': {'phase': 'Bound'}} event_type=ADDED message=rule.added: rule.name='pvc-couchdb-pvc' rule=Rule(name='pvc-couchdb-pvc', deltas=[datetime.timedelta(0, 300), datetime.timedelta(0, 900), datetime.timedelta(0, 2700)], backend='aws', disk=AWSDiskIdentifier(region='us-east-2', volume_id='vol-026057a3270dc432b'), source='/api/v1/namespaces/default/persistentvolumeclaims/couchdb-pvc') severity=INFO
2017-10-08T15:04:15.062602774Z 2017-10-08T15:04:15.061155Z volume-event.received          [k8s_snapshots.core] event_object={'kind': 'PersistentVolumeClaim', 'apiVersion': 'v1', 'metadata': {'name': 'couchdb-pvc', 'namespace': 'default', 'selfLink': '/api/v1/namespaces/default/persistentvolumeclaims/couchdb-pvc', 'uid': 'da4cdb5d-ac39-11e7-9d66-0adf15f46cfa', 'resourceVersion': '6388', 'creationTimestamp': '2017-10-08T15:03:33Z', 'annotations': {'backup.kubernetes.io/deltas': 'PT5M PT15M PT45M', 'kubectl.kubernetes.io/last-applied-configuration': '{"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"backup.kubernetes.io/deltas":"PT5M PT15M PT45M"},"name":"couchdb-pvc","namespace":"default"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"5Gi"}},"storageClassName":"gpii-default"}}\n', 'pv.kubernetes.io/bind-completed': 'yes', 'pv.kubernetes.io/bound-by-controller': 'yes'}}, 'spec': {'accessModes': ['ReadWriteOnce'], 'resources': {'requests': {'storage': '5Gi'}}, 'volumeName': 'couchdb-pv', 'storageClassName': 'gpii-default'}, 'status': {'phase': 'Bound', 'accessModes': ['ReadWriteOnce'], 'capacity': {'storage': '5Gi'}}} event_type=ADDED message=volume-event.received: event_type='ADDED', event_object.metadata.name='couchdb-pvc' severity=INFO
2017-10-08T15:04:15.092107689Z 2017-10-08T15:04:15.090550Z rule.added                     [k8s_snapshots.core] event_object={'kind': 'PersistentVolumeClaim', 'apiVersion': 'v1', 'metadata': {'name': 'couchdb-pvc', 'namespace': 'default', 'selfLink': '/api/v1/namespaces/default/persistentvolumeclaims/couchdb-pvc', 'uid': 'da4cdb5d-ac39-11e7-9d66-0adf15f46cfa', 'resourceVersion': '6388', 'creationTimestamp': '2017-10-08T15:03:33Z', 'annotations': {'backup.kubernetes.io/deltas': 'PT5M PT15M PT45M', 'kubectl.kubernetes.io/last-applied-configuration': '{"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"backup.kubernetes.io/deltas":"PT5M PT15M PT45M"},"name":"couchdb-pvc","namespace":"default"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"5Gi"}},"storageClassName":"gpii-default"}}\n', 'pv.kubernetes.io/bind-completed': 'yes', 'pv.kubernetes.io/bound-by-controller': 'yes'}}, 'spec': {'accessModes': ['ReadWriteOnce'], 'resources': {'requests': {'storage': '5Gi'}}, 'volumeName': 'couchdb-pv', 'storageClassName': 'gpii-default'}, 'status': {'phase': 'Bound', 'accessModes': ['ReadWriteOnce'], 'capacity': {'storage': '5Gi'}}} event_type=ADDED message=rule.added: rule.name='pvc-couchdb-pvc' rule=Rule(name='pvc-couchdb-pvc', deltas=[datetime.timedelta(0, 300), datetime.timedelta(0, 900), datetime.timedelta(0, 2700)], backend='aws', disk=AWSDiskIdentifier(region='us-east-2', volume_id='vol-026057a3270dc432b'), source='/api/v1/namespaces/default/persistentvolumeclaims/couchdb-pvc') severity=INFO
2017-10-08T15:04:16.066642947Z 2017-10-08T15:04:16.064487Z watch-resources.worker.error   [k8s_snapshots.kube] message=watch-resources.worker.error resource_type_name=SnapshotRule severity=ERROR
2017-10-08T15:04:16.066674797Z Traceback (most recent call last):
2017-10-08T15:04:16.066679198Z   File "/usr/local/lib/python3.6/site-packages/k8s_snapshots-0.0.0-py3.6.egg/k8s_snapshots/kube.py", line 181, in worker
2017-10-08T15:04:16.066682547Z     for event in sync_iterator:
2017-10-08T15:04:16.066685883Z   File "/usr/local/lib/python3.6/site-packages/pykube/query.py", line 156, in object_stream
2017-10-08T15:04:16.066688995Z     self.api.raise_for_status(r)
2017-10-08T15:04:16.066691617Z   File "/usr/local/lib/python3.6/site-packages/pykube/http.py", line 99, in raise_for_status
2017-10-08T15:04:16.06669461Z     resp.raise_for_status()
2017-10-08T15:04:16.066698262Z   File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status
2017-10-08T15:04:16.066701479Z     raise HTTPError(http_error_msg, response=self)
2017-10-08T15:04:16.066713656Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://100.64.0.1:443/apis/k8s-snapshots.elsdoerfer.com/v1/snapshotrules?watch=true

This looks a little scary, so I suggest suppressing the error and logging something like "No CustomResourceDefinition found, but that is ok" instead of a dozen lines of traceback. I don't have time to PR it myself, but I can open a separate issue for this if you like.
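
A hypothetical sketch of what I mean (not the actual k8s-snapshots code; `watch_snapshot_rules` and its arguments are names I made up for illustration). The idea is to wrap the SnapshotRule watch so that a 404, which just means the CRD was never created, becomes a friendly one-liner, while any other HTTP error still propagates:

```python
# Hypothetical sketch: treat a 404 on the SnapshotRule watch as
# "CRD not installed" instead of dumping a full traceback.
import logging

import requests


def watch_snapshot_rules(sync_iterator, logger=logging.getLogger(__name__)):
    """Yield watch events, downgrading a CRD 404 to an info message."""
    try:
        for event in sync_iterator:
            yield event
    except requests.exceptions.HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 404:
            # The CRD is optional: a 404 means it simply isn't installed,
            # so manual snapshot rules are disabled. Annotations still work.
            logger.info(
                "No SnapshotRule CustomResourceDefinition found; "
                "manual snapshot rules are disabled, but that is ok."
            )
        else:
            raise
```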

Regardless, Google can now find this error message in this issue so hopefully future users will be less frightened of this stack trace than I was :).