storageos / storageos.github.io

Public documentation for StorageOS, persistent storage for Docker and Kubernetes
https://docs.storageos.com

Daemon pods keep failing #251

Closed jbonnett92 closed 5 years ago

jbonnett92 commented 5 years ago

Hi, I am trying to install StorageOS on my Kubernetes cluster; Kubernetes is installed on CoreOS. Looking at the logs, the one line that stands out is: panic: runtime error: slice bounds out of range

time="2019-01-02T12:22:39Z" level=info msg="by using this product, you are agreeing to the terms of the StorageOS Ltd. End User Subscription Agreement (EUSA) found at: https://eusa.storageos.com" module=command
time="2019-01-02T12:22:39Z" level=info msg=starting address=********* hostname=clust1-worker-1 id=856a20dd-1e6f-aabf-a152-fa3c4f5dc5a5 join="*********,*********" module=command version="StorageOS 1.0.2 (13c3612), built: 2018-12-07T140018Z"
panic: runtime error: slice bounds out of range
 goroutine 1 [running]:
code.storageos.net/storageos/control/vendor/github.com/aws/aws-sdk-go/aws/ec2metadata.(*EC2Metadata).Region(0xc4200c4650, 0x0, 0xc420296760, 0xc4208e9080, 0xc4204bcc20)
    /go/src/code.storageos.net/storageos/control/vendor/github.com/aws/aws-sdk-go/aws/ec2metadata/api.go:122 +0xa3
code.storageos.net/storageos/control/integration/iaas/ec2.(*Provider).GetZone(0xc4202997e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xffffffffffffffff, 0x20002)
    /go/src/code.storageos.net/storageos/control/integration/iaas/ec2/provider.go:112 +0x59
code.storageos.net/storageos/control/controlplane.applyProviderMetadata(0xc4201df900, 0x1fc0600, 0xc4202997e0, 0x0, 0xc4204416b0)
    /go/src/code.storageos.net/storageos/control/controlplane/node.go:234 +0x1ea
code.storageos.net/storageos/control/controlplane.applyMetadata(0xc4201df900, 0x9, 0x1f94b90)
    /go/src/code.storageos.net/storageos/control/controlplane/node.go:229 +0x66
code.storageos.net/storageos/control/controlplane.writeNodeBootstrapConfig(0x1fc07c0, 0xc42037b620, 0xc420279900, 0xc42000e178, 0xc420440060)
    /go/src/code.storageos.net/storageos/control/controlplane/node.go:194 +0xd60
code.storageos.net/storageos/control/controlplane.Create(0xc420279900, 0xc4204d0fe0, 0xc42000e178, 0x1f97820, 0xc4200c4a38, 0x0, 0x1f8a7c8, 0x7)
    /go/src/code.storageos.net/storageos/control/controlplane/server.go:182 +0xb52
code.storageos.net/storageos/control/command/server.(*Command).Run(0xc4201968c0, 0xc4200ca010, 0x0, 0x0, 0xc4204d09c0)
    /go/src/code.storageos.net/storageos/control/command/server/command.go:122 +0x8f4
code.storageos.net/storageos/control/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc4201b0a00, 0xc4201b0a00, 0xc4204d0a00, 0x0)
    /go/src/code.storageos.net/storageos/control/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
main.realMain(0xc4200a6058)
    /go/src/code.storageos.net/storageos/control/main.go:38 +0x126
main.main()
    /go/src/code.storageos.net/storageos/control/main.go:26 +0x22

Any ideas what could be causing this?

Thanks, Jamie

jbonnett92 commented 5 years ago

@domodwyer Yeah my servers are not on AWS.

domodwyer commented 5 years ago

Hi @jbonnett92

The above PR fixes a broader issue with the AWS SDK (panicking when receiving an unexpected response to a "get region" EC2 metadata request), and it should also resolve your ticket.

I believe this only happens because your servers are not in AWS; within AWS the SDK can be assumed to always receive a valid response. Do you happen to have an HTTP server listening on 169.254.169.254 in your environment?
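
For reference, a quick way to check from one of the nodes (a sketch, assuming curl is available on the host; the path is the standard EC2 metadata endpoint the SDK queries to derive the region):

# If nothing is listening on the link-local metadata address, this should time out or refuse the connection
curl -sS --max-time 2 http://169.254.169.254/latest/meta-data/placement/availability-zone; echo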

Dom

jbonnett92 commented 5 years ago

@domodwyer I was just confirming that I wasn't :) Not sure, I deleted my cluster; I will check soon.

domodwyer commented 5 years ago

Hey @jbonnett92

We should have a release out today/tomorrow that fixes this for you :) If it doesn't please let me know!

Thanks for taking the time to open a ticket.

Dom

jbonnett92 commented 5 years ago

@domodwyer Could you update the helm chart for it too please? Unless it uses this repository as a resource?

domodwyer commented 5 years ago

Hey @jbonnett92 - we absolutely will!

I'll leave this ticket open until the release is out and the chart has been updated so you know when it's all sorted 👍

Dom

jbonnett92 commented 5 years ago

@domodwyer Looks like your change to aws-sdk-go has been merged 👍

actionbuddha commented 5 years ago

Hi @jbonnett92 , just to let you know we are preparing a release now and expect to release next Monday or Tuesday, at which point we'll update the Helm charts and documentation. I'll ping a final update when that's done.

jbonnett92 commented 5 years ago

@actionbuddha Thank you

jbonnett92 commented 5 years ago

@actionbuddha I noticed the recent merge #253, does that mean it is ready for me to use?

actionbuddha commented 5 years ago

Hi @jbonnett92, yes, 1.1.0 is now released - please do go ahead and try it. Could you confirm back here whether it solves your issue?
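
If it helps, one way to see which chart versions are available (a sketch assuming Helm 2, which was current at the time, and the storageos/storageos chart):

helm repo update
# -l/--versions lists every chart version together with the app version it ships
helm search storageos/storageos -l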

jbonnett92 commented 5 years ago

@actionbuddha It doesn't appear that version exists in the helm chart, even after a helm repo update.

However, looking at the latest merge to the chart repo I noticed a version number of 0.2.10; I'm guessing this is the one I should use?

If so, this is what I get. With the master IP added to the join parameter:

time="2019-01-11T01:28:44Z" level=info msg="by using this product, you are agreeing to the terms of the StorageOS Ltd. End User Subscription Agreement (EUSA) found at: https://eusa.storageos.com" module=command
time="2019-01-11T01:28:44Z" level=info msg=starting address=**.***.***.*** hostname=c1w1c id=2e2539fe-77c7-e40e-aa8f-cf0b37da0291 join="10.8.96.3,10.8.96.4,10.8.96.5" module=command version="StorageOS 1.1.0 (5e8ccdf), built: 2019-01-03T160821Z"
time="2019-01-11T01:28:50Z" level=info msg="kv store ready" action=wait address="http://127.0.0.1:5706" backend=embedded category=etcd module=cp
time="2019-01-11T01:28:50Z" level=info msg="this cluster is configured to send anonymous usage data to help us develop StorageOS (https://docs.storageos.com/docs/reference/telemetry)" module=cp
time="2019-01-11T01:28:50Z" level=info msg="joining scheduler elections" module=scheduler
time="2019-01-11T01:28:50Z" level=info msg="became leader, initialising" module=scheduler
time="2019-01-11T01:28:50Z" level=info msg="started leader tasks" action=establish category=leader module=scheduler term=1
time="2019-01-11T01:28:51Z" level=info msg="liocheck: OK" category=fslio module=dataplane proc=liocheck
time="2019-01-11T01:28:51Z" level=info msg="startup complete - ready for operation" module=command
time="2019-01-11T01:31:51Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=nameidx/locks/maintenance module=store retry_count=0
time="2019-01-11T01:31:52Z" level=error msg="failed to retrieve maintenance mode status, skipping health update" action=establish category=leader error="context deadline exceeded" module=scheduler term=1
time="2019-01-11T01:31:52Z" level=error msg="failed to handle node health changes" category=leader error="context deadline exceeded" module=scheduler term=1
time="2019-01-11T01:31:55Z" level=error msg="failed to read a36fbcdccb7ac318 on stream Message (read tcp **.***.***.***:51380->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:31:55Z" level=error msg="[cas storageos/locks/scheduler]: kvdb error: context deadline exceeded, retry count: 0\n" module=store
time="2019-01-11T01:31:55Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks
time="2019-01-11T01:31:55Z" level=error msg="abandoning expired lock" key=locks/scheduler module=store-locks
time="2019-01-11T01:31:55Z" level=warning msg="lost leadership, stopping scheduler activities" module=scheduler
time="2019-01-11T01:31:55Z" level=info msg="leader told to stop, cancelling context" category=leader module=scheduler term=1
time="2019-01-11T01:31:55Z" level=info msg="leader tasks stopped" action=revoke category=leader module=scheduler term=1
time="2019-01-11T01:31:55Z" level=warning msg="volume watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:31:55Z" level=warning msg="node watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:31:55Z" level=error msg="failed to get node capacity stats" category=capacity error="nats: connection closed" module=taskrunner
time="2019-01-11T01:31:55Z" level=error msg="failed to get node capacity stats" category=capacity error="nats: connection closed" module=taskrunner
time="2019-01-11T01:32:00Z" level=info msg="received stop signal" action=start category=discovery module=ha service=node version=v1
time="2019-01-11T01:33:06Z" level=error msg="failed to read a36fbcdccb7ac318 on stream MsgApp v2 (read tcp **.***.***.***:51382->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:33:07Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=diagnostics/2e2539fe-77c7-e40e-aa8f-cf0b37da0291 module=store retry_count=0
time="2019-01-11T01:33:07Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes/default/ retry_count=0
time="2019-01-11T01:33:07Z" level=error msg="failed to read a36fbcdccb7ac318 on stream Message (read tcp **.***.***.***:52812->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:33:09Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks
time="2019-01-11T01:33:11Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=nodes retry_count=0
time="2019-01-11T01:33:11Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes retry_count=0

Without the master IP added to the join parameter:

time="2019-01-11T01:36:49Z" level=info msg="by using this product, you are agreeing to the terms of the StorageOS Ltd. End User Subscription Agreement (EUSA) found at: https://eusa.storageos.com" module=command
time="2019-01-11T01:36:49Z" level=info msg=starting address=**.***.***.*** hostname=c1w2c id=934146f4-14ef-ee0e-2a25-734fbb6daaf2 join="10.8.96.4,10.8.96.5" module=command version="StorageOS 1.1.0 (5e8ccdf), built: 2019-01-03T160821Z"
time="2019-01-11T01:36:57Z" level=info msg="kv store ready" action=wait address="http://127.0.0.1:5706" backend=embedded category=etcd module=cp
time="2019-01-11T01:36:58Z" level=info msg="this cluster is configured to send anonymous usage data to help us develop StorageOS (https://docs.storageos.com/docs/reference/telemetry)" module=cp
time="2019-01-11T01:36:58Z" level=info msg="joining scheduler elections" module=scheduler
time="2019-01-11T01:36:59Z" level=info msg="liocheck: OK" category=fslio module=dataplane proc=liocheck
time="2019-01-11T01:36:59Z" level=info msg="startup complete - ready for operation" module=command
time="2019-01-11T01:38:04Z" level=info msg="became leader, initialising" module=scheduler
time="2019-01-11T01:38:04Z" level=info msg="started leader tasks" action=establish category=leader module=scheduler term=1
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=nodes retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes/default/ retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=diagnostics/934146f4-14ef-ee0e-2a25-734fbb6daaf2 module=store retry_count=0
time="2019-01-11T01:39:24Z" level=info msg="Etcd did not return any transaction responses for key (locks/scheduler)" module=store
time="2019-01-11T01:39:24Z" level=error msg="lock operation failure" error="value mismatch" key=locks/scheduler module=store-locks
time="2019-01-11T01:39:24Z" level=error msg="abandoning expired lock" key=locks/scheduler module=store-locks
time="2019-01-11T01:39:24Z" level=warning msg="lost leadership, stopping scheduler activities" module=scheduler
time="2019-01-11T01:39:24Z" level=info msg="leader told to stop, cancelling context" category=leader module=scheduler term=1
time="2019-01-11T01:39:24Z" level=info msg="leader tasks stopped" action=revoke category=leader module=scheduler term=1
time="2019-01-11T01:39:24Z" level=warning msg="volume watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:39:24Z" level=warning msg="node watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:39:24Z" level=info msg="received stop signal" action=start category=discovery module=ha service=node version=v1
time="2019-01-11T01:40:27Z" level=error msg="failed to read 2aa116e2b4865444 on stream Message (read tcp **.***.***.***:33772->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:40:30Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks

I replaced the public IPs with * as my system isn't exactly secure yet.

darkowlzz commented 5 years ago

@jbonnett92 Hi, as per the logs you're running StorageOS 1.1.0, which comes from the 0.2.10 helm chart release. The failure seems to be related to a connectivity issue with the embedded key-value store. Usually, when you have IPs in the join token, one of those IPs should be your current node's IP address. You've redacted the starting address (the advertise IP), but you have some IPs in the join token. Can you check whether the advertise IP of this node is one of the IPs in the join token? If not, that could be the reason for the kv store connection timeouts. I would also recommend clearing /var/lib/storageos on all the nodes to ensure any old configuration is removed.
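
A minimal sketch of both checks, run on each node (standard Linux tooling; the state directory is the default /var/lib/storageos mentioned above):

# 1. List the node's IPv4 addresses and confirm one of them appears in the join list
ip -4 addr show | grep "inet "

# 2. With the StorageOS pod stopped on this node, clear any stale state before reinstalling
sudo rm -rf /var/lib/storageos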

An alternative that avoids messing around with these IPs is to use the cluster operator, which makes the installation easier. You can follow the docs and try it out. Since you need the latest version, please add the following to the cluster spec when you try it:

apiVersion: "storageos.com/v1alpha1"
kind: "StorageOSCluster"
...
spec:
  ...
  images:
    nodeContainer: storageos/node:1.1.0
  ...

This will work with k8s 1.12 and below. Support for k8s 1.13 will be released soon.
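
As a rough sketch, assuming you save a complete StorageOSCluster manifest (the snippet above filled in with the metadata and secret reference fields from the operator docs) as storageos-cluster.yaml, applying it and watching the node pods would look like:

kubectl apply -f storageos-cluster.yaml
# The namespace below is an assumption; use whatever namespace the cluster spec deploys into
kubectl -n storageos get pods -w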

jbonnett92 commented 5 years ago

Hi @darkowlzz, sorry, I just realised how confusing that sounded. I have two networks on all the nodes: a public one and a private one.

For the join parameter I added the private IPs of all the nodes (master and workers) in the first attempt, and only the worker nodes in the second. I also tried the hostnames, but had the same issues.

I am currently on Kubernetes 1.13.

darkowlzz commented 5 years ago

Private IPs should work fine, and this would work on k8s 1.13 as well, but not if you are installing using CSI. The in-tree plugin (non-CSI, the default) installation should work fine on k8s 1.13.

Here's an example setup; hope it helps. Let's say I have 3 nodes with internal IPs 10.1.10.165, 10.1.10.166 and 10.1.10.167. If I run the installation correctly, the log would contain something like:

starting address=10.1.10.167 hostname=test01 id=9341sdf2-2a25-734fbb6daaf2 join="10.1.10.165,10.1.10.166,10.1.10.167"

Can you verify that's what you get in the logs?
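
One way to pull that line from every StorageOS pod (the storageos namespace and the app=storageos label are assumptions based on the chart defaults):

for p in $(kubectl -n storageos get pods -l app=storageos -o name); do
  kubectl -n storageos logs "$p" | grep "msg=starting"
done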

I'll check if there's anything else that could be causing this issue.

darkowlzz commented 5 years ago

@jbonnett92 can you also share the helm command with all the parameters you're running to install?

jbonnett92 commented 5 years ago

@darkowlzz Yes, and the only other things installed are the Calico CNI (before StorageOS) and GitLab (after, so that GitLab can use the StorageOS PVC). That line was in the logs above, although it shows the public IPs for some reason.

Do you mean 3 nodes as in 1 master and 2 workers, or 3 workers?

Here is the Helm command that I used:

helm install storageos/storageos --name=storageos --version=0.2.10 --namespace=storageos --set cluster.join="10.8.96.3\,10.8.96.4\,10.8.96.5"
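
As a side note, the resulting daemonset pods can be sanity-checked with something like the following (the daemonset name storageos is an assumption based on the chart defaults):

kubectl -n storageos get pods -o wide
kubectl -n storageos describe daemonset storageos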

avestuk commented 5 years ago

@jbonnett92 What size is your cluster? When @darkowlzz mentioned 3 nodes, that was referring to 3 worker nodes. StorageOS will only work as a single-node cluster or as a cluster of three or more nodes. This is because we use etcd to maintain consensus, and it's not possible to reach consensus with a two-node cluster.

My suggestion would be to install StorageOS on three nodes. Alternatively, you can install StorageOS on a single node, but this means you won't be able to use volume replicas, and therefore your volumes won't be highly available.

If you have further questions I'm also available on our public Slack channel: slack.storageos.com

jbonnett92 commented 5 years ago

@avestuk I have 1 master and 2 workers.

avestuk commented 5 years ago

@jbonnett92 then the only remaining option is to install StorageOS on a single node. A single-node installation has some limitations, but it'll work.
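
A rough sketch of what a single-node install could look like with the same chart (the IP is illustrative; use the private IP of the one node that will run StorageOS):

helm install storageos/storageos --name=storageos --version=0.2.10 --namespace=storageos --set cluster.join="10.8.96.4"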

actionbuddha commented 5 years ago

Hi, closing this out because I believe the original issue was resolved. @jbonnett92 if you get any issues not related to this ticket, please do come and find us on our public Slack channel, or mail support@storageos.com. Thanks!