Closed jbonnett92 closed 5 years ago
@domodwyer Yeah my servers are not on AWS.
Hi @jbonnett92
The above PR was fixing a broader issue with the AWS SDK (panicking when receiving an unexpected response to a "get region" EC2 metadata request) that should also resolve your ticket.
I believe this only happens because your servers are not in AWS; within AWS you would assume the SDK always receives a valid response. Do you happen to have an HTTP server listening on 169.254.169.254 in your environment?
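A quick way to check is to probe that address directly. This is just an illustrative sketch (the `probe_metadata` helper is not part of any SDK or of StorageOS):

```shell
# Sketch: probe the EC2 metadata address (169.254.169.254). Outside AWS
# nothing should normally answer there; if something does, the SDK's
# "get region" request may reach it and receive an unexpected response.
probe_metadata() {
  host="${1:-169.254.169.254}"
  if curl -sf --max-time 2 "http://$host/latest/meta-data/" >/dev/null; then
    echo "something is listening on $host"
  else
    echo "no metadata service reachable on $host"
  fi
}
```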
Dom
@domodwyer I was just confirming that I wasn't :) Not sure, I deleted my cluster; I will check soon.
Hey @jbonnett92
We should have a release out today/tomorrow that fixes this for you :) If it doesn't please let me know!
Thanks for taking the time to open a ticket.
Dom
@domodwyer Could you update the helm chart for it too please? Unless it uses this repository as a resource?
Hey @jbonnett92 - we absolutely will!
I'll leave this ticket open until the release is out and the chart has been updated so you know when it's all sorted 👍
Dom
@domodwyer Looks like your change to aws-sdk-go has been merged 👍
Hi @jbonnett92 , just to let you know we are preparing a release now and expect to release next Monday or Tuesday, at which point we'll update the Helm charts and documentation. I'll ping a final update when that's done.
@actionbuddha Thank you
@actionbuddha I noticed the recent merge #253, does that mean it is ready for me to use?
Hi @jbonnett92, yes, 1.1.0 is now released - please do go ahead and try it. Can you confirm back here whether it solves your issue, please?
@actionbuddha It doesn't appear that version exists in the Helm chart, even after a helm repo update. However, I looked at the latest merge to the chart repo and noticed a version number of 0.2.10; I'm guessing this is the one I should use?
If so, I get the following.
With the master IP added to the join parameter:
time="2019-01-11T01:28:44Z" level=info msg="by using this product, you are agreeing to the terms of the StorageOS Ltd. End User Subscription Agreement (EUSA) found at: https://eusa.storageos.com" module=command
time="2019-01-11T01:28:44Z" level=info msg=starting address=**.***.***.*** hostname=c1w1c id=2e2539fe-77c7-e40e-aa8f-cf0b37da0291 join="10.8.96.3,10.8.96.4,10.8.96.5" module=command version="StorageOS 1.1.0 (5e8ccdf), built: 2019-01-03T160821Z"
time="2019-01-11T01:28:50Z" level=info msg="kv store ready" action=wait address="http://127.0.0.1:5706" backend=embedded category=etcd module=cp
time="2019-01-11T01:28:50Z" level=info msg="this cluster is configured to send anonymous usage data to help us develop StorageOS (https://docs.storageos.com/docs/reference/telemetry)" module=cp
time="2019-01-11T01:28:50Z" level=info msg="joining scheduler elections" module=scheduler
time="2019-01-11T01:28:50Z" level=info msg="became leader, initialising" module=scheduler
time="2019-01-11T01:28:50Z" level=info msg="started leader tasks" action=establish category=leader module=scheduler term=1
time="2019-01-11T01:28:51Z" level=info msg="liocheck: OK" category=fslio module=dataplane proc=liocheck
time="2019-01-11T01:28:51Z" level=info msg="startup complete - ready for operation" module=command
time="2019-01-11T01:31:51Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=nameidx/locks/maintenance module=store retry_count=0
time="2019-01-11T01:31:52Z" level=error msg="failed to retrieve maintenance mode status, skipping health update" action=establish category=leader error="context deadline exceeded" module=scheduler term=1
time="2019-01-11T01:31:52Z" level=error msg="failed to handle node health changes" category=leader error="context deadline exceeded" module=scheduler term=1
time="2019-01-11T01:31:55Z" level=error msg="failed to read a36fbcdccb7ac318 on stream Message (read tcp **.***.***.***:51380->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:31:55Z" level=error msg="[cas storageos/locks/scheduler]: kvdb error: context deadline exceeded, retry count: 0\n" module=store
time="2019-01-11T01:31:55Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks
time="2019-01-11T01:31:55Z" level=error msg="abandoning expired lock" key=locks/scheduler module=store-locks
time="2019-01-11T01:31:55Z" level=warning msg="lost leadership, stopping scheduler activities" module=scheduler
time="2019-01-11T01:31:55Z" level=info msg="leader told to stop, cancelling context" category=leader module=scheduler term=1
time="2019-01-11T01:31:55Z" level=info msg="leader tasks stopped" action=revoke category=leader module=scheduler term=1
time="2019-01-11T01:31:55Z" level=warning msg="volume watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:31:55Z" level=warning msg="node watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:31:55Z" level=error msg="failed to get node capacity stats" category=capacity error="nats: connection closed" module=taskrunner
time="2019-01-11T01:31:55Z" level=error msg="failed to get node capacity stats" category=capacity error="nats: connection closed" module=taskrunner
time="2019-01-11T01:32:00Z" level=info msg="received stop signal" action=start category=discovery module=ha service=node version=v1
time="2019-01-11T01:33:06Z" level=error msg="failed to read a36fbcdccb7ac318 on stream MsgApp v2 (read tcp **.***.***.***:51382->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:33:07Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=diagnostics/2e2539fe-77c7-e40e-aa8f-cf0b37da0291 module=store retry_count=0
time="2019-01-11T01:33:07Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes/default/ retry_count=0
time="2019-01-11T01:33:07Z" level=error msg="failed to read a36fbcdccb7ac318 on stream Message (read tcp **.***.***.***:52812->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:33:09Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks
time="2019-01-11T01:33:11Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=nodes retry_count=0
time="2019-01-11T01:33:11Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes retry_count=0
Without the master IP added to the join parameter:
time="2019-01-11T01:36:49Z" level=info msg="by using this product, you are agreeing to the terms of the StorageOS Ltd. End User Subscription Agreement (EUSA) found at: https://eusa.storageos.com" module=command
time="2019-01-11T01:36:49Z" level=info msg=starting address=**.***.***.*** hostname=c1w2c id=934146f4-14ef-ee0e-2a25-734fbb6daaf2 join="10.8.96.4,10.8.96.5" module=command version="StorageOS 1.1.0 (5e8ccdf), built: 2019-01-03T160821Z"
time="2019-01-11T01:36:57Z" level=info msg="kv store ready" action=wait address="http://127.0.0.1:5706" backend=embedded category=etcd module=cp
time="2019-01-11T01:36:58Z" level=info msg="this cluster is configured to send anonymous usage data to help us develop StorageOS (https://docs.storageos.com/docs/reference/telemetry)" module=cp
time="2019-01-11T01:36:58Z" level=info msg="joining scheduler elections" module=scheduler
time="2019-01-11T01:36:59Z" level=info msg="liocheck: OK" category=fslio module=dataplane proc=liocheck
time="2019-01-11T01:36:59Z" level=info msg="startup complete - ready for operation" module=command
time="2019-01-11T01:38:04Z" level=info msg="became leader, initialising" module=scheduler
time="2019-01-11T01:38:04Z" level=info msg="started leader tasks" action=establish category=leader module=scheduler term=1
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=nodes retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=list category=client error="context deadline exceeded" module=store prefix=volumes/default/ retry_count=0
time="2019-01-11T01:39:17Z" level=error msg="timeout accessing kv store" action=get category=client error="context deadline exceeded" key=diagnostics/934146f4-14ef-ee0e-2a25-734fbb6daaf2 module=store retry_count=0
time="2019-01-11T01:39:24Z" level=info msg="Etcd did not return any transaction responses for key (locks/scheduler)" module=store
time="2019-01-11T01:39:24Z" level=error msg="lock operation failure" error="value mismatch" key=locks/scheduler module=store-locks
time="2019-01-11T01:39:24Z" level=error msg="abandoning expired lock" key=locks/scheduler module=store-locks
time="2019-01-11T01:39:24Z" level=warning msg="lost leadership, stopping scheduler activities" module=scheduler
time="2019-01-11T01:39:24Z" level=info msg="leader told to stop, cancelling context" category=leader module=scheduler term=1
time="2019-01-11T01:39:24Z" level=info msg="leader tasks stopped" action=revoke category=leader module=scheduler term=1
time="2019-01-11T01:39:24Z" level=warning msg="volume watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:39:24Z" level=warning msg="node watcher received error 'watch stopped'" module=watcher
time="2019-01-11T01:39:24Z" level=info msg="received stop signal" action=start category=discovery module=ha service=node version=v1
time="2019-01-11T01:40:27Z" level=error msg="failed to read 2aa116e2b4865444 on stream Message (read tcp **.***.***.***:33772->**.***.***.***:5707: i/o timeout)" category=etcdserver module=store
time="2019-01-11T01:40:30Z" level=error msg="lock operation failure" error="context deadline exceeded" key=locks/scheduler module=store-locks
I replaced the public IPs with * as my system isn't exactly secure yet.
@jbonnett92 Hi, as per the logs you're running StorageOS 1.1.0, which is from the Helm chart 0.2.10 release.
The failure seems to be related to a connectivity issue with the embedded key-value store. Usually, when you have IPs in the join token, one of those IPs should be your current node's IP address. You've redacted the starting address (advertise IP), but there are some IPs in the join token. Can you check whether the advertise IP of this node is one of the IPs in the join token? If not, that could be the reason for the kv store connection timeouts.
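As a small illustration of that check, the sketch below tests whether a node's advertise IP appears in a comma-separated join list (the `advertise_in_join` helper is hypothetical, not a StorageOS command):

```shell
# Hypothetical helper: report whether this node's advertise IP appears
# in the comma-separated join list it was started with.
advertise_in_join() {
  ip="$1"; join="$2"
  case ",$join," in
    *",$ip,"*) echo "ok: $ip is in the join list" ;;
    *)         echo "missing: $ip is not in the join list" ;;
  esac
}

# Example with the IPs from the logs above:
advertise_in_join 10.8.96.3 "10.8.96.3,10.8.96.4,10.8.96.5"
```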
Also, I would recommend clearing /var/lib/storageos on all the nodes to ensure any old configuration is removed.
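A minimal sketch of that cleanup (the `clear_storageos_state` helper and the SSH loop in the comment are illustrative; adapt them to your own environment):

```shell
# Illustrative helper: wipe the StorageOS state directory so a fresh
# install doesn't pick up stale cluster configuration.
clear_storageos_state() {
  dir="${1:-/var/lib/storageos}"
  rm -rf "$dir"
}

# On a real cluster you would run this (with sudo) on every node, e.g.:
#   for node in 10.8.96.3 10.8.96.4 10.8.96.5; do
#     ssh "$node" 'sudo rm -rf /var/lib/storageos'
#   done
```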
An alternative, to avoid messing around with these IPs, is to use the cluster operator, which makes installation easier. You can follow the docs and try it out. Since you need the latest version, please add the following to the cluster spec when you try it:
apiVersion: "storageos.com/v1alpha1"
kind: "StorageOSCluster"
...
spec:
  ...
  images:
    nodeContainer: storageos/node:1.1.0
  ...
This will work with k8s 1.12 and below. Support for k8s 1.13 will be released soon.
Hi @darkowlzz, sorry, I just realised how confusing that sounded. All the nodes have two networks: a public one and a private one.
For the first join I added the private IPs of all the nodes (master and workers), and for the second only the worker nodes. I also tried the hostnames, but had the same issues.
I am currently on Kubernetes 1.13.
Private IPs should work fine. And this would work on k8s 1.13 as well, but not if you are installing using CSI; the in-tree plugin (non-CSI, the default) installation should work fine on k8s 1.13.
Here's an example setup; hope it helps. Let's say I have 3 nodes with internal IPs 10.1.10.165, 10.1.10.166 and 10.1.10.167. If I run the installation right, the log would be something like:
starting address=10.1.10.167 hostname=test01 id=9341sdf2-2a25-734fbb6daaf2 join="10.1.10.165,10.1.10.166,10.1.10.167"
Can you verify that's what you get in the logs?
I'll check if there's anything else that could be causing this issue.
@jbonnett92 can you also share the helm command with all the parameters you're running to install?
@darkowlzz Yes, and the only other things installed are Calico CNI (before StorageOS) and GitLab (after), so that GitLab can use the StorageOS PVC. That line was in the logs above, although it shows the public IPs for some reason.
Do you mean 3 nodes as in 1 master and 2 workers or 3 workers?
Here was the Helm command that I used:
helm install storageos/storageos --name=storageos --version=0.2.10 --namespace=storageos --set cluster.join="10.8.96.3\,10.8.96.4\,10.8.96.5"
@jbonnett92 What size is your cluster? When @darkowlzz mentioned 3 nodes, that was referring to 3 worker nodes. StorageOS will only work as a single-node cluster or as a cluster of three or more nodes. This is because we use etcd to maintain consensus, and it's not possible to achieve consensus with a two-node cluster.
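The arithmetic behind the "one node or three-plus nodes" rule is simple majority quorum; a sketch (the helper names are mine, not part of etcd or StorageOS):

```shell
# etcd needs a strict majority of members to agree. quorum(n) is the
# smallest majority of n members; tolerated_failures(n) is how many
# members can fail while the cluster keeps working.
quorum() { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }
```

With 2 nodes, quorum is 2 and no failures are tolerated, so a two-node cluster is no more resilient than a single node; with 3 nodes, quorum is 2 and one failure is tolerated.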
My suggestion would be to install StorageOS on three nodes. Alternatively, you can install StorageOS on a single node, but this means you won't be able to use volume replicas, and therefore your volumes won't be highly available.
If you have further questions I'm also available on our public slack channel slack.storageos.com
@avestuk I have 1 master and 2 workers.
@jbonnett92 then the only remaining option is to install StorageOS on a single node. A single-node installation has some limitations, but it'll work.
Hi, closing this out because I believe the original issue was resolved. @jbonnett92 if you get any issues not related to this ticket, please do come and find us on our public Slack channel, or mail support@storageos.com. Thanks!
Hi, I am trying to install StorageOS on my Kubernetes cluster; Kubernetes is installed on CoreOS. Looking at the logs, the one line that stands out is:
panic: runtime error: slice bounds out of range
Any ideas what could be causing this?
Thanks, Jamie