Closed — rivernews closed this issue 4 years ago
Following this post to install an EKB stack.
Oops! The terraform apply failed while provisioning the helm release! How to revert? You may run terraform destroy -target=helm_release....
first. But that doesn't guarantee the resources in k8s are cleaned up. Indeed, in my case all the pods, deployments, and statefulsets were still there.
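A targeted destroy might look like the following sketch; the resource addresses are borrowed from the release names used later in this issue, so adjust them to your own state:

```shell
# Destroy only the failed helm_release resources, leaving the rest of the stack intact.
terraform destroy -target=helm_release.elasticsearch
terraform destroy -target=helm_release.kibana
```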
To do a better cleanup, download the helm client:
brew install kubernetes-helm
Then run
helm --kubeconfig <your kubeconfig YAML> list
to list all the helm releases you have on the k8s cluster. The tiller image is pinned at
tiller_image = "gcr.io/kubernetes-helm/tiller:v2.16.1"
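For context, pinning the tiller image in the (pre-1.0) terraform helm provider might look roughly like this; install_tiller and the surrounding block are assumptions based on that provider's options, not copied from the actual config:

```hcl
# Sketch: terraform helm provider with a pinned tiller image
provider "helm" {
  install_tiller = true   # assumption: provider-managed tiller install
  tiller_image   = "gcr.io/kubernetes-helm/tiller:v2.16.1"
}
```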
We did an upgrade here from 2.11.0 to 2.16.1 in order to match the tiller version on the k8s server. How? This issue actually let us re-install tiller on k8s. Then delete the releases:
helm delete metricbeat-release
helm delete kibana-release
helm delete elasticsearch-release
./my-helm.sh del --purge elasticsearch-release
kubectl get all -n kube-system
Boom. The elasticsearch PVCs are still there, so delete them too:
./my-kubectl.sh delete pvc -n kube-system elasticsearch-master-elasticsearch-master-0 elasticsearch-master-elasticsearch-master-1 elasticsearch-master-elasticsearch-master-2
Next thing is to retry. This time, let's install things one by one. But before retrying, we want to change our chart config to avoid crashing again.
terraform apply
— and the elasticsearch pod ends up in Init:CrashLoopBackOff!
Debugging an init container. Seems like the file-permissions init container is having trouble:
./my-kubectl.sh logs elasticsearch-master-0 -n kube-system -c file-permissions
chown: /usr/share/elasticsearch/data/lost+found: Operation not permitted
chown: /usr/share/elasticsearch/data/lost+found: Operation not permitted
chown: /usr/share/elasticsearch/data: Operation not permitted
chown: /usr/share/elasticsearch/data: Operation not permitted
chown: /usr/share/elasticsearch/: Operation not permitted
chown: /usr/share/elasticsearch/: Operation not permitted
# describe the pod elasticsearch-master-0 -n kube-system shows the init container command:
file-permissions:
Command:
chown
-R
1000:1000
/usr/share/elasticsearch/
python release.py -f
./my-kubectl.sh get pods --namespace=kube-system -l app=elasticsearch-master -w
# see if the pod can get through the init containers successfully; if not, try to figure out why
./my-kubectl.sh logs elasticsearch-master-0 -n kube-system -c file-permissions
# then start over with the commands below
python release.py -d -t helm_release.elasticsearch
./my-helm.sh delete elasticsearch-release
./my-helm.sh del --purge elasticsearch-release
./my-kubectl.sh delete pvc -n kube-system elasticsearch-master-elasticsearch-master-0
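The start-over steps could be wrapped in a small helper; this is a sketch that reuses the my-helm.sh / my-kubectl.sh wrappers and the <name>-release naming convention from above (the script name reset-release.sh is made up):

```shell
#!/bin/sh
# reset-release.sh — tear down one helm release and its leftovers so terraform can retry.
# Usage: ./reset-release.sh elasticsearch
NAME="$1"
python release.py -d -t "helm_release.${NAME}"   # targeted terraform destroy via the release script
./my-helm.sh delete "${NAME}-release"            # delete the helm release
./my-helm.sh del --purge "${NAME}-release"       # purge release history (helm 2)
./my-kubectl.sh get pvc -n kube-system           # check for leftover PVCs to delete by hand
```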
Issue: We observed that Kibana has been in READY 0/1 for a very long time (more than 6 minutes), yet no error has surfaced. Inspecting the pod's log shows that it is still installing a 3rd plugin.
Cause: Potentially the compute resources are running low and everything gets super slow. The memory usage in the DO dashboard shows a 91% load.
Solution or next step: Decreasing the Kibana resource request might do it, or enlarging the DO k8s cluster's resources.
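Lowering the request could be done with a values override against the kibana chart; this is a sketch and the numbers are assumptions, not what was actually applied:

```yaml
# kibana chart values override: request less, keep a modest ceiling
resources:
  requests:
    cpu: "100m"      # assumption
    memory: "256Mi"  # assumption
  limits:
    cpu: "500m"
    memory: "512Mi"
```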
Debug workflow
python release.py -f
./my-kubectl.sh get pods --namespace=kube-system -l app=kibana -w
# see if the pod can get through the init containers successfully; if not, try to figure out why
./my-kubectl.sh logs --follow <kibana pod name> -n kube-system
# then start over
python release.py -d -t helm_release.kibana
./my-helm.sh delete kibana-release
./my-helm.sh del --purge kibana-release
python release.py -f
Debug log
A lot of things happened here.
A system:anonymous error: cannot create k8s resources like namespace, serviceaccount, etc. Solved it by changing the k8s credentials fed to the terraform kubernetes provider from client_key, client_certificate, etc. to only token, following the example in the terraform docs.
Then Kibana logs Unable to revive connection: http://elasticsearch-master:9200/ and Request Timeout after 30000ms.
Running curl inside the kibana pod:
sh-4.2$ curl -v http://elasticsearch-master:9200
* About to connect() to elasticsearch-master port 9200 (#0)
* Trying 10.245.114.47...
* Connection timed out
* Failed connect to elasticsearch-master:9200; Connection timed out
* Closing connection 0
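For reference, the token-only kubernetes provider setup mentioned above might look roughly like this (attribute names follow the pre-2.0 terraform kubernetes provider; the variable names are placeholders):

```hcl
# Sketch: authenticate the terraform kubernetes provider with a token instead of client certs
provider "kubernetes" {
  load_config_file       = false
  host                   = var.cluster_endpoint           # placeholder
  token                  = var.cluster_token              # service-account token replaces client_key/client_certificate
  cluster_ca_certificate = base64decode(var.cluster_ca)   # placeholder
}
```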
Elasticsearch gets OOMKilled. So we have to increase esJavaOpts from "-Xmx128m -Xms128m"
to 256m. Also, very importantly, we have to use the same value for both -Xms and -Xmx, otherwise it will throw an error.
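The resulting override might look like this sketch; only the esJavaOpts value comes from the issue, and the memory figures are assumptions to give the JVM heap some headroom:

```yaml
# elasticsearch chart values: keep -Xms and -Xmx identical
esJavaOpts: "-Xmx256m -Xms256m"
resources:
  requests:
    memory: "512Mi"  # assumption
  limits:
    memory: "512Mi"  # assumption
```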
Items
Handle data load and real-time sync with Postgres. Will create another ticket for this.
A branch elastic-stack is created for this issue.
Target resources
Reference