Closed — rivernews closed this issue 4 years ago
Following this post to install an EKB stack.
Oops! The terraform apply failed while provisioning the helm release! How to revert? You may run terraform destroy -target=helm_release....
first. But that doesn't guarantee the resources in k8s are cleaned up. Indeed, in my case all the pods, deployments, and statefulsets were still there.
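A targeted destroy might look like the following sketch; the resource addresses are borrowed from the release names used later in this issue, so adjust them to your own state:

```shell
# Destroy only the failed helm_release resources, leaving the rest of the stack intact.
terraform destroy -target=helm_release.elasticsearch
terraform destroy -target=helm_release.kibana
```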
To do a better cleanup, download the helm client:
brew install kubernetes-helm
Then run
helm --kubeconfig <your kubeconfig YAML> list
to list all the helm releases you have on the k8s cluster. The tiller image is pinned at
tiller_image = "gcr.io/kubernetes-helm/tiller:v2.16.1"
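For context, pinning the tiller image in the (pre-1.0) terraform helm provider might look roughly like this; install_tiller and the surrounding block are assumptions based on that provider's options, not copied from the actual config:

```hcl
# Sketch: terraform helm provider with a pinned tiller image
provider "helm" {
  install_tiller = true   # assumption: provider-managed tiller install
  tiller_image   = "gcr.io/kubernetes-helm/tiller:v2.16.1"
}
```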
We did an upgrade here from 2.11.0 to 2.16.1 in order to match the tiller version on the k8s server. How? This issue actually let us re-install tiller on k8s. Then delete the releases:
helm delete metricbeat-release
helm delete kibana-release
helm delete elasticsearch-release
./my-helm.sh del --purge elasticsearch-release
kubectl get all -n kube-system
Boom. The elasticsearch PVCs are still there, so delete them too:
./my-kubectl.sh delete pvc -n kube-system elasticsearch-master-elasticsearch-master-0 elasticsearch-master-elasticsearch-master-1 elasticsearch-master-elasticsearch-master-2
Next thing is to retry. This time, let's install things one by one. But before retrying, we want to change our chart config to avoid crashing again.
terraform apply
— and the elasticsearch pod ends up in Init:CrashLoopBackOff!
Debugging an init container. Seems like the file-permissions init container is having trouble:
./my-kubectl.sh logs elasticsearch-master-0 -n kube-system -c file-permissions
chown: /usr/share/elasticsearch/data/lost+found: Operation not permitted
chown: /usr/share/elasticsearch/data/lost+found: Operation not permitted
chown: /usr/share/elasticsearch/data: Operation not permitted
chown: /usr/share/elasticsearch/data: Operation not permitted
chown: /usr/share/elasticsearch/: Operation not permitted
chown: /usr/share/elasticsearch/: Operation not permitted
# describe the pod elasticsearch-master-0 -n kube-system shows the init container command:
file-permissions:
Command:
chown
-R
1000:1000
/usr/share/elasticsearch/
python release.py -f
./my-kubectl.sh get pods --namespace=kube-system -l app=elasticsearch-master -w
# see if the pod can get through the init containers successfully; if not, try to figure out why
./my-kubectl.sh logs elasticsearch-master-0 -n kube-system -c file-permissions
# then start over with the commands below
python release.py -d -t helm_release.elasticsearch
./my-helm.sh delete elasticsearch-release
./my-helm.sh del --purge elasticsearch-release
./my-kubectl.sh delete pvc -n kube-system elasticsearch-master-elasticsearch-master-0
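The start-over steps could be wrapped in a small helper; this is a sketch that reuses the my-helm.sh / my-kubectl.sh wrappers and the <name>-release naming convention from above (the script name reset-release.sh is made up):

```shell
#!/bin/sh
# reset-release.sh — tear down one helm release and its leftovers so terraform can retry.
# Usage: ./reset-release.sh elasticsearch
NAME="$1"
python release.py -d -t "helm_release.${NAME}"   # targeted terraform destroy via the release script
./my-helm.sh delete "${NAME}-release"            # delete the helm release
./my-helm.sh del --purge "${NAME}-release"       # purge release history (helm 2)
./my-kubectl.sh get pvc -n kube-system           # check for leftover PVCs to delete by hand
```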
Issue: We observed that Kibana has been in READY 0/1 for a very long time (more than 6 minutes), yet no error has surfaced. Inspecting the pod's log shows that it is still installing a 3rd plugin.
Cause: Potentially the compute resources are running low and everything gets super slow. The memory usage in the DO dashboard shows a 91% load.
Solution or next step: Decreasing the Kibana resource request might do it, or enlarging the DO k8s cluster's resources.
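Lowering the request could be done with a values override against the kibana chart; this is a sketch and the numbers are assumptions, not what was actually applied:

```yaml
# kibana chart values override: request less, keep a modest ceiling
resources:
  requests:
    cpu: "100m"      # assumption
    memory: "256Mi"  # assumption
  limits:
    cpu: "500m"
    memory: "512Mi"
```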
Debug workflow
python release.py -f
./my-kubectl.sh get pods --namespace=kube-system -l app=kibana -w
# see if the pod can get through the init containers successfully; if not, try to figure out why
./my-kubectl.sh logs --follow <kibana pod name> -n kube-system
# then start over
python release.py -d -t helm_release.kibana
./my-helm.sh delete kibana-release
./my-helm.sh del --purge kibana-release
python release.py -f
Debug log
A lot of things happened here.
A system:anonymous error: cannot create k8s resources like namespace, serviceaccount, etc. Solved it by changing the k8s credentials fed to the terraform kubernetes provider from client_key, client_certificate, etc. to only token, following the example in the terraform docs.
Then Kibana logs Unable to revive connection: http://elasticsearch-master:9200/ and Request Timeout after 30000ms.
Running curl inside the kibana pod:
sh-4.2$ curl -v http://elasticsearch-master:9200
* About to connect() to elasticsearch-master port 9200 (#0)
* Trying 10.245.114.47...
* Connection timed out
* Failed connect to elasticsearch-master:9200; Connection timed out
* Closing connection 0
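For reference, the token-only kubernetes provider setup mentioned above might look roughly like this (attribute names follow the pre-2.0 terraform kubernetes provider; the variable names are placeholders):

```hcl
# Sketch: authenticate the terraform kubernetes provider with a token instead of client certs
provider "kubernetes" {
  load_config_file       = false
  host                   = var.cluster_endpoint           # placeholder
  token                  = var.cluster_token              # service-account token replaces client_key/client_certificate
  cluster_ca_certificate = base64decode(var.cluster_ca)   # placeholder
}
```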
Elasticsearch gets OOMKilled. So we have to increase esJavaOpts from "-Xmx128m -Xms128m"
to 256m. Also, very importantly, we have to use the same value for both -Xms and -Xmx, otherwise it will throw an error.
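The resulting override might look like this sketch; only the esJavaOpts value comes from the issue, and the memory figures are assumptions to give the JVM heap some headroom:

```yaml
# elasticsearch chart values: keep -Xms and -Xmx identical
esJavaOpts: "-Xmx256m -Xms256m"
resources:
  requests:
    memory: "512Mi"  # assumption
  limits:
    memory: "512Mi"  # assumption
```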
Items
Handle data load and real-time sync with Postgres. Will create another ticket for this.
A branch elastic-stack is created for this issue.
Target resources
Reference