k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

k0s Documentation to 'change the syslog' reporting #3122

Closed Bugs5382 closed 1 year ago

Bugs5382 commented 1 year ago

Is your feature request related to a problem? Please describe.

For the past few days I have been monitoring our fixed k0s environment. It runs for about 2 days and then crashes, and I eventually noticed an 80 GB HDD filling up on the primary, active controller. When I looked, the syslog was filled to the max with k0s messages that had been spinning out of control.

I have been looking through the k0s code and documentation for a way to bring this under control, and I can't seem to find one. Obviously there are reasons to be so verbose during diagnostics, but this should be controllable with the standard trace, debug, info, warn, error, critical, etc. levels that we all love.

Describe the solution you would like

In the normal config for either controller+worker, controller, or worker roles, a setting in the config would be:

sysLogLevel: 'warning"

which would only generate warnings, and/or route k0s output away from the syslog rule at /var/log/syslog and into /var/log/k0s instead, with rotation set up on that file and the same log level setting applied to it.
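For illustration, a rough sketch of the kind of redirection I mean, done at the rsyslog/logrotate level (the file names 10-k0s.conf and /var/log/k0s.log are just placeholders I picked, not anything shipped by k0s):

# /etc/rsyslog.d/10-k0s.conf -- route k0s messages to a dedicated file
if $programname == 'k0s' then /var/log/k0s.log
& stop

# /etc/logrotate.d/k0s -- rotate that file so it cannot fill the disk
/var/log/k0s.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}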

Describe alternatives you've considered

I have looked through the code and can't find anything, and making syslog rotation more aggressive defeats its purpose for other applications.

Additional context

No response

juanluisvaladas commented 1 year ago

Hi @Bugs5382 thanks for the report.

First of all, it's not normal to have so many logs. I would like to see if there is some problem causing the logs to reach an abnormal volume. Can you please provide a sample of the logs? I want to see if a particular component is flooding them.

Also, I want to acknowledge that we do need to spend some time on logs and their configuration; it's something we need to improve. We need to study how to handle this.

Bugs5382 commented 1 year ago

I can upload the logs. I have to scrub the syslog first, since it contains details that aren't public-facing. Where can I send them?

Bugs5382 commented 1 year ago

@juanluisvaladas I sent this over to your email.

juanluisvaladas commented 1 year ago

Hi @Bugs5382,

I had a quick look at the logs and I found this:

$ wc -l syslog.1
 2182028 syslog.1
$ grep 'Failed to get a backend' syslog.1 -c
1218286

These lines look like: server.go:454] "Failed to get a backend" err="No agent available" dialID=<REMOVED> component=konnectivity stream=stderr

This means there is a problem with Konnectivity. Every time I have seen this (and it happens quite often), it's because in an HA setup either there isn't a load balancer or it isn't properly configured.

You can either supply your own external load balancer or you can let k0s deploy it for you using NLLB.
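For reference, NLLB is enabled in the k0s configuration. A minimal sketch based on the k0s documentation (check the docs for your k0s version, as field names may differ):

spec:
  network:
    nodeLocalLoadBalancing:
      enabled: true
      type: EnvoyProxy

With this enabled, each worker reaches the controllers through a local proxy, so a separate external load balancer is not strictly required for the worker-to-controller traffic.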

I don't have any news regarding the changes in the logging configuration yet.

PS: We're glad to make an exception for this issue, but support for confidential data falls under commercial support for k0s. Should you explicitly request it, I can share your contact information with a salesperson who can reach out to you with further details about our commercial offering.

Bugs5382 commented 1 year ago

@juanluisvaladas Thanks for looking it over. I agree, and thanks for making an exception, but back to the feature request: if the logs were cleaner and error messages were filtered from info, etc., I could have found it myself. It just kept churning.

I do have my own NLB that I can implement. I will work on that and post my findings. I have a Citrix NetScaler load balancer that I will attempt to use.

Bugs5382 commented 1 year ago

@juanluisvaladas and @jnummelin - Good news. I had to redo my Terraform script, since the last one I was using was bad, and I was able to successfully get my Citrix LB to work. The syslog is now very quiet, and I even tuned the LB settings to get rid of the error messages related to #2068 popping up.

I will write up some documentation, because I feel it's needed for a non-HAProxy LB setup like the one I have. I will even post my Citrix LB config for anyone to use.

I am using Chef to generate and "tag" my k0s nodes, put them in a data bag, and then generate the Terraform script, which in turn generates the config file. I am going to incorporate all of this on my side.

The only error now is: May 18 12:52:20 xxxxxk0scontroller01 k0s[1512]: time="2023-05-18 12:52:20" level=info msg="2023-05-18 12:52:20.777031 I | http: TLS handshake error from x.x.x.254:34432: EOF" component=k0s-control-api

That is the SNIP from the Citrix LB doing its health checks, and I am trying to eliminate it since I have TCP monitors up. I'm going to reference this issue for the PR.

juanluisvaladas commented 1 year ago

Hi @Bugs5382

 May 18 12:52:20 xxxxxk0scontroller01 k0s[1512]: time="2023-05-18 12:52:20" level=info msg="2023-05-18 12:52:20.777031 I | http: TLS handshake error from x.x.x.254:34432: EOF" component=k0s-control-api

This is almost certainly because the load balancer is doing a TCP health check rather than an HTTPS request, which makes sense given what I saw in your PR:

add lb monitor tcp_kube_6443 TCP -LRTM DISABLED -destPort 6443
add lb monitor tcp_kube_8132 TCP -LRTM DISABLED -destPort 8132
add lb monitor tcp_kube_9443 TCP -LRTM DISABLED -destPort 9443

For kube-apiserver (port 6443) you can query the /livez endpoint, but these endpoints are authenticated. You can generate a user for them as long as the user has the clusterrole system:monitoring. This is something that we should also address in the HAProxy documentation...

More info:
1. https://kubernetes.io/docs/reference/using-api/health-checks/
2. https://kubernetes.io/docs/reference/access-authn-authz/rbac/
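To make the idea concrete, a minimal sketch of the RBAC side (the names lb-health-checks and lb-health-checker are placeholders for whatever identity the load balancer authenticates as):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: lb-health-checks
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:monitoring
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: lb-health-checker

The monitor would then do an authenticated HTTPS probe along these lines, with a credential for that identity:

curl -k -H "Authorization: Bearer <token>" https://<controller>:6443/livez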

PS: I had a quick look at the PR, but it's quite big and today is a busy day; I'll review it early next week.

Bugs5382 commented 1 year ago

Yep, no worries. It also adds a new document for Terraform. :)

juanluisvaladas commented 1 year ago

Hi @Bugs5382, I'm closing this issue and opening #3164 instead so that we have a clean discussion focused only on that.