kubernetes / kubernetes

Update Fluentd Configurations to extract labels and other metadata #8001

Closed · satnam6502 closed this issue 8 years ago

satnam6502 commented 10 years ago

Continuing discussion from #3764 @a-robinson @mr-salty CC: @dchen1107 @roberthbailey

jimmidyson commented 10 years ago

@satnam6502 - I have the same concern about the cache, but it's required to reduce unnecessary calls to the api server. Options are putting a TTL on cached info (easy, but how long should the TTL be?) or using watches on pod info to update the cache (slightly trickier, but should be doable).

Bearer tokens would also have to handle refreshes, certs seem easier for now, but we can change if needed.

satnam6502 commented 10 years ago

I'm keen for us to keep building the base images as I do -- I try to use common blocks (ubuntu:14.04, jre-7, etc.) across the Elasticsearch, logging and other images to factor out shared layers and reduce load time.

smarterclayton commented 10 years ago

Once we have per node controllers we can deliver fluentd as a privileged pod with a service account secret which means rotation would be handled automatically.

satnam6502 commented 10 years ago

@jimmidyson I've been told by @roberthbailey that currently bearer tokens have an indefinite lifetime.

jimmidyson commented 10 years ago

@smarterclayton - Would the service account secret be a file containing a bearer token, or something else? Would we need to re-read the file at some point to pick up secret changes?

jimmidyson commented 10 years ago

@satnam6502 @roberthbailey but I assume that will change to require refreshes at some point?

jimmidyson commented 10 years ago

@satnam6502 - Other than the base OS image, I don't see the fluentd (Ruby) image containing any layers in common with, say, an Elasticsearch image (Java).

satnam6502 commented 10 years ago

@jimmidyson how about for now using the token-admin secret (as is done in #7988 to speak to the api server), and I can add a gem-install step and update the Fluentd config in the two Fluentd images? Once we have a better solution for accessing the apiserver from the master we can switch to that. @erictune

I am very keen to propagate labels so what you've written is very useful (and I had intended to write something similar -- except I had to learn Ruby first!).

roberthbailey commented 10 years ago

@jimmidyson You are correct that at some point we will need a credential refresh story.

smarterclayton commented 10 years ago

A file containing a signed JWT assertion that can be invalidated on the server and rotated. A restart or redeploy would pick up the newest one. Eventually we would expand the reach of it if necessary. That will probably land in the next week.

satnam6502 commented 10 years ago

@jimmidyson -- why don't you go ahead and make the PR to update the images? I tried to assign the issue to you but failed.

jimmidyson commented 10 years ago

@satnam6502 - let me add in the option to use bearer token too when I get a chance.

jimmidyson commented 10 years ago

@satnam6502 - Happy to! I'll look at updating the images & plugin config early next week.

jimmidyson commented 10 years ago

I've added in an option to use bearer_token_file in addition to client certs, released in 0.2.0.

@smarterclayton - can you expand on what is required to use the jwt assertion? Exchange the jwt for an access token? Can't see details at https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md where I was expecting to see how to use it.
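
For concreteness, a minimal sketch of what the two auth options might look like in the metadata filter config (the filter pattern, kubernetes_url, and cert paths here are illustrative; bearer_token_file is the option added in 0.2.0, and the token path shown is the conventional service account mount):

<filter kubernetes.**>
  type kubernetes_metadata
  kubernetes_url https://kubernetes-master:6443
  ca_file /path/to/ca.crt
  # Option 1: client certificate auth
  client_cert /path/to/client.crt
  client_key /path/to/client.key
  # Option 2 (new in 0.2.0): bearer token read from a file
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
</filter>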

smarterclayton commented 10 years ago

It would be passed as a bearer token; the authorizer would simply handle it differently than "normal" tokens.

jimmidyson commented 10 years ago

@smarterclayton That's easy then! Does a service account bearer token expire? If so, how should this be handled? Exit process & let container be recreated with new service account token allocated?

jimmidyson commented 10 years ago

@satnam6502 I'm adding in a cache TTL option. What do you think is a sane default? I'll add in a pod watch option soon too.

smarterclayton commented 10 years ago

Right now it doesn't expire but can be rotated at any time. In the event of a rotation, you'd want to trigger a rolling update or rollout (we don't have a good word for rolling out changes to a per node controller yet, although I would hope the deployment object could handle it eventually).

Our secret redistribution guarantee today is "when a new pod is created" which has some nice properties. @erictune and I had discussed on several occasions whether we needed a stronger guarantee, but in the short term both volumes and secrets specifically feel more aligned with the lifespan of the pod. If we allow pod updates in the future we could potentially have that align with updates.

satnam6502 commented 10 years ago

@jimmidyson if the purpose of the TTL is to limit how much data we store in the plug-in, plus limit the rate of API calls to the apiserver, then how about 1 hour?

a-robinson commented 10 years ago

@satnam6502 an hour seems really high if the point of refreshing is to pick up changes to the pod metadata. Once every 30 seconds or a minute seems like a more reasonable response time to changes.
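
As a sketch of what that knob might look like, assuming the plugin exposes the TTL as a cache_ttl parameter in seconds (the exact option name is the plugin's to define):

<filter kubernetes.**>
  type kubernetes_metadata
  # Re-fetch pod metadata after 30s instead of caching for an hour
  cache_ttl 30
</filter>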

jimmidyson commented 10 years ago

@a-robinson @satnam6502 once I get pod watching working properly it will make the TTL less (un?) important.

satnam6502 commented 10 years ago

@a-robinson OK.

jimmidyson commented 10 years ago

@satnam6502 @a-robinson Pod watching is now in the fluentd kubernetes plugin. Just working through the Docker images & testing.
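
Presumably this is enabled by something like a watch flag on the filter (a hedged sketch; the actual option name may differ). With a watch open against the apiserver, cached metadata is updated as pods change, so the TTL becomes a fallback rather than the primary refresh mechanism:

<filter kubernetes.**>
  type kubernetes_metadata
  # Keep the cache current via a watch on pod changes
  watch true
  # TTL now only matters if the watch misses an update
  cache_ttl 3600
</filter>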

satnam6502 commented 10 years ago

Great. Which Docker images?

jimmidyson commented 10 years ago

@satnam6502 The fluentd images as we discussed on this issue last week.

satnam6502 commented 10 years ago

I wonder if we need a branch for things going into V1.0 and things that are not?

jimmidyson commented 10 years ago

So you don't want label metadata in v1.0?

satnam6502 commented 10 years ago

Let's go for it and see if it can work reliably. The removal of secret access from mirror pods seems like an issue.

jimmidyson commented 10 years ago

What do you mean by "removal of secret access from mirror pods"? What's a mirror pod?

satnam6502 commented 10 years ago

The fluentd-elasticsearch logging pod and the fluentd-cloud-logging pods are special -- they are created from manifest files and one instance is laid down on each node.

jimmidyson commented 10 years ago

Ah ok. I can't find the details on secrets on mirror/static pods - do you have any details?

jimmidyson commented 10 years ago

Another option is to have a log aggregator running as a normal pod, & the log collector static pods can forward to the log aggregator for enrichment before storing in ES. Would prefer not to have to do that, but it's another option.
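
A hedged sketch of the forwarding link that option would need, using fluentd's built-in forward output on the collectors (the aggregator host name is hypothetical; 24224 is fluentd's default forward port). The aggregator, running as a normal pod with apiserver credentials, would apply the metadata filter before writing to ES:

# On each log collector static pod: ship raw events to the aggregator
<match kubernetes.**>
  type forward
  <server>
    host fluentd-aggregator
    port 24224
  </server>
</match>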

satnam6502 commented 10 years ago

Agreed, it would be good to avoid the extra log aggregation and try to maintain a direct link from Fluentd to Elasticsearch. I did try to use your plugin -- see https://github.com/GoogleCloudPlatform/kubernetes/compare/master...satnam6502:labels -- but I could not get it to work. Any idea what I might be doing wrong modulo secrets?

jimmidyson commented 10 years ago

I'm not at my computer right now, but it looks OK at a cursory glance. Do you get any errors?

satnam6502 commented 10 years ago

I don't get any errors -- but I don't see any log lines either. It's been a couple of days since I last looked at it -- I hope to get back to it after my current stack of tasks.

jimmidyson commented 10 years ago

@satnam6502 OK, I've seen what the issue is. Bad documentation on my part :) It does work, but it requires mounting the docker socket at the moment, which is not desirable as it would require a privileged container (otherwise SELinux will prevent access to the docker socket). Don't want to run a privileged container unless we have to. So hang tight with this piece of work - very nearly there :)

In the meantime, I've put in PR #8374 to make the Docker log symlink easier to parse. Currently there's no way to separate out the namespace & container name from the log name. Once this is in I'll update the functionality in the fluentd plugin to use it. I've been speaking about secrets/service accounts in static pods & it seems that this isn't really covered yet, but is essential for future use cases. Will raise a separate issue to discuss & see what comes of that.
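
Once the symlink naming from #8374 lands, the collector side should reduce to tailing the symlinked JSON logs and letting the plugin recover namespace & container name from the file name. A rough sketch (the paths and tag shown are conventional, not prescribed by the PR):

<source>
  type tail
  # Symlinks to the Docker JSON logs, named so namespace & container can be parsed out
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
</source>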

satnam6502 commented 10 years ago

I will hang tight :-)

satnam6502 commented 9 years ago

Un-assigning myself since I am leaving Google and Kubernetes.

a-robinson commented 9 years ago

Now that GKE has hit GA, I'll be carving off some time to get this working for Google Cloud Logging.

jimmidyson commented 9 years ago

@a-robinson We should be able to leverage the fluentd PR you're reviewing - might be best to get it all done in one for consistency, translating properties properly for Elasticsearch & GCL as required.

a-robinson commented 9 years ago

Yup, I'm hoping to be able to use that, but since GCL is going to require some extra work on fluent-plugin-google-cloud anyway I think it makes sense to get the ES improvements in first.

a-robinson commented 9 years ago

Just an update here: this is basically ready to go, primarily due to @jimmidyson's metadata filter plugin, with relevant integration PRs including #8632, https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/pull/30, https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/pull/31, and https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/6.

However, that plugin can't really do its thing until it can authenticate to the apiserver. We may be able to use a short term hack to get it access by using the kubelet's credentials, but the real solution will be to use the daemon controller once it's ready (#1518 / #13182).

In the meantime, we'll still be able to get most of the information attached, we'll just miss out on pod labels, pod ID, and namespace ID.

jchauncey commented 9 years ago

Where does this stand as far as running it on a non-Google Cloud Kubernetes stack?

jimmidyson commented 9 years ago

Just need to try it out with the daemon controller really. Without that you don't get pod labels in stored events, as the plugin can't auth to the API server. Would be good to get someone testing it with the daemon controller if possible.

jchauncey commented 9 years ago

Ok I'm still fairly new to this but I will give it a shot. I'm going to use the image in cluster/addons to build a fluentd image unless there is one already hosted on a public repo.

jimmidyson commented 9 years ago

That would be awesome! There is an updated image that should work for you on Docker Hub at fabric8/fluentd-kubernetes:1.3. Will be good to hear how you get on.

jchauncey commented 9 years ago

Hrm, I'm stuck on getting kubectl to take the manifest -

F1118 17:16:35.986943   20636 helpers.go:96] error validating "fluentd-daemon.yaml": error validating data: couldn't find type: v1beta1.DaemonSet; if you choose to ignore these errors, turn validation off with --validate=false

I have this when starting my api server

 --runtime-config=extensions/v1beta1=true,extensions/v1beta1/daemonsets=true,extensions/v1beta1/deployments=true

Version:

╰─○ kubectl version
Client Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.1", GitCommit:"92635e23dfafb2ddc828c8ac6c03c7a7205a84d8", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"1", GitVersion:"v1.1.1", GitCommit:"92635e23dfafb2ddc828c8ac6c03c7a7205a84d8", GitTreeState:"clean"}

manifest -

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  labels:
    heritage: helm
    k8s-app: fluentd-elasticsearch
    version: v1
    kubernetes.io/cluster-service: "true"
  spec:
    containers:
    - name: fluentd-elasticsearch
      image: fabric8/fluentd-kubernetes:1.3
      securityContext:
        privileged: true
      resources:
        limits:
          cpu: 100m
      volumeMounts:
      - name: varlog
        mountPath: /var/log
      - name: varlibdockercontainers
        mountPath: /var/lib/docker/containers
        readOnly: true
      env:
      - name: "ES_HOST"
        value: "elasticsearch"
      - name: "ES_PORT"
        value: "9200"
    volumes:
    - name: varlog
      hostPath:
        path: /var/log
    - name: varlibdockercontainers
      hostPath:
        path: /var/lib/docker/containers

rwehner commented 9 years ago

@jchauncey I'm no expert (just trying to get this going like you), but I think the daemonset manifest you have needs a few tweaks to add .spec.template.metadata and .spec.template.spec.containers so it looks something like:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  labels:
    heritage: helm
    k8s-app: fluentd-elasticsearch
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      name: fluentd-elasticsearch
      labels:
        heritage: helm
        k8s-app: fluentd-elasticsearch
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: fabric8/fluentd-kubernetes:1.3
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: 100m
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        env:
        - name: "ES_HOST"
          value: "elasticsearch"
        - name: "ES_PORT"
          value: "9200"
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

That and an apiserver with --runtime-config=extensions/v1beta1/daemonsets=true,extensions/v1beta1/deployments=true,extensions/v1beta1/jobs=true,extensions/v1beta1/ingress=true seemed to get the pods running on all nodes for me. I haven't verified actual log output yet, but can report back as well when I do. Good luck!

rwehner commented 9 years ago

@jchauncey I ran the following on one of my apiservers to verify: curl -s http://${apiserver}:8080/apis/extensions/v1beta1

On my v1.1.1 cluster where this appears to be working I get:

{
  "kind": "APIResourceList",
  "groupVersion": "extensions/v1beta1",
  "resources": [
    {
      "name": "daemonsets",
      "namespaced": true
    },
    {
      "name": "daemonsets/status",
      "namespaced": true
    },
...

jchauncey commented 9 years ago

No matter what manifest I try, I still get:

error validating "fluentd-daemon.yaml": error validating data: couldn't find type: v1beta1.DaemonSet; if you choose to ignore these errors, turn validation off with --validate=false