kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Create guidelines for consuming application logs #42718

Closed crassirostris closed 7 years ago

crassirostris commented 7 years ago

Right now there are no clear guidelines as to how to intercept and handle logs produced by application containers on the node. For example, if I wanted to implement or adapt a new solution for exporting logs today, I would have a hard time figuring out how to do it. This has created some confusion in the past, one example being this issue: https://github.com/kubernetes/kubernetes/issues/24677.

At the same time, CRI is coming and container logs are an important part of the runtime interface. There's a proposal about log management in CRI, and conformance to CRI is probably going to break existing log consumers.

It seems necessary to create an instruction for anyone who wants to integrate with the Kubernetes logging pipeline. This instruction should align as much as possible with CRI and give a detailed answer to the question of how to integrate a log consumer with Kubernetes. Bringing this up will also help discover weak points of the current integration.

This issue brings up details for the potential guide to clarify, with commentary and links to the related issues.

As with the CRI logging proposal, I suggest stating that handling log files inside the container is an explicit non-goal. On one hand, it can be implemented already and there is a guide for this; on the other hand, this might even potentially become a part of the container engine, eliminating the need to handle this case separately. In any case, this problem deserves its own separate discussion.

Everyone is very welcome to give feedback and point out problems.

Where logs are stored and how they are discovered

The log consumer should read application logs directly from the runtime. According to CRI, logs are stored under the /var/log/pods directory. For each container there is a file /var/log/pods/<podUID>/<containerName>_<instance#>.log which contains the logs for the given run of the given container from the given pod.

The log consumer itself should watch for new directories and log files, and likewise be ready for log files to be removed at any time.

There's an issue about moving towards this file-based interface: https://github.com/kubernetes/kubernetes/issues/42188.
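
To make the discovery contract concrete, here is a minimal Go sketch (an illustration of the layout described above, not an official API) of how a consumer might enumerate log files under /var/log/pods and recover the pod UID, container name, and instance number from the path:

package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

func main() {
	// Matches /var/log/pods/<podUID>/<containerName>_<instance#>.log
	files, err := filepath.Glob("/var/log/pods/*/*.log")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		podUID := filepath.Base(filepath.Dir(f))
		base := strings.TrimSuffix(filepath.Base(f), ".log")
		sep := strings.LastIndex(base, "_")
		if sep < 0 {
			continue // unexpected file name, skip it
		}
		containerName, instance := base[:sep], base[sep+1:]
		fmt.Printf("pod=%s container=%s instance=%s file=%s\n", podUID, containerName, instance, f)
	}
}

A real consumer would watch the directory tree (e.g. via inotify) rather than globbing once, since pods come and go at any time.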

Log format

Logs inside log files are written in the CRI logging format: each log entry is decorated with an RFC 3339Nano timestamp prefix and the stream name (i.e., "stdout" or "stderr"), and ends with a newline. Examples:

2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1
2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2

This means that each line is decorated and should be unwrapped. This is the minimal set of metadata needed for correctly handling log entries. More information about why this format was chosen can be found in the CRI logging proposal.
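
For illustration, unwrapping a single decorated line is straightforward; here is a minimal Go sketch, assuming the space-separated layout shown in the examples above:

package main

import (
	"fmt"
	"strings"
	"time"
)

// parseCRILogLine splits "<RFC3339Nano timestamp> <stream> <content>" into its parts.
func parseCRILogLine(line string) (time.Time, string, string, error) {
	parts := strings.SplitN(strings.TrimSuffix(line, "\n"), " ", 3)
	if len(parts) != 3 {
		return time.Time{}, "", "", fmt.Errorf("malformed log line: %q", line)
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return time.Time{}, "", "", err
	}
	return ts, parts[1], parts[2], nil
}

func main() {
	ts, stream, content, err := parseCRILogLine("2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1")
	if err != nil {
		panic(err)
	}
	fmt.Println(ts, stream, content)
}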

Issue about log format in containerd: https://github.com/docker/containerd/issues/603

Log rotation

Log files should be rotated so that logs don't consume all available disk space. Rotation can happen at any time. Rotation is performed by moving the old file away and creating a new log file.

The old file may be available for some time after rotation, but its lifespan is not guaranteed, because disk starvation on the node may require deleting all logs right away.

Note: This does not align with the CRI perfectly, but using external tools for rotation in the current fashion (copytruncate, without closing the log file), as described in the existing CRI logging proposal, turned out not to be the best solution for a number of reasons. Also, moving the old log file away and creating a new one is an approach better supported by existing tools.
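
For illustration, a consumer tailing a log file can detect this style of rotation by checking whether the path still refers to the file it has open; a minimal Go sketch:

package tail

import "os"

// rotated reports whether the file at path is no longer the file we have open,
// i.e. the old file was moved away (or deleted) and possibly replaced by a new one.
func rotated(openFile *os.File, path string) (bool, error) {
	openedInfo, err := openFile.Stat()
	if err != nil {
		return false, err
	}
	currentInfo, err := os.Stat(path)
	if err != nil {
		// The path is gone, e.g. logs were deleted under disk pressure.
		return true, nil
	}
	return !os.SameFile(openedInfo, currentInfo), nil
}

After draining the old handle, the consumer reopens the path to pick up the new file.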

Metadata

The log consumer should extract from the log file name enough metadata to identify the pod and container. To obtain further pieces of metadata, such as the pod name, namespace, and labels, the log consumer should query the API server for pods on this node and build a reverse lookup by podUID, which can then be used to get the necessary metadata. The log consumer should either attach the metadata to each log entry (or to a log stream), similar to this solution, or store the metadata in a dedicated backend for joining with log entries later, because pods can be rescheduled or deleted from the API server.
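
A minimal Go sketch of the reverse lookup using client-go (current client-go API; the NODE_NAME environment variable and the spec.nodeName field selector are assumptions about how the consumer limits the query to its own node):

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

type podMeta struct {
	Name, Namespace string
	Labels          map[string]string
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// NODE_NAME is assumed to be injected into the consumer, e.g. via the downward API.
	nodeName := os.Getenv("NODE_NAME")

	// List the pods scheduled on this node.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}

	// Reverse lookup: podUID (taken from the log file path) -> metadata.
	byUID := map[string]podMeta{}
	for _, p := range pods.Items {
		byUID[string(p.UID)] = podMeta{Name: p.Name, Namespace: p.Namespace, Labels: p.Labels}
	}
	fmt.Println(byUID)
}

A production consumer would keep this map fresh with a watch instead of a one-off list, and keep entries around for a while after pod deletion so that late log lines can still be attributed.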

Potentially, the kubelet could become responsible for providing this metadata, since the information is already there.

/cc @thockin @Random-Liu @yujuhong @dashpole @dchen1107 @vishh @piosz @dcowden @coffeepac @repeatedly @edsiper @igorpeshansky

xilabao commented 7 years ago

/cc

crassirostris commented 7 years ago

@richm raised a concern in https://github.com/kubernetes/kubernetes/issues/42188#issuecomment-285070322 that acquiring metadata from the API server in a big cluster can have a significant overhead. It may also be unreliable, causing the log consumer to stop processing logs from a newly started pod until the API server is available again.

Maybe it makes sense to expose a versioned API for listing pods on the kubelet?

crassirostris commented 7 years ago

To address @sosiouxme's https://github.com/kubernetes/kubernetes/issues/42188#issuecomment-285163025

The problem with relying on looking up metadata from the k8s API is that pods/namespaces/etc can be deleted or modified. If you are aggregating log data to a log analysis solution or for any reason want to keep track of logs after pods have been deleted, you won't be able to look that up in k8s after the fact. You have to either look it up as the logs are coming through and store the metadata with each log entry, or you have to separately store the metadata and join it in your logging solution (and I don't think ElasticSearch at least will let you search for/aggregate logs by a field in "joined" data so it's dubious whether that's even a valid option).

What I had in mind when suggesting collecting metadata directly from the API server was joining it with each log message (or a stream of log messages; that's an implementation detail) before pushing it to the logging backend, something like what https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter already does. I definitely didn't mean to suggest joining live data with the old messages. Sorry for the confusion, I will clarify this detail, thanks for pointing it out!

coffeepac commented 7 years ago

@crassirostris +1 looks great! thanks!

igorpeshansky commented 7 years ago

/cc @qingling128 @dhrupadb @rbuskens

timothysc commented 7 years ago

re: logging format, why can't we pass the data to enrich w/ meta-data directly https://docs.docker.com/engine/admin/logging/log_tags/ ? This bypasses the API argument and most of the data already exists in env variables.

/cc @dchen1107

jcantrill commented 7 years ago

@crassirostris

Logs inside log files are written in the CRI logging format: each log entry is decorated with a RFC 3339Nano timestamp prefix, the stream name (i.e., "stdout" or "stderr"), and ends with a newline.

How is 'stream' envisioned? Would this include true application logs that do not go to 'stdout' or 'stderr'? For instance, what if I had something like an Apache server that wrote an 'access' log? Would I expect my stream to be named based on a resource definition from the pod spec, or inferred from the log name written to a known directory?

crassirostris commented 7 years ago

@jcantrill I would prefer not to specify "stream" here, similarly to the CRI logging proposal, as it might change. Currently, container runtimes like Docker only support stdout and stderr, and there's official k8s documentation on how to converge log files to this interface. Maybe in the future runtimes will support more than that; it doesn't contradict the interface, but that's out of the scope of CRI, at least for now.

That's my understanding. @yujuhong please correct me if I'm wrong.

crassirostris commented 7 years ago

BTW, @timothysc

re: logging format, why can't we pass the data to enrich w/ meta-data directly https://docs.docker.com/engine/admin/logging/log_tags/ ? This bypasses the API argument and most of the data already exists in env variables.

Is it CRI-compatible?

richm commented 7 years ago

@timothysc That would work too. How would that get written to the logs? As additional json parameters in the log, e.g.

{"pod_uuid": "f8b0761c-4b5b-4c44-b347-b2a8c0d4f65f",
 "namespace_uuid": "62d35183-f695-4d3e-9881-567a438249a5",
 "pod_labels": ["label1", "label2"],
 "pod_annotations": ["annotation1", "annotation2"],
 "time": ...,
 "message": ....,
 "stream": "stdout"
}

?

thockin commented 7 years ago

We certainly don't want to log all that with every message. Given that UUID is UU, can't that be squirrelled away in a lookup table somewhere else out of the fast path?

Alternatively, something like a per-file header or special meta-record every few MB or something....


crassirostris commented 7 years ago

@richm What do you think about watching the API server for pod updates and storing a lookup table?

richm commented 7 years ago

@crassirostris Do you mean in the log collector (e.g. fluentd)?

crassirostris commented 7 years ago

@richm Yes, sorry for the confusion. I'm basically advocating moving all enrichment logic to the log collector.

richm commented 7 years ago

@crassirostris This is what https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter is doing: it keeps a watch on the api server for changes to pod/namespace metadata, and enriches log records. Are you suggesting something different?

crassirostris commented 7 years ago

@richm No, I would love to see this approach standardized. I imagine companies that would prefer to implement their own logs exporter, not based on fluentd.

jcantrill commented 7 years ago

@crassirostris We have requested https://trello.com/c/5k85WYq1 being able to get the info from the kubelet instead of the API server, to avoid the issues at scale we have seen with exhausting connections. Something to consider.

crassirostris commented 7 years ago

@jcantrill Thanks!

As a side note, @wojtek-t suggested that switching to watching the apiserver will improve performance and may solve the problem with the load completely: the kubelet holds several watches on the apiserver already and it has proved able to scale. AFAIR, the node team is blocked on the api-machinery team for implementing an API on the kubelet, and that may take a while. Watching the apiserver may be the way to go until the kubelet endpoint is implemented.

richm commented 7 years ago

@crassirostris

No, I would love to see this approach standardized. I imagine companies that would prefer to implement their own logs exporter, not based on fluentd.

and

@richm Yes, sorry for the confusion. I'm basically advocating moving all enrichment logic to the log collector.

What log collector are you talking about? Any log collector, e.g. filebeat, logstash, splunk, fluentbit, rsyslog, fluentd? That would mean implementing support for <..> in all of these log collectors?

crassirostris commented 7 years ago

@richm

What log collector are you talking about? Any log collector e.g. filebeat, logstash, splunk, fluentbit, rsyslog, fluentd?

Yes, any log collector, that's what I meant.

That would mean implementing support for <..> in all of these log collectors?

Yes. Do I understand you correctly that your point is that those collectors are already well integrated with popular logging formats, like journald, which support metadata, and that this makes it easy to introduce k8s-specific metadata without additional coding work on the log collector side?

richm commented 7 years ago

@crassirostris I know that fluentd and rsyslog support reading from journald. I would assume they all support reading from journald. So yes, if the k8s specific metadata were introduced into the log record as additional fields, like journal metadata, that would make it very easy on the log collector side.

vishh commented 7 years ago

FYI: https://github.com/kubernetes/community/pull/306 talks about managing logs in the kubelet, which will require coordination between the kubelet and logging agents to guard against prematurely rotating logs.

As with the CRI logging proposal, I suggest stating log files inside container as explicit non-goal. On one hand, it can be implemented already and there is a guide for this, on the other hand, this might even potentially become a part of the container engine, eliminating the need to handle this case separately. In any case, this problem deserves its own separate discussion.

I disagree. Kubernetes should provide a Logging API that is not tied to a runtime. Given the recent changes with CRI, runtimes will have almost no control over logs and so kubernetes will have to manage log data outside of stdout/stderr as well.

According to CRI, logs are stored under /var/log/pods directory.

I'd like to clarify that CRI is an internal Kubelet API. CRI might store logs under an internal directory too. It would be better to phrase these statements as "the kubelet will store logs under the /var/log/pods directory".

Issue about log format in containerd: docker/containerd#603

Kubelet should ideally receive the raw log stream from containerd and then add its own format, to avoid having to standardize its format in containerd. This is possible since containerd exposes logs via a FIFO based API.

Log rotation

You do not state which agent will be performing rotation. It is an important detail. For storage capacity isolation purposes, it is necessary that kubelet manages log rotation to keep long running pods active even if they log beyond their logging storage limits.

log consumer should query API server for pods

Why should every logging agent in the world have to include a kubernetes client? This proposal states that logging data will be made available via a file based API. Why not make metadata also available via the same API? For example, why not dump the Pod object from kubelet periodically to provide metadata for each pod? This metadata file could be under /var/log/pod/<podUID>/.metadata for example. Such an API will lower the QPS on the apiserver and will make the logging agents work even in case of network outages that affect API server availability.
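
For illustration only: assuming such a hypothetical .metadata file contained a JSON-serialized v1.Pod (no such file exists today), a consumer could read it with nothing more than a JSON parser, e.g. in Go:

package main

import (
	"encoding/json"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Path passed as an argument, e.g. the hypothetical /var/log/pods/<podUID>/.metadata
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	var pod corev1.Pod
	if err := json.Unmarshal(data, &pod); err != nil {
		panic(err)
	}
	fmt.Println(pod.Namespace, pod.Name, pod.UID, pod.Labels)
}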

@crassirostris I'd like to finalize the logging API soon since it's important for storage isolation. Can you convert this issue into a proposal, since tracking and responding to all comments across various sub-topics in this issue is pretty darn difficult?

vishh commented 7 years ago

@richm journald cannot (yet) handle per-container rate limiting and capacity isolation, so it's not an option for multi-tenant k8s clusters. I sympathise with the desire for a single interface to consume logs (journalctl). Maybe a host-level agent can ship all log data in /var/log/pods into journald?

richm commented 7 years ago

@vishh Fine with me. You can't base your logging solution on journald. I will have to use what I can get. I must have pod metadata (uuid, labels, annotations) and namespace metadata (uuid) that I can associate with each log record. I don't care how I get them, as long as they are readily available for general purpose log collecting agents. Note that labels and annotations are dynamic - they can be changed during the lifetime of the pod, so just writing a static file somewhere at pod creation time that contains the pod labels and annotations is not sufficient.

timothysc commented 7 years ago

Is it CRI-compatible?

I dunno, I think it should be.

@richm The examples show it as preamble, fwiw I only think you need pod_uuid b/c all other data can be cross referenced.

richm commented 7 years ago

@timothysc

The examples show it as preamble

which examples?

you need pod_uuid b/c all other data can be cross referenced.

cross referenced in some other way other than querying the API service?

timothysc commented 7 years ago

which examples?

https://docs.docker.com/engine/admin/logging/log_tags/

cross referenced in some other way other than querying the API service?

This will only happen during inspection, correlated with event data that can/should be pushed to ES. If you do it this way, you put all the load onto your ES queries and how you present the data.

richm commented 7 years ago

This will only happen during inspection, correlated with event data that can/should be pushed to ES. If you do it this way, you put all the load onto your ES queries and how you present the data.

So, by doing some sort of "join" to a "table" of K8s configuration + metadata which could be maintained separately (e.g. by a k8s configuration watcher). I would rather not have that as my only option - I would like to have the ability to enrich the log records with all of the relevant metadata (labels, annotations, namespace uuid) before they leave the node or cluster.

timothysc commented 7 years ago

So, by doing some sort of "join" to a "table" of K8s configuration + metadata which could be maintained separately (e.g. by a k8s configuration watcher).

That's exactly what I'm proposing.

would rather not have that as my only option - I would like to have the ability to enrich the log records with all of the relevant metadata (labels, annotations, namespace uuid) before they leave the node or cluster.

Could you explain why? The cost of doing this is the tradeoff here, which was also highlighted by @thockin's comment: "We certainly don't want to log all that with every message. Given that UUID is UU, can't that be squirrelled away in a lookup table somewhere else out of the fast path?"

I personally would rather put the burden of query<>join on ES than brute-force this.

richm commented 7 years ago

In general it means every consumer of the log data must know how to do this correlation. Even in the simple Elasticsearch case, you wouldn't be able to use tools such as curl, estail, kopf, kibana, etc. directly without modifying each and every one of them to perform this merge operation. In these cases, we need to have a way to present a merged view of the data. It may be possible to do that with some extensive Elasticsearch java plugin programming, I don't know (I do know I wouldn't want to support/maintain such a beast). Then we're really locked into Elasticsearch, and can't easily switch to another logging store. Or we write another tool that "tails" Elasticsearch indices for the raw logs and the config changes, and creates indices of merged records, essentially duplicating that data.

There are other use cases rather than collect logs -> store in warehouse -> view with web based UI tool

One example:

collect logs -> message queue -> big data analysis e.g. fluentd -> kafka -> spark/storm

That would mean fluentd would need to write the k8s metadata/configuration to kafka and denote it somehow so that spark/storm can then do the correlation. Or we would have to introduce more stages in the pipeline, like this:

fluentd -- (raw data + config data) --> kafka -- (raw data + config data) --> some tool that can merge the raw data with the config data --> kafka -- (merged data) --> apache spark/storm

fluentd writes the raw data to kafka, and (possibly the same fluentd? something else?) writes configuration/metadata change events to kafka. Some other tooling reads the raw data, correlates it with the config/md changes, merges them into single records, and writes them back out to kafka for consumption by other tools.

I get that pulling logs off of the nodes is in the fast path, and you want to minimize that processing as much as possible. And it is a tradeoff, between that speed, and the complexity of having to do that correlation elsewhere.

I suppose if we have to, we can live with our current solution to this problem, which is that there is no k8s metadata processing on each node - the daemonset node agents just ship the logs off as fast as possible. These node agents do not send the logs directly to Elasticsearch - instead, to a dc/rc set of fluentd "enrichers" which are responsible for doing the k8s meta enrichment. These can be controlled in numbers to reduce the load on the API server, and scaled up and down as needed to handle the load.

I would just rather have the option for the local node agent to send logs with all of the k8s metadata in each record, if possible.

yujuhong commented 7 years ago

Is it CRI-compatible? I dunno, I think it should be.

No, anything that relies solely on Docker features and bypasses the CRI API cannot be called "compatible".

We can discuss whether it makes sense for the CRI shim/runtime to decorate the logs with more metadata, but as @thockin pointed out in https://github.com/kubernetes/kubernetes/issues/42718#issuecomment-290305415, having this per entry seems to be a pretty big overhead.

yujuhong commented 7 years ago

I would just rather have the option for the local node agent to send logs with all of the k8s metadata in each record, if possible.

@richm, you can write your own fluentd-like daemonset to enrich the logs with metadata before leaving the node, and that fits with @crassirostris's proposal completely. Did I misunderstand your comment?

richm commented 7 years ago

@richm, you can write your own fluentd-like daemonset to enrich the logs with metadata before leaving the node, and that fits with @crassirostris's proposal completely. Did I misunderstand your comment?

We have implemented a fluentd-like daemonset (it is actually fluentd) that enriches the logs with metadata before leaving the node. fluentd uses the k8s metadata plugin to get the metadata from the API server to enrich the container logs.

crassirostris commented 7 years ago

@richm I understand your concerns about having to implement a pretty complex piece of logic for each consumer in order to enrich log entries with metadata in the logging agent. However, separating logs and metadata on the node seems to be a more flexible solution, because we want to be able to handle both the case where metadata is joined at the backend and the case where it's sent to the backend with each log entry.

We want to cooperate with the maintainers of https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter to make it compatible with the CRI and the logging integration proposal, so if you use it already, nothing will change. This whole proposal is more about adjusting existing tools for log exporting and creating guidelines for new ones, rather than something that will directly affect end users.

crassirostris commented 7 years ago

@vishh Thanks for the input! I will convert this to a proposal ASAP

FYI: kubernetes/community#306 talks about managing logs in the kubelet, which will require coordination between the kubelet and logging agents to guard against prematurely rotating logs.

Coordination between the kubelet and the logging agent is something yet to be discussed. If we decide to do something like this, it will definitely be a part of the logging integration guidelines.

I'd like to clarify that CRI is an internal Kubelet API. CRI might store logs under an internal directory too. It would be better to phrase these statements as "the kubelet will store logs under the /var/log/pods directory".

Yes, thanks for pointing that out. The guidelines should not be about CRI. However, leaving CRI out of the discussion about those guidelines completely is not entirely correct, because it's tightly coupled.

Kubelet should ideally receive the raw log stream from containerd and then add its own format, to avoid having to standardize its format in containerd. This is possible since containerd exposes logs via a FIFO based API.

I don't think that the kubelet should handle those, because it imposes performance problems, adds logic to the kubelet, and makes it a SPoF for logs. The shim seems to be a better place for handling logs.

You do not state which agent will be performing rotation. It is an important detail. For storage capacity isolation purposes, it is necessary that kubelet manages log rotation to keep long running pods active even if they log beyond their logging storage limits.

Yes, I don't state it, but the wording implies that some system component does it, not some log exporting tool. As to who should perform rotations, it's not clear; I don't think the kubelet is the best choice, for the reasons I mentioned before. Ideally, log rotations should be performed by the shim, but that's yet to be agreed upon.

Why should every logging agent in the world have to include a kubernetes client? This proposal states that logging data will be made available via a file based API. Why not make metadata also available via the same API? For example, why not dump the Pod object from kubelet periodically to provide metadata for each pod? This metadata file could be under /var/log/pod/<podUID>/.metadata for example.

I don't see many benefits:

  • You still need a watch to track metadata updates
  • If you dump a pod spec, you still need a kubernetes client to parse it
  • If you dump only a subset of metadata in a custom format, you need to support this custom format and change it if some metadata pieces should be added
  • More logic on kubelet

Such an API will lower the QPS on the apiserver and will make the logging agents work even in case of network outages that affect API server availability.

If implemented as a watch, that's not a significant load for the API server. Also, there will be an API for listing pods on the kubelet anyway at some point, removing all load from the API server. If the API server is unavailable, then pods are not scheduled on the node and metadata doesn't change. BTW, we cannot ensure with 100% certainty that the .metadata file will be present either; some error handling should be in place anyway.

yujuhong commented 7 years ago

We have implemented a fluentd-like daemonset (it is actually fluentd) that enriches the logs with metadata before leaving the node. fluentd uses the k8s metadata plugin to get the metadata from the API server to enrich the container logs.

That meets your requirement, doesn't it? I think that aligns with @crassirostris's proposal. Anyone can write their own daemonset to customize the metadata they need, and this issue will define the contract/agreement between the daemonset and the kubelet so that it's easy to write such a daemonset.

In general, I think k8s-related metadata (e.g., pod name/namespace) can be retrieved directly from the apiserver, or from a versioned kubelet endpoint (non-existent today). For machine-related information, fluentd can get it directly since it runs on the same node.

The problem is with container runtime or even lower-level metadata (e.g., pid), because the CRI abstraction sits above all these, so the kubelet cannot provide those. One possibility is to add a mechanism to allow the runtime to provide arbitrary key-value pairs as part of the metadata for the log, or even as part of the container status. I haven't put much thought into the specific mechanisms though.

vishh commented 7 years ago

@crassirostris

I don't think that Kubelet should handle those, because it imposes performance problems, adds logic to the kubelet and makes it SPoF for logs. Shim seems to be a better place for handling logs.

Sorry, I meant that the shim will handle writes and kubelet will handle rotation, long term storage, etc.

As to who should perform rotations, it's not clear, I don't think kubelet is a best choice, for the reasons I mentioned before. Ideally, log rotations should be performed by shim, but that's yet to be agreed upon.

For storage enforcement purposes, kubelet will have to perform rotations. Kubelet can delegate rotations to shims, but that's debatable. I'd like to understand the constraints a bit more before making that choice.

I don't see many benefits: You still need a watch to track metadata updates.

Yes. It will be a simple inotify instead of a full blown connection to the API server. I don't know how many languages k8s supports well as of now. For example an agent written in C might have difficulties accessing k8s API, but can trivially perform inotify and read a json blob.

If you dump a pod spec, you still need a kubernetes client to parse it

Not really. I feel implementing just a json/yaml parser is much simpler than having to deal with auth and transport. Moreover, logs are exposed via a file based API. Why not expose metadata the same way? Have you prototyped metadata integration into one or more logging agents to get an idea of the complexity of the various approaches being discussed here?

If you dump only a subset of metadata in a custom format, you need to support this custom format and change it if some metadata pieces should be added

I was suggesting dumping the entire pod object. We can only dump the spec which might mean less updates.

More logic on kubelet

Kubelet has to include more logic either in the form of a REST endpoint or a file based API.

Also there will be an API for listing pods on the kubelet anyway at some point, removing all load from the API server

What can we do today that's generic enough? GA API objects are backwards compatible. Isn't that a good enough API?

If API server is unavailable, then pods are not scheduled on the node and metadata doesn't change.

Fair point! Agents have to know that it is "OK" for the Api server to be offline.

BTW, we cannot ensure with 100% certainty that .metadata file will be present also, some error handling should be in place anyway

Why so?

crassirostris commented 7 years ago

@vishh Thanks!

For storage enforcement purposes, kubelet will have to perform rotations. Kubelet can delegate rotations to shims, but that's debatable. I'd like to understand the constraints a bit more before making that choice.

The shim is modifying the log file. If the kubelet also modifies the log file, you need some synchronization between the two, or you will end up with lost/corrupted data. That's one of the reasons behind https://github.com/kubernetes/kubernetes/issues/38495 and https://github.com/kubernetes/kubernetes/pull/40634. If only one entity modifies the log file, everything becomes much simpler and more robust.

Have you prototyped metadata integration into one or more logging agents to get an idea on the complexity of various approaches being discussed here?

Moreover, there's a solution that's used in production that queries the API server: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter It implements both direct querying and watching of the k8s API server, in 400 lines of code total, with 80-100 dedicated to retrieving metadata. Also, it's not fair to compare it to the file-based metadata, because the latter would also require changes in the kubelet, whereas the plugin above works right now.

Moreover logs are exposed via a file based API. Why not expose metadata the same way?

Because well-known log exporters don't work in this way (metadata separated from the log stream), there's no "file based API" for metadata; you have to implement it manually in each log exporter anyway.

Kubelet has to include more logic either in the form of a REST endpoint or a file based API.

No, the kubelet will have an API anyway and it's not required for log exporters to work, whereas dumping metadata to files can be avoided.

Also there will be an API for listing pods on the kubelet anyway at some point, removing all load from the API server

What can we do today that's generic enough? GA API objects are backwards compatible. Isn't that a good enough API?

The /pods endpoint on the kubelet is not versioned, so it's wrong to advertise using it.

BTW, we cannot ensure with 100% certainty that .metadata file will be present also, some error handling should be in place anyway

Why so?

Off the top of my head: the kubelet crashes before writing metadata to a file; the kubelet cleans logs up and deletes this file while the log exporter is reading them.

Yes. It will be a simple inotify instead of a full blown connection to the API server. I don't know how many languages k8s supports well as of now. For example an agent written in C might have difficulties accessing k8s API, but can trivially perform inotify and read a json blob.

Not really. I feel implementing just a json/yaml parser is much simpler than having to deal with auth and transport.

I agree that dumping metadata to a file seems like an easier solution at the moment. However, I don't think it greatly simplifies the solution, and it has some implications in the long term, like:

  • Pods' metadata lying around on the disk without any authorization
  • If we decide to implement dumping metadata on disk, we should stick with this interface and maintain it (and the whole code in kubelet around it) when it can be avoided. The best code is no code at all, isn't it?

vishh commented 7 years ago

The shim is modifying the log file. If the kubelet also modifies the log file, you need some synchronization between the two, or you will end up with lost/corrupted data. That's one of the reasons behind #38495 and #40634. If only one entity modifies the log file, everything becomes much simpler and more robust.

What if shims do not need to perform log rotation?

Moreover, there's a solution that's used in production that queries API server: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter It implements both direct querying and watching the k8s API server, 400 lines of code total, 80-100 dedicated to retrieving metadata. Also it's not fair to compare it to the file-based metadata, because the latter should also include changes in kubelet, whereas plugin above works right now.

Is this viable for all the agents out there? And what about auth? There is an open issue around restricting access to API objects from the node and kubelet might be restricted to only access the objects it needs to manage. Giving the same level of privileges to a logging agent is not ideal. With a file based API, logging agents need access to just the logging directory and do not need any other privileges.

Because well-known log exporters don't work in this way (metadata separated from the log stream), there's no "file based API" for metadata, you have to implement it manually in each log exporter anyway

Can you list a few examples? I have tried adapting filebeat logging agent in the past for k8s and it was pretty straightforward to inject metadata from the filesystem and filebeat works out of box with ELK stack.

/pods endpoint on the kubelet is not versioned so it's wrong to advertise using it

I meant re-using the pod object and not the /pods un-versioned REST endpoint.

On top of my mind: kubelet crashes before writing metadata to a file; kubelet cleans logs up and deletes this file while log exporter is reading them.

Kubelet can periodically update the metadata file even in case of no updates or it can always perform an update on startup and then stick to retries on update failures.

Pods' metadata lying around on the disk without any authorization

You do get filesystem based ACLs (uid, gid, etc). If for example all logs are owned by uid 0, then you will have to grant logging agents a specific privilege to access logs. How is pod metadata a concern when logging data is also stored in the same disk? Whatever ACLs we define for logs should apply to metadata too.

If we decide to implement dumping metadata on disk, we should stick with this interface and maintain it (and whole code in kubelet around it) when it can be avoided. The best code is no code at all, isn't it?

From an ecosystem standpoint, simplicity is key to success. A simple file based API for metadata that is co-located with logs is easier to discover, parse, and consume for logging agents written in many different languages than having to understand the Kubernetes API, its transport, etc. Most likely all that logging agents require are the pod name, UID, namespace, and labels. If we really care about privacy, we can restrict the metadata to only contain those fields.

It's OK to take on some additional logic into the system if it will simplify the workflow for many users/vendors.

One more thing to consider, watching the API server will not work if kubelet is run in standalone mode. At that point, you either need a kubelet REST endpoint (which would again require various language bindings), or a simple file based API that dumps v1.Pod objects.

crassirostris commented 7 years ago

@vishh By the way, @piosz noted that we also have monitoring and it also requires metadata. It actually makes sense to have a similar approach in both cases, especially considering that some agents can serve both metrics and logs.

What if shims do not need to perform log rotation?

Some entity has to perform log rotation. If it's not the shim, you get the synchronization problem; that's my point.

Is this viable for all the agents out there?

For some agents it might be less, for some more. But it's not super-complex and has to be implemented only once per agent.

And what about auth? There is an open issue around restricting access to API objects from the node and kubelet might be restricted to only access the objects it needs to manage. Giving the same level of privileges to a logging agent is not ideal. With a file based API, logging agents need access to just the logging directory and do not need any other privileges.

But at the same time accessing API narrows the attack surface. @kubernetes/sig-auth-misc Could you please give your input here? What's better from the security point of view: let logging agent query API server for metadata or dump metadata to the /var/log/pods directory?

Because well-known log exporters don't work in this way (metadata separated from the log stream), there's no "file based API" for metadata, you have to implement it manually in each log exporter anyway

Can you list a few examples? I have tried adapting filebeat logging agent in the past for k8s and it was pretty straightforward to inject metadata from the filesystem and filebeat works out of box with ELK stack.

According to the filebeat documentation, it doesn't support filter plugins. I couldn't find anything like that in the options; could you please clarify what you mean by "pretty straightforward" or point to a specific parameter?

Logstash does have a translate filter, though it doesn't watch a file; it polls it at some interval. Also, I'm not sure it's possible to set it up dynamically for each log file.

Syslog-ng supports enriching entries from an external file, but

Kubelet can periodically update the metadata file even in case of no updates or it can always perform an update on startup and then stick to retries on update failures.

The point is, there's no synchronization between the kubelet and the logging agent. Therefore we cannot guarantee that the metadata file will be there each time the logging agent wants to read it. Therefore the logging agent HAS to handle failures; that's the only thing I wanted to point out.

You do get filesystem based ACLs (uid, gid, etc). If for example all logs are owned by uid 0, then you will have to grant logging agents a specific privilege to access logs. How is pod metadata a concern when logging data is also stored in the same disk? Whatever ACLs we define for logs should apply to metadata too.

I'm not saying it's impossible, I'm saying it's an attack surface that has to be protected.

From a ecosystem standpoint, simplicity is key to success. A simple file based API for metadata that is co-located with logs is easy to discover, parse and consume for logging agents written in many different languages than having to understand kubernetes APIs, it's transport, etc. Most likely all that logging agents require are pod name, UID, namespace, and labels. If we really care about privacy, we can restrict metadata to only contain those fields.

Simplicity for the end user of one aspect of the system is not the same as simplicity of the whole system. By making it easier (and not by that much) for the logging agent, you make life harder for at least the kubelet and security teams. Making it slightly more sophisticated for the logging agent allows the overall design to be simpler.

Remember how initially the metadata was encoded in the log file path (still is)? That's the easiest interface possible, truly supported by most of the log exporters/aggregators. But at what cost in the end?

It's OK to take on some additional logic into the system if it will simplify the workflow for many users/vendors.

The question is about the right balance.

One more thing to consider, watching the API server will not work if kubelet is run in standalone mode. At that point, you either need a kubelet REST endpoint (which would again require various language bindings), or a simple file based API that dumps v1.Pod objects.

Do we actually care about this use case in k8s? I'm not saying we don't, I just don't know.

vishh commented 7 years ago

@vishh By the way, @piosz noted that we also have monitoring and it also requires metadata. It actually makes sense to have a similar approach in both cases, especially considering that some agents can serve both metrics and logs.

Yup. Having metadata in a file would help monitoring systems too.

Some entity has to perform log rotation. Not shim -> synchronization problem, that's my point.

Sure! I'm saying that there is a strong reason for having kubelet deal with rotation and not having runtimes perform rotation. I hope there is no disagreement on that.

According to the filebeat documentation, it doesn't support filter plugins. I couldn't find anything like that in the options, could you please clarify what you mean by "pretty straightforward" or point to a specific parameter?

Here is a partial patch that I started working on in the past to get filebeat to work for k8s and GCL - https://github.com/elastic/beats/compare/master...vishh:k8s-docker-logs?expand=1 In that logic, it will be straightforward to parse a json/yaml blob and convert fields to metadata.

As for other logging agents, they might support this out of the box, but I suspect we will see more logging integrations show up once the API is documented and easy to use, similar to how heapster triggered multiple monitoring integrations.

Point is, there's no synchronization between kubelet and logging agent. Therefore we cannot guarantee that metadata file will be there each time logging agent wants to read it. Therefore logging agent HAS to handle failures, that's the only thing I wanted to point out.

I do agree that the logging agent has to be resilient to errors, although I'm not seeing how that would influence the API.

I'm not saying it's impossible, I'm saying it's an attack surface that has to be protected.

Definitely. As I said, having access to logs is potentially more serious than just metadata (labels primarily).

Simplicity for the end user of one aspect of the system is not the same as the simplicity of the whole system. Making it easier (not that much even) for the logging agent, you make life harder for at least kubelet and security teams. Making it slightly more sophisticated for the logging agent allows to make the overall design simpler.

I guess it depends on the complexity and cost. I'm still failing to see concrete reasons as to why having kubelet dump a part of v1.Pod is incurring complexity. The node is doing far more complex things to simplify user experience.

Remember how initially the metadata was encoded in the log file path (still is)? That's the easiest interface possible, truly supported by most of the log exporters/aggregators. But at what cost in the end?

That was a decision made around v1.0 when the system needed to just work. Extensibility, robustness, reliability, etc. were not requirements back then.

Do we actually care about this usecase in k8s? I'm not saying we're not, I just don't know

AFAIK It's not officially defined. Perhaps @dchen1107 might know the current status on that feature.

piosz commented 7 years ago

Yup. Having metadata in a file would help monitoring systems too.

I think it's very inconvenient for monitoring agents/components to read metadata from files.

crassirostris commented 7 years ago

@vishh Again, thanks for your comments!

Yup. Having metadata in a file would help monitoring systems too.

Then the monitoring agent has to have access to log directories on the host. That's possible, ofc, but counter-intuitive and even less secure. Absolutely agree with @piosz, that's inconvenient.

Sure! I'm saying that there is a strong reason for having kubelet deal with rotation and not having runtimes perform rotation. I hope there is no disagreement on that.

Whoa, sorry, I think we're talking about different things. I'm talking about something like the containerd shim (https://github.com/containerd/containerd/blob/master/design/lifecycle.md#shim), which is part of the container runtime. I don't see a strong reason for having the kubelet deal with rotation; I see exactly the opposite: there's a reason for having the kubelet drive requirements for the rotation but having the runtime handle the actual rotations (e.g., via a max log file size parameter).

Here is a partial patch that I started working on in the past to get filebeat to work for k8s and GCL - https://github.com/elastic/beats/compare/master...vishh:k8s-docker-logs?expand=1 In that logic, it will be straightforward to parse a json/yaml blob and convert fields to metadata.

I really doubt the beats team will add k8s support upstream; filebeat is a generic log parser, not associated with any product outside of the Elastic infra. And anyway, ofc it would be straightforward, you implemented this in code :) Once you start writing code, it's not a straightforward integration anymore. If you have to write code anyway, querying the API server is not significantly harder than watching a file.

I do agree that the logging agent has to be resilient to errors, although I;m not seeing how that would influence the API.

Sorry for the confusion. You said that dumping metadata to a file can help to avoid network outages and I wanted to say that it's true, but we will still have some outages and errors in any solution.

I'm not saying it's impossible, I'm saying it's an attack surface that has to be protected.

Definitely. As I said having access to logs is potentially more serious that just metadata (labels primarily).

I absolutely agree! Therefore we cannot allow any entity except a logging agent to access logs. Therefore we would need a different set of permissions for the metadata and for the logs (for the monitoring agent, right?), which makes the whole system even more complicated.

I guess it depends on the complexity and cost. I'm still failing to see concrete reasons as to why having kubelet dump a part of v1.Pod is incurring complexity. The node is doing far more complex things to simplify user experience.

  • Because it can be avoided without significant changes
  • Because it creates separate pipelines for different components doing exactly the same thing. Imagine a logging agent in a sidecar container: it shouldn't have access to the host filesystem, but it needs metadata anyway. Imagine a system where logs are enriched not on the node where they're produced (a quite common setup with logstash and fluentd); the aggregator just doesn't have the metadata in files on the node
  • Because this simplification is insignificant. How does it simplify the overall process of integration with Kubernetes? It doesn't, you still have to write a plugin/extension for a logging agent. How does it change the code? 50 fewer lines? 100? Once per logging agent? IMHO, that's not something worth introducing changes in the kubelet for.

That was a decision made around v1.0 when the system needed to just work. Extensibility, robustness, reliability, etc. were not requirements back then.

Exactly! We want to make things right now. IMHO, simplicity and unification are important things on the way to extensibility, robustness, reliability, etc.

AFAIK It's not officially defined. Perhaps @dchen1107 might know the current status on that feature.

@dchen1107, I would really appreciate your feedback on this point. If it's something that we have to consider, that would be important to know now. If it's not something that will be addressed soon, maybe it can wait until kubelet has API?

vishh commented 7 years ago

I feel we have reached the limits of GitHub with this conversation. I will set up a meeting to discuss this further. I primarily care about storage isolation and I was only trying to provide some direction based on my past experience around metadata.


thockin commented 7 years ago

I can't follow the thread - please include me.


davidopp commented 7 years ago

Can someone summarize the outcome of the meeting that presumably took place?

crassirostris commented 7 years ago

@davidopp The meeting hasn't taken place yet; I was OOO this week. I will post the outcomes once the meeting happens.

coffeepac commented 7 years ago

@crassirostris if possible I would also like to be included. If too much hassle, no worries.

stevesloka commented 7 years ago

I have an interest in this as well, just posted some questions to Slack before finding this. Please include me in a meeting. Thanks!

timothysc commented 7 years ago

/cc @kensimon