hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Feature Idea: Better Support for 3rd party logging solutions. #9211

Open the-maldridge opened 3 years ago

the-maldridge commented 3 years ago

Logs are something that should just be automagic in a clustered environment. Teams shouldn't need to think about them and they should be managed by the infrastructure. Supporting these systems also shouldn't break nomad's built-in log management, so changing things like the global docker logging configuration isn't really a solution.

I propose a patch to logmon to allow it to dupe the log streams to a second well-known location. This location should name the files something like {{.NodeName}}-{{.Job}}-{{.Group}}-{{.Task}}-{{.AllocID}}. Bonus points for making this a templatable string!
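The naming scheme above already reads like a Go `text/template` string, so here is a minimal sketch of how a templatable file name might be rendered. The `LogFileInfo` type, the `renderLogName` function, and the field set are hypothetical and only illustrate the idea; nothing here is taken from logmon itself.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// LogFileInfo is a hypothetical set of fields exposed to the naming template.
type LogFileInfo struct {
	NodeName string
	Job      string
	Group    string
	Task     string
	AllocID  string
}

// defaultPattern mirrors the naming scheme suggested in the proposal.
const defaultPattern = "{{.NodeName}}-{{.Job}}-{{.Group}}-{{.Task}}-{{.AllocID}}"

// renderLogName expands an operator-supplied pattern for one task.
// Unknown fields in the pattern surface as an error from Execute.
func renderLogName(pattern string, info LogFileInfo) (string, error) {
	tmpl, err := template.New("logname").Parse(pattern)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, info); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	name, err := renderLogName(defaultPattern, LogFileInfo{
		NodeName: "node1",
		Job:      "web",
		Group:    "frontend",
		Task:     "nginx",
		AllocID:  "8a1b2c3d",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(name)
}
```

A pattern that refers to a field the struct does not have fails at render time, which would let a misconfigured template be rejected before any log file is created.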

I don't think this is a particularly complex feature to support, but it would open up running log management solutions like loki as one-click options within nomad. More importantly, it would generally allow other log indexing to work without breaking the built-in logs commands.

apollo13 commented 3 years ago

This would indeed be a very welcome addition. I do not care if it is a simple duplication via files or a plugin (?) which gets a copy of all the logs and can do whatever it wants.

If it is/were a fd duplication there are probably a few more things to consider:

A plugin would also be nice, but the question is what the interface would look like and how nomad would start it. Similar to CSI plugins, or via configuration in the config files (i.e. just specify a socket endpoint)?

the-maldridge commented 3 years ago

I was trying to keep this as simple as possible, so I assumed the retention would be the same as the other copies of the log files. If a task needs to slurp them, it needs to use inotify or some other means to notice them.

apollo13 commented 3 years ago

To be honest I do not know the current retention times. But if it survives alloc shutdown + x minutes that would be fine for most systems.

towe75 commented 3 years ago

I stumbled upon this while working on podman driver logging options. What would you think about an external service to stream the logs from nomad to e.g. a loki server or ELK stack? Nowadays we have the nomad event stream, so we can easily learn about started/stopped allocations and simply stream and transform the logs for each alloc.

IMHO an external service is a better choice here because nomad's API allows for a good integration and the log streamer can grow independently. Job metadata could be used to enrich the logs with custom fields/categories and also to select or filter allocs eligible for log forwarding.
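As a rough sketch of the metadata idea, an external streamer could read a job's `meta` map, use one key as an opt-in switch, and turn prefixed keys into extra log fields. The key name `log_forward` and the prefix `log_label_` are invented for this example and are not Nomad conventions.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	optInKey    = "log_forward" // hypothetical opt-in key in the job's meta block
	labelPrefix = "log_label_"  // hypothetical prefix for custom log fields
)

// forwardingLabels reports whether a job opted in to log forwarding and,
// if so, which custom fields should be attached to its log lines.
func forwardingLabels(meta map[string]string) (map[string]string, bool) {
	if meta[optInKey] != "true" {
		return nil, false
	}
	labels := make(map[string]string)
	for k, v := range meta {
		if strings.HasPrefix(k, labelPrefix) {
			labels[strings.TrimPrefix(k, labelPrefix)] = v
		}
	}
	return labels, true
}

func main() {
	meta := map[string]string{
		"log_forward":    "true",
		"log_label_team": "payments",
		"owner":          "someone", // ignored: no label prefix
	}
	labels, ok := forwardingLabels(meta)
	fmt.Println(ok, labels)
}
```

Keeping the selection logic in the streamer rather than in Nomad means the filter rules can change without touching the cluster configuration.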

Any opinions? Do you think it would be useful?

sofixa commented 3 years ago

@towe75 a few already exist, like this one https://github.com/sas1024/nomad_follower

the-maldridge commented 3 years ago

@towe75 I think the idea you suggest of using the event stream is a pretty good one. I'd want to have it as something that could be run as a system level task across all hosts though, since the goal here was to have something that would make cluster wide logging automatic.

towe75 commented 3 years ago

@sofixa : thank you, I will have a look at it. @the-maldridge : yes, sure. But conceptually it does not really matter, and it can be an event filter by node-id, for example. Performance-wise, however, it's surely a good idea to break log shipping into several sinks. I will see if I can come up with a POC if my time allows it.
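A per-node filter of this kind could be sketched as below. It assumes each line of the event stream is a JSON object with an `Events` array whose allocation events carry `Payload.Allocation.ID` and `Payload.Allocation.NodeID`; the struct definitions and the `localAllocs` helper are illustrative and decode only the fields the filter needs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// frame and event model only the parts of a stream line this sketch reads.
type frame struct {
	Index  uint64
	Events []event
}

type event struct {
	Topic   string
	Type    string
	Payload struct {
		Allocation struct {
			ID     string
			NodeID string
		}
	}
}

// localAllocs returns the IDs of allocations in one stream line that are
// placed on nodeID. An empty heartbeat object yields no IDs.
func localAllocs(line []byte, nodeID string) ([]string, error) {
	var f frame
	if err := json.Unmarshal(line, &f); err != nil {
		return nil, err
	}
	var ids []string
	for _, e := range f.Events {
		if e.Topic == "Allocation" && e.Payload.Allocation.NodeID == nodeID {
			ids = append(ids, e.Payload.Allocation.ID)
		}
	}
	return ids, nil
}

func main() {
	line := []byte(`{"Index":5,"Events":[{"Topic":"Allocation","Type":"AllocationUpdated","Payload":{"Allocation":{"ID":"a1","NodeID":"n1"}}}]}`)
	ids, err := localAllocs(line, "n1")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(ids)
}
```

Run as a system job, each instance would pass its own node ID, so every node ships only the logs of its local allocations.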

apollo13 commented 3 years ago

I wonder if that wouldn't overwhelm the event stream. Logs are probably even noisier than what the event stream usually transports.

towe75 commented 3 years ago

@apollo13 To clarify: the eventstream itself does not provide the logs. It only helps to track allocation startup/teardown. The regular allocfs api can then stream the logs.
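The split described here, lifecycle from the event stream and log bytes from the allocation filesystem API, might look roughly like the following. The `decide` and `logsURL` helpers are hypothetical; the sketch assumes the allocation's `ClientStatus` value drives the decision and that logs are read from `/v1/client/fs/logs/<alloc_id>` with the `task`, `type`, `follow`, and `plain` query parameters.

```go
package main

import (
	"fmt"
	"net/url"
)

type action int

const (
	ignore    action = iota // nothing to do yet (for example, still pending)
	startTail               // allocation is running: begin following its logs
	stopTail                // allocation reached a terminal state: stop following
)

// decide maps an allocation's client status to a tailing action.
func decide(clientStatus string) action {
	switch clientStatus {
	case "running":
		return startTail
	case "complete", "failed", "lost":
		return stopTail
	default:
		return ignore
	}
}

// logsURL builds the request URL for following one task's log stream.
// stream is "stdout" or "stderr".
func logsURL(addr, allocID, task, stream string) string {
	q := url.Values{}
	q.Set("task", task)
	q.Set("type", stream)
	q.Set("follow", "true")
	q.Set("plain", "true")
	return addr + "/v1/client/fs/logs/" + url.PathEscape(allocID) + "?" + q.Encode()
}

func main() {
	if decide("running") == startTail {
		fmt.Println(logsURL("http://127.0.0.1:4646", "8a1b2c3d", "nginx", "stdout"))
	}
}
```

Only small status updates travel over the event stream in this design; the log volume itself goes through one long-lived HTTP request per task and stream.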

It might, however, also be interesting to treat parts of the eventstream as a structured log on its own. Loki and other log aggregators can cope pretty well with any structured data, not just logs. So the log viewer could show internal messages from the eventstream (e.g. alloc xyz started) followed by the actual logs and finally some alloc teardown event. This is surely nice for batch jobs.

apollo13 commented 3 years ago

> @apollo13 To clarify: the eventstream itself does not provide the logs. It only helps to track allocation startup/teardown. The regular allocfs api can then stream the logs.

Thanks, I indeed missed that part.

tgross commented 1 year ago

Some more context and feature ideas: https://github.com/hashicorp/nomad/issues/17366

apollo13 commented 1 year ago

@tgross Did you ever make some progress here aside from the WIP branch (https://github.com/hashicorp/nomad/blob/4667539b31710b0243f30719db3b88e7a7e83b98/plugins/logging/README.md) in 2022?

tgross commented 1 year ago

That experiment got turned into an internal design document (RFC) that's spurred some good discussion but it hasn't quite made it "over the line" when planning releases yet.