grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Hashicorp's Nomad discovery for Promtail #5464

Closed: m1keil closed this issue 2 years ago

m1keil commented 2 years ago

We would like to scrape Nomad's scheduled workloads with Promtail.

Describe the solution you'd like

Add a new scrape_configs option, nomad_sd_config, similar to the existing kubernetes_sd_config. This would enable the Promtail agent to query the Nomad client's REST API for information about the workloads currently running on the host.

Describe alternatives you've considered

  1. Use/write a custom service that continuously scrapes Nomad API and generates a config file that is suitable for static_config (example)
  2. Use Consul discovery (our current choice).
  3. Use the log shipper pattern and run promtail as a sidecar per service. Use static_config again to pinpoint the details of the log files.
  4. Using the docker logging driver.
  5. Using a different log shipper.
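
As a rough illustration of alternative 1, here is a minimal sketch of the generator half of such a service. It assumes the allocation records look like the entries returned by Nomad's allocation-listing API (fields `ID`, `JobID`, `TaskGroup`, `ClientStatus`, `TaskStates`), and that task logs live under `<data_dir>/alloc/<alloc_id>/alloc/logs/`. The `data_dir` default, the `nomad_*` label names, and the function names are hypothetical choices for this example, not anything Promtail or Nomad defines. The output is JSON, which is also valid YAML, so no third-party YAML library is needed.

```python
import json
from pathlib import PurePosixPath


def build_static_configs(allocations, data_dir="/opt/nomad/data"):
    """Turn Nomad allocation records into Promtail static_configs entries.

    `data_dir` is an assumed Nomad client data directory; adjust as needed.
    """
    configs = []
    for alloc in allocations:
        # Skip allocations that are not running; their logs are no longer growing.
        if alloc.get("ClientStatus") != "running":
            continue
        log_dir = PurePosixPath(data_dir) / "alloc" / alloc["ID"] / "alloc" / "logs"
        # TaskStates maps task name -> state; only the names are needed here.
        for task in sorted(alloc.get("TaskStates") or {}):
            configs.append({
                "targets": ["localhost"],
                "labels": {
                    # Hypothetical label names carrying Nomad metadata.
                    "nomad_job": alloc["JobID"],
                    "nomad_group": alloc["TaskGroup"],
                    "nomad_task": task,
                    "nomad_alloc_id": alloc["ID"],
                    # Glob intended to match <task>.stdout.N and <task>.stderr.N.
                    "__path__": str(log_dir / f"{task}.std*"),
                },
            })
    return configs


def render_scrape_config(allocations, job_name="nomad", data_dir="/opt/nomad/data"):
    """Render a scrape_configs document as JSON (a subset of YAML)."""
    doc = {
        "scrape_configs": [
            {
                "job_name": job_name,
                "static_configs": build_static_configs(allocations, data_dir),
            }
        ]
    }
    return json.dumps(doc, indent=2)
```

A real service would still need to poll the Nomad API, write the rendered output into Promtail's configuration, and get Promtail to pick up the change, which is part of what makes this alternative unattractive.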

Additional context

Using Consul discovery is our current way of doing this, but it does introduce a number of shortcomings:

  1. Every service that we want to collect logs from is required to be registered into Consul catalogue.
  2. This doesn't work well for short-lived workloads, because their services get missed by discovery.
  3. Limited metadata is passed between Nomad and Consul by default, which prevents promtail from accessing more interesting and useful metadata.

trevorwhitney commented 2 years ago

Hey @m1keil, this has come up before, and might be a good thing to pursue. I'm curious about a few things first though.

First, out of curiosity, did you know Promtail has a consul agent service discovery mechanism as well? This uses the API of the agent co-located on the same node as promtail. This might alleviate some of the issues you're having with the Consul catalogue API, though services still need to be registered with the agent. Is it possible to have a Nomad deployment that's not registered with consul?

Second, do you know what the current gap is between Consul and Nomad? What is some of the metadata you'd like to have that you're missing out on?

Finally, how does Nomad handle short-lived services? Do those not get registered with the consul agent? I'm curious to learn a bit more about how using the Nomad API would avoid missing these short-lived processes.

m1keil commented 2 years ago

Yes, we are currently using the consulagent SD.

Is it possible to have a Nomad deployment that's not registered with consul?

Nomad has good integration with Consul but it's not automatic. Each Nomad client will register to Consul automatically. However, any workloads that you run must register themselves via the service{} stanza of the Nomad job. Some services might opt out from the registration if service discovery isn't required for them.

What is some of the metadata you'd like to have that you're missing out on?

Primarily Nomad's metadata. With the current integration, you can get Nomad's task name and allocation ID. But you won't be able to reliably get the job name or the group's name, or access Nomad's meta{} data.

Meta{} is an interesting one in itself. Nomad includes its own metadata definition you can define on different levels (Job/Group/Task). It doesn't get passed automatically to Consul's service meta. It's an entirely different thing.

Is it possible to work around this? Yes. Is it kinda ugly? I think so :\

Finally, how does Nomad handle short-lived services?

I think the problem is that Consulagent doesn't pick up the service in time. For example, I have a small backup batch job that runs for 15 seconds every few hours. Even though the service registers in Consul, it seems like promtail doesn't detect it. I took a quick peek at the code and, from what I understand, promtail is supposed to use blocking queries, so theoretically this should be detected. In practice, though, it seems like something is missing.
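
For readers unfamiliar with the blocking queries mentioned above, here is a minimal sketch of the general pattern, assuming Consul's documented HTTP behavior: the client passes the last seen index as `?index=` plus a `wait` duration, and the response carries the new index in the `X-Consul-Index` header. This is not Promtail's implementation. The catalog endpoint is used purely for illustration, and `http_fetch`, `watch`, the default address, and the `rounds` parameter are hypothetical names and values chosen for this example.

```python
import json
import urllib.request


def http_fetch(index, base_url="http://127.0.0.1:8500", wait="30s"):
    """Issue one blocking query; return (new_index, services dict).

    base_url is an assumed local Consul agent address.
    """
    url = f"{base_url}/v1/catalog/services?index={index}&wait={wait}"
    # Client timeout is kept above the server-side wait so the call can return.
    with urllib.request.urlopen(url, timeout=60) as resp:
        new_index = int(resp.headers["X-Consul-Index"])
        return new_index, json.load(resp)


def watch(fetch, on_change, rounds):
    """Run `rounds` blocking queries, reporting added/removed service names."""
    index, known = 0, set()
    for _ in range(rounds):
        new_index, services = fetch(index)
        if new_index < index:
            # Index went backwards: start over from 0 on the next round.
            index = 0
            continue
        if new_index == index:
            # The wait expired without any change.
            continue
        index = new_index
        current = set(services)  # keys of the response are service names
        added, removed = current - known, known - current
        if added or removed:
            on_change(added, removed)
        known = current
    return known
```

Note that each response is a snapshot of current state rather than a stream of events. If a service both registers and deregisters between two consecutive calls to `fetch`, a loop like this never sees it. Whether that is what happens with the 15-second batch job described above is not established here.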

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

nahsi commented 2 years ago

I'm interested in this too. Currently I use docker_sd_config - example usage can be seen in this issue. But it doesn't work well with Nomad bridge networking, and obviously it doesn't work with drivers other than docker, such as exec.

trevorwhitney commented 2 years ago

Unfortunately, at this time we don't have enough Nomad usage to really push this one up the priority list. If there is someone from the community running on Nomad, a PR for this service discovery would be greatly appreciated, either here with the intention of upstreaming to Prometheus, or to Prometheus directly.

m1keil commented 2 years ago

Thanks @trevorwhitney! I will check with Prometheus folks if there is an interest.

AAverin commented 11 months ago

Any plans to reopen this? Prometheus does support nomad_sd_configs already, would be great to get the same for promtail.

AAverin commented 11 months ago

@m1keil did you make any progress asking the Prometheus devs? I am at the point where everything in my cluster works in pure Nomad except the promtail + loki combination. I don't really want to bring in Consul just because of that.