prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Deduping Identical targets #884

Closed: nikgrok closed this issue 8 years ago

nikgrok commented 9 years ago

Hey guys,

Today, if you are using service discovery (Consul), querying two services in one job will return duplicate scrapes. If possible, could you make scrapes with the same address within the same job unique?

fabxc commented 9 years ago

My stance on this has always been that duplicate target configuration has to be resolved by the user.

If targets are duplicated with different static labels, Prometheus cannot decide for you which one is "the right one". If you configure your targets or SD within one job, the targets carry distinct information about how they were retrieved. This information is used to preserve target states across reloads. It would be possible to check that no relabeling rules modify targets based on the SD information and then dedupe targets based on that, but this would add a lot of implementation complexity.

Could you elaborate why the very same target shows up in two distinct services for you? This doesn't seem like a desirable configuration to me.
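
A minimal sketch of the ambiguity described above, using today's static_configs syntax with a made-up role label (not taken from the thread): the same address is configured twice with conflicting label sets, and Prometheus has no way to decide which one is "the right one".

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['srv2:9100']
        labels:
          role: 'datanode'
      - targets: ['srv2:9100']
        labels:
          role: 'tasktracker'
    # srv2:9100 now appears twice in one job, once with role="datanode"
    # and once with role="tasktracker"; dropping either label set would
    # be an arbitrary choice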

brian-brazil commented 9 years ago

If at the end of all the SD and processing, there are completely identical duplicate targets among all jobs it seems reasonable to me to scrape only one of them. By identical I mean all labels, scheme, basic auth, interval, metrics path and anything else. Metric relabelling would be the one wrinkle to this.

fabxc commented 9 years ago

"seems reasonable" is not good enough given how invasive this change would be. Only with very good reasons where this problem occurs with sane SD setups.

nikgrok commented 9 years ago

Our use case is as follows. We'd like to monitor Hadoop machines. We have a service for the TaskTracker and the DataNode, as well as the NameNode. If I set up node exporter scraping for all machines, I have no way of avoiding repeats, since some machines run both a TaskTracker and a DataNode while others run only a TaskTracker. Today, I have no way of uniquely monitoring all Hadoop servers because of this. The only other option is to include both services, which almost doubles my number of scrapes.

A simple merge would fix it completely.
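
A hedged sketch of the kind of configuration that produces the duplicates; the service names, the Consul address, and the relabeling rule that points every discovered machine at its node exporter are illustrative, not taken from the original report:

scrape_configs:
  - job_name: 'hadoop-nodes'
    consul_sd_configs:
      - server:   'localhost:8500'
        services: ['hadoop-datanode', 'hadoop-tasktracker']
    relabel_configs:
      # illustrative: scrape the node exporter on every discovered machine
      - source_labels: [__meta_consul_address]
        target_label:  __address__
        replacement:   '${1}:9100'
    # a machine registered for both services (srv2, srv3) is discovered
    # twice within this one job and therefore scraped twice

With such a setup, deduplicating identical targets inside the job is the only way to avoid the double scrape, short of maintaining a third, merged service outside Prometheus.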

fabxc commented 9 years ago

So is this about running the node exporter? If so, exporting node metrics has no direct relation to Hadoop running on them. Node exporters should probably have their own service. I assume you want to know that these nodes belong to your Hadoop cluster, which Consul tags and relabeling can solve for you.

In my opinion it doesn't boil down to a "simple merge", unfortunately. But PRs are always welcome – a place to start would probably be to detach the scraping from the targets.
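
A sketch of the separate-service-plus-tags approach suggested above, assuming a dedicated node-exporter Consul service and a made-up "hadoop" tag and cluster label:

scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server:   'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      # illustrative: targets whose Consul service carries a "hadoop" tag
      # get a cluster="hadoop" label
      - source_labels: [__meta_consul_tags]
        regex:         '.*,hadoop,.*'
        target_label:  cluster
        replacement:   'hadoop'

Each machine is then scraped exactly once for node metrics, and its Hadoop membership is expressed as a label rather than as a second scrape.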

matthiasr commented 9 years ago

I think this is a valid request. At SC we don't encounter this because our cookbook forces us to work around it anyway, by creating a single service discovery entry for every possible group we want to scrape.

But I do see the need to "get the node-level statistics from all nodes of groups A and B" where A and B may or may not overlap. It is currently impossible to do this without either duplicating the intersection, or creating another entry for the overlap, which is nothing other than de-duplicating outside of Prometheus.

To be clear, I think scraping the same instance from several jobs should not be merged, but if I specify

dns_sd_configs:
  - names: ['some_target.some_domain']
  - names: ['other_target.other_domain']

and one endpoint is listed in both SRV records, it should only be scraped and recorded once. The same goes for other SD methods.

matthiasr commented 9 years ago

(actually, why isn't this deduplicating at the storage layer?)

brian-brazil commented 9 years ago

> (actually, why isn't this deduplicating at the storage layer?)

Getting two scrapes of the same target at different time offsets is going to be confusing, and can't be de-duplicated at the storage layer due to the different timestamps.

matthiasr commented 9 years ago

But it should end up as one time series with double the sampling rate?

@nikgrok could you describe how the duplication manifests for you?

brian-brazil commented 9 years ago

> But it should end up as one time series with double the sampling rate?

Yes. Probably interesting interactions with #394 too.

nikgrok commented 9 years ago

Hey guys, trying to restart this discussion. @matthiasr, basically, with the Consul config today, a machine that is in two services will be scraped twice.

Example:

Hadoop DataNode: srv1, srv2, srv3
Hadoop TaskTracker: srv2, srv3

Total scraped machines: srv1:9100, srv2:9100, srv3:9100, srv2:9100, srv3:9100

beorn7 commented 9 years ago

I guess the discussion ended at "but why is that so bad?". @fabxc can tell us how invasive the de-duping would be.

fabxc commented 9 years ago

Thanks to an in-person discussion with @matthiasr I better understand why the scenario in general is reasonable.

The question indeed remains "but why is that so bad?". There is an increased sampling rate but no race or anything like that. Is the impact on the storage relevant? It's certainly a brain-itch that should be fixed eventually, but not one an end user would usually see.

Scraping would essentially have to be detached from targets. This would solve the double scraping but the target would still be discovered multiple times. If the latter is the actual thing that's annoying people, there are easier ways to fix that at the UI level.

@nikgrok It would be good to know which part actually causes frustration for you.

nikgrok commented 9 years ago

I imagine multiple endpoints for storage will be created; in the case of two services with 90% overlap, you are increasing the space required by 90%. Am I wrong about that?

Also, it means that the collectors will be scraping 90% more, and we already plan on really hammering Prometheus. (Our Hadoop infrastructure is already 3.5k machines.)

fabxc commented 9 years ago

That sounds like a very legitimate use case. I'll try to get to this sooner rather than later.

grobie commented 9 years ago

Why do you use the consul results for the tasktracker/datanode to discover node exporters? If you register a separate service for the node exporter, it would only be scraped once. Using your example, shouldn't it be:

Hadoop Datanode
srv1:PORTDATA srv2:PORTDATA srv3:PORTDATA

Hadoop TaskTracker
srv2:PORTTASK srv3:PORTTASK

Node exporters
srv1:9100 srv2:9100 srv3:9100

Total scraped machines
srv1:PORTDATA
srv1:9100
srv2:PORTDATA
srv2:PORTTASK
srv2:9100
srv3:PORTDATA
srv3:PORTTASK
srv3:9100

EDIT: I understood @matthiasr's comment now. If it's about having a common label everywhere, an additional label could be applied to all three (datanode, tasktracker, node exporter) jobs?
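
A sketch of that common-label idea, with made-up job, service, and label names; the same relabeling rule would be repeated in the tasktracker and node exporter jobs:

scrape_configs:
  - job_name: 'hadoop-datanode'
    consul_sd_configs:
      - server:   'localhost:8500'
        services: ['hadoop-datanode']
    relabel_configs:
      # illustrative: stamp every target of this job with the same label
      - target_label: cluster
        replacement:  hadoop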

fabxc commented 9 years ago

@nikgrok #1064 implements this. You might want to give it a try.

fabxc commented 8 years ago

Fixed in 0.18.0rc1

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.