Closed: nikgrok closed this issue 8 years ago.
My stance on this has always been that duplicate target configuration has to be resolved by the user.
If duplicate targets carry different static labels, Prometheus cannot decide for you which one is "the right one". If you configure your targets or SD within one job, they have distinct information about how they were retrieved. This information is used to preserve target state across reloads. It would be possible to check whether no relabeling rules modify targets based on the SD information and then dedupe targets on that basis. This would involve a lot of implementation complexity.
Could you elaborate why the very same target shows up in two distinct services for you? This doesn't seem like a desirable configuration to me.
If, at the end of all the SD and processing, there are completely identical duplicate targets among all jobs, it seems reasonable to me to scrape only one of them. By identical I mean all labels, scheme, basic auth, interval, metrics path, and anything else. Metric relabeling would be the one wrinkle to this.
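By that definition, de-duplication would key each target on its full scrape identity. A minimal sketch of the idea (not Prometheus internals; the field names here are purely illustrative):

```python
# Hypothetical sketch: keep only one target per distinct scrape identity.
# Field names are illustrative, not actual Prometheus structures.
def scrape_identity(target):
    """Key a target by everything that affects how it is scraped."""
    return (
        tuple(sorted(target["labels"].items())),  # all labels, incl. instance
        target["scheme"],
        target["metrics_path"],
        target["interval"],
        target.get("basic_auth"),
    )

def dedupe(targets):
    """Keep the first target seen for each distinct scrape identity."""
    seen = set()
    unique = []
    for t in targets:
        key = scrape_identity(t)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

Two targets that differ in any label, scheme, path, interval, or auth would still both be kept, which matches the "completely identical" condition above.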
"seems reasonable" is not good enough given how invasive this change would be. Only with very good reasons where this problem occurs with sane SD setups.
Our use case is as follows. We'd like to monitor Hadoop machines. We have Consul services for the TaskTracker, the DataNode, and the NameNode. If I set up node exporter scraping across all machines through those services, I have no way to avoid repeats, since some machines run both a TaskTracker and a DataNode while others run only a TaskTracker. Today I have no way to monitor all Hadoop servers exactly once. The only other option is to scrape both services, almost doubling my number of scrapes.
A simple merge would fix it completely.
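Concretely, the overlap arises from a config along these lines (a sketch with illustrative service names and addresses, not our exact setup):

```yaml
scrape_configs:
  - job_name: 'hadoop-node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['hadoop-datanode', 'hadoop-tasktracker']
    relabel_configs:
      # Rewrite every discovered address to the node exporter port.
      - source_labels: [__address__]
        regex: '(.*):.*'
        replacement: '${1}:9100'
        target_label: __address__
    # A machine registered in both Consul services is discovered twice,
    # so its node exporter ends up scraped twice per interval.
```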
So is this about running the node exporter? If so, exporting node metrics has no direct relation to Hadoop running on them. Node exporters should probably have their own service. I assume you want to know that these nodes belong to your Hadoop cluster, which Consul tags and relabeling can solve for you.
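A sketch of that suggestion, assuming a dedicated Consul service for the node exporter (service and label names here are illustrative):

```yaml
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      # Carry the Consul tags (e.g. "hadoop") onto the target so
      # node metrics can still be selected by cluster membership.
      - source_labels: [__meta_consul_tags]
        target_label: consul_tags
```

Each machine appears once in this service, so each node exporter is scraped exactly once, and the Hadoop roles stay queryable via the tags.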
In my opinion it doesn't boil down to a "simple merge", unfortunately. But PRs are always welcome – a place to start would probably be to detach the scraping from the targets.
I think this is a valid request. At SC we don't encounter this because our cookbook forces us to work around it anyway, by creating a single service discovery entry for every possible group we want to change.
But I do see the need to "get the node-level statistics from all nodes of groups A and B" where A and B may or may not overlap. It is currently impossible to do this without either duplicating the intersection, or creating another entry for the overlap, which is nothing other than de-duplicating outside of Prometheus.
To be clear, I think scraping the same instance from several jobs should not be merged, but if I specify
dns_sd_configs:
  - names:
      - 'some_target.some_domain'
      - 'other_target.other_domain'
and one endpoint is listed in both SRV records, it should only be scraped and recorded once. The same goes for other SD methods.
(actually, why isn't this deduplicating at the storage layer?)
> (actually, why isn't this deduplicating at the storage layer?)
Getting two scrapes of the same target at different time offsets is going to be confusing, and can't be de-duplicated at the storage layer due to the different timestamps.
But it should end up as one time series with double the sampling rate?
@nikgrok could you describe how the duplication manifests for you?
> But it should end up as one time series with double the sampling rate?
Yes. Probably interesting interactions with #394 too.
Hey guys, trying to restart this discussion. @matthiasr, basically today if we use the Consul config, two services with a machine in both will be scraped twice.
Example:
Hadoop DataNode: srv1 srv2 srv3
Hadoop TaskTracker: srv2 srv3
Total scraped targets: srv1:9100 srv2:9100 srv3:9100 srv2:9100 srv3:9100
I guess the discussion ended at "but why is that so bad?". @fabxc can tell us how invasive the de-duping would be.
Thanks to an in-person discussion with @matthiasr I better understand why the scenario in general is reasonable.
The question indeed remains "but why is that so bad?". There is an increased sampling rate but no race condition or anything like that. Is the impact on the storage relevant? It's certainly a brain-itch that should be fixed eventually, but not one an end user would usually see.
Scraping would essentially have to be detached from targets. This would solve the double scraping but the target would still be discovered multiple times. If the latter is the actual thing that's annoying people, there are easier ways to fix that at the UI level.
@nikgrok It would be good to know which part actually causes frustration for you.
I imagine multiple storage endpoints will be created (in the case of two services with 90% overlap), so you are increasing the space required by 90%. Am I wrong about that?
Also, it means that the collectors will be scraping 90% more, and we already plan on really hammering Prometheus. (Our Hadoop infrastructure is 3.5k machines already.)
That sounds like a very legitimate use case. I'll try to get to this rather sooner than later.
Why do you use the consul results for the tasktracker/datanode to discover node exporters? If you register a separate service for the node exporter, it would only be scraped once. Using your example, shouldn't it be:
Hadoop Datanode
srv1:PORTDATA srv2:PORTDATA srv3:PORTDATA
Hadoop Tasktracker
srv2:PORTTASK srv3:PORTTASK
Node exporters
srv1:9100 srv2:9100 srv3:9100
Total scraped machines
srv1:PORTDATA
srv1:9100
srv2:PORTDATA
srv2:PORTTASK
srv2:9100
srv3:PORTDATA
srv3:PORTTASK
srv3:9100
EDIT: I understood @matthiasr's comment now. If it's about having a common label everywhere, an additional label could be applied to all three (datanode, tasktracker, node exporter) jobs?
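For that shared-label variant, each of the three jobs could attach the same static label via relabeling (the `cluster` label name is hypothetical; the same rule would be repeated in the tasktracker and node exporter jobs):

```yaml
scrape_configs:
  - job_name: 'hadoop-datanode'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['hadoop-datanode']
    relabel_configs:
      # With no source_labels, this sets a fixed label on every target
      # of the job, so queries can select all Hadoop jobs at once.
      - target_label: cluster
        replacement: hadoop
```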
@nikgrok #1064 implements this. You might want to give it a try.
Fixed in 0.18.0rc1
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hey guys,
Today if you are using service discovery (Consul), querying two services in one job will return duplicate scrapes. If possible, can you make scrapes with the same address in the same job unique?