vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Parse out `namespace` #4863

Open ktff opened 3 years ago

ktff commented 3 years ago

With #4833, all sources except prometheus and statsd set the namespace. For those two we can parse the namespace out of the metric name, but an audit of all official Prometheus exporters is needed to verify that the naming conventions are being upheld. Similar research should be done for statsd.
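As a rough illustration of what "parse the namespace out of the name" could mean, here is a minimal Python sketch. It assumes the namespace is the first word of the name and that words are separated by `_` (the usual Prometheus style); the `.` separator shown for statsd is only an assumption pending the research mentioned above. The function name `split_namespace` is hypothetical and is not part of Vector.

```python
from typing import Optional, Tuple


def split_namespace(name: str, sep: str = "_") -> Tuple[Optional[str], str]:
    """Split a metric name into (namespace, remaining name).

    Assumption: the namespace is the first `sep`-delimited word.
    Returns (None, name) when there is nothing sensible to split off.
    """
    head, found, rest = name.partition(sep)
    # No separator, or an empty piece on either side: leave the name alone.
    if not found or not head or not rest:
        return None, name
    return head, rest


print(split_namespace("node_cpu_seconds_total"))
print(split_namespace("up"))
print(split_namespace("app.requests", sep="."))
```

Names without a separator are left untouched, which keeps the single-word case from producing an empty metric name.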

We should also watch out for multi-word namespaces if they show up in the audit.

This feature should be toggleable, with the default state depending on how well the convention is upheld.

binarylogic commented 3 years ago

@ktff yeah, I don't know how we'd detect multi-word namespaces. Do you have any good ideas that would prevent false positives?

ktff commented 3 years ago

I don't know, and I don't think there is a 100% correct automatic way to detect that, so if they are used the best we can do is give users the tools to handle it. Something like a global option in which users could list such multi-word namespaces, and which we would then parse out as such. That would ensure there are no false positives, only false negatives.
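A minimal Python sketch of that idea, assuming `_` as the word separator and a user-supplied list of known multi-word namespaces. The function name `split_with_known` and the example namespace `my_app` are hypothetical and do not correspond to any existing Vector option.

```python
from typing import Iterable, Optional, Tuple


def split_with_known(
    name: str, known: Iterable[str], sep: str = "_"
) -> Tuple[Optional[str], str]:
    """Split off a namespace, preferring user-listed multi-word namespaces.

    Listed namespaces are tried longest first, so a listed "my_app"
    wins over the default first-word guess "my".
    """
    for ns in sorted(known, key=len, reverse=True):
        prefix = ns + sep
        # Require a whole-word match and a non-empty remainder.
        if name.startswith(prefix) and len(name) > len(prefix):
            return ns, name[len(prefix):]
    # Fallback: assume the first word is the namespace.
    head, found, rest = name.partition(sep)
    if not found or not head or not rest:
        return None, name
    return head, rest


print(split_with_known("my_app_requests_total", ["my_app"]))
print(split_with_known("my_application_up", ["my_app"]))
```

Because a listed namespace only matches on a whole-word boundary, an unlisted multi-word namespace simply falls back to the first-word guess, which is the false-negative case described above.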

binarylogic commented 3 years ago

That makes sense. I'm also wondering if we could survey the batch of scraped metrics to make better guesses, for example when multiple metrics share the same multi-word prefix. That can be a follow-up enhancement though.

ktff commented 3 years ago

There are multiple approaches we can take. The simplest one I see: assume the first word is the namespace. Then, within a batch (or after we have observed some number of metrics), if the combination of the first and second words occurs the same number of times as the first word alone, we can assume the second word is part of the namespace. The same applies to subsequent words. After that point we can keep tracking whether the above still holds.

The number of false positives can then be controlled by the number of metrics that must be observed before we make the decision.

Although this won't catch two or more namespaces that share their first and/or subsequent words.
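The counting heuristic can be sketched in Python roughly as follows. This is only one reading of the idea, assuming `_` as the word separator; `count_prefixes`, `infer_namespace`, and the `min_observed` threshold are hypothetical names, and the metric names are made up for the example.

```python
from collections import Counter
from typing import Iterable, Optional


def count_prefixes(names: Iterable[str], sep: str = "_") -> Counter:
    """Count how often each proper word-prefix occurs across metric names."""
    counts: Counter = Counter()
    for name in names:
        words = name.split(sep)
        # Proper prefixes only: at least one word is left for the name itself.
        for i in range(1, len(words)):
            counts[sep.join(words[:i])] += 1
    return counts


def infer_namespace(
    name: str, counts: Counter, min_observed: int = 3, sep: str = "_"
) -> Optional[str]:
    """Extend the namespace past the first word while counts stay equal."""
    words = name.split(sep)
    if len(words) < 2:
        return None
    first = counts[words[0]]
    # Too few observations: keep the plain first-word assumption.
    if first < min_observed:
        return words[0]
    depth = 1
    while depth + 1 < len(words):
        if counts[sep.join(words[: depth + 1])] != first:
            break
        depth += 1
    return sep.join(words[:depth])


batch = ["my_app_requests_total", "my_app_errors_total", "my_app_up"]
print(infer_namespace("my_app_up", count_prefixes(batch)))

# Two namespaces sharing the first word: the heuristic stops at "my".
mixed = batch + ["my_db_queries_total", "my_db_errors_total", "my_db_up"]
print(infer_namespace("my_app_up", count_prefixes(mixed)))
```

Raising `min_observed` delays the decision until more metrics have been seen, which is the knob for trading false positives against false negatives. The second call shows the limitation noted above: once `my_app` and `my_db` appear together, only `my` is detected.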