micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Support Elastic #154

Closed. jkschneider closed this issue 6 years ago.

jkschneider commented 7 years ago

See the docs. This was raised in https://github.com/spring-projects/spring-boot/issues/10449.

antonyb commented 6 years ago

Metricbeat 6.2 will introduce a "dedot.enabled" flag in the HTTP module. It will convert dots (".") in field names to underscores ("_").

https://www.elastic.co/guide/en/beats/metricbeat/6.2/metricbeat-metricset-http-json.html#_dedot_enabled

punkstarman commented 6 years ago

From a technical point of view, the problem for MetricBeat + ElasticSearch will be solved by this dedot.enabled flag.

However, from a "business" viewpoint, it doesn't seem right. Wouldn't it be possible to solve the problem by adding a suffix (e.g. .total, .max, .value) to a key whenever other keys contain the first as a prefix? For instance, change mem to mem.total because mem.free is present.

Just as a reminder, the problem arises from the fact that ElasticSearch interprets dots in keys as map nesting (e.g. mem.total = 4, mem.free = 2 becomes { mem: { total = 4, free = 2 } }). Thus after reading mem = 4, mem.free = 2, ElasticSearch creates an object with value { mem: 4 } and then tries to introduce a sub-field named free into the mem field. This fails because mem contains a scalar type when it needs to contain an object type.

I feel that replacing dots with underscores is just a workaround because we then lose the hierarchical information which is very useful to most users.
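
To make the suffixing idea concrete, here is a rough sketch in plain Java. The names and the .value suffix are purely illustrative; picking a semantically correct suffix such as .total would need more knowledge of the metric than a string comparison can provide.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PrefixSuffixing {
    // For any key that is also a dotted prefix of another key (e.g. "mem" vs "mem.free"),
    // rename it by appending a suffix such as ".value" so the two no longer collide.
    static Map<String, Number> suffixConflictingPrefixes(Map<String, Number> metrics, String suffix) {
        Set<String> names = metrics.keySet();
        Map<String, Number> result = new LinkedHashMap<>();
        for (Map.Entry<String, Number> e : metrics.entrySet()) {
            boolean isPrefixOfAnother = names.stream().anyMatch(n -> n.startsWith(e.getKey() + "."));
            result.put(isPrefixOfAnother ? e.getKey() + suffix : e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Number> metrics = new TreeMap<>();
        metrics.put("mem", 4);
        metrics.put("mem.free", 2);
        // Prints {mem.value=4, mem.free=2}: "mem" was renamed because "mem.free" shares its prefix.
        System.out.println(suffixConflictingPrefixes(metrics, ".value"));
    }
}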

jkschneider commented 6 years ago

Yikes. I wonder if we have OOTB binders that already exhibit this problem?

jkschneider commented 6 years ago

we then lose the hierarchical information which is very useful to most users.

I wonder how useful it really is. In hierarchical metrics systems, hierarchical information is essentially used to encode tag information. But in dimensional systems, naturally that isn't the case. I don't know which kind MetricBeat + Elasticsearch is, not having looked closely at this yet.

In a dimensional system, you just select a metric name and go from there.

If we do decide underscores are better, we don't really need to wait for MetricBeat 6.2. We would just choose snake case as the default naming convention in Micrometer for MetricBeat.
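
For illustration, overriding the convention on a registry is already a one-liner today. This is just a minimal sketch; SimpleMeterRegistry stands in for a future MetricBeat/Elastic registry, which doesn't exist yet.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.NamingConvention;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class SnakeCaseConventionExample {
    public static void main(String[] args) {
        // SimpleMeterRegistry stands in for a future MetricBeat/Elastic registry.
        MeterRegistry registry = new SimpleMeterRegistry();

        // Publish names in snake_case, so e.g. "jvm.memory.used" becomes "jvm_memory_used"
        // and Elasticsearch never sees dots it could interpret as object nesting.
        registry.config().namingConvention(NamingConvention.snakeCase);

        registry.counter("http.server.requests").increment();
    }
}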

punkstarman commented 6 years ago

From what I understand of dimensional systems (most of my experience comes from working with OLAP), dimensions can be hierarchical (e.g. location) with different levels (e.g. continent, country, region, city). Datapoints in this system form a hypercube where each axis is a dimension. One seldom analyzes datapoints at the lowest level and on all dimensions: queries often roll up, drill down, or slice dimensions. Thus each dimension must define an aggregation operation (e.g. sum, max, min, average).

This seems to fit metrics. Memory can be a hypercube with two dimensions: time and allocation state. mem.free = 2 reports a datapoint at a given time for memory that is in the free state. mem = 4 gives an aggregate of datapoints where the allocation state dimension is rolled up. The appropriate aggregation operation for memory would be a sum. Thus mem = 4 reports the total amount of memory regardless of allocation state.

HTTP responses can form a hypercube with many more dimensions: time, client IP, client location, route, controller, and HTTP status, to name a few.

Hierarchy in metrics is useful because it lets the user know what can be rolled up and what can be drilled down. For instance, how would you determine which metrics make sense in a pie chart? If you select metrics at the same level of a hierarchy, the question is easy. mem doesn't go with mem.free, but mem.allocated does.

checketts commented 6 years ago

In other systems, like Prometheus, metrics that are part of the same whole have the same name and distinguish the composition by their tags. Take the 'jvm_heap' metric as an example.

It has tags that indicate which heap it is reporting on, and allows aggregation or charting on those tags.

Are you saying Metricbeat needs hierarchical naming for aggregation, or are you just mentioning the use case as an improvement to the user experience when querying?

punkstarman commented 6 years ago

I'm not saying that MetricBeat needs hierarchical information as a technical requirement.

I am saying a user might want them in order to more easily understand the metrics, analyze them, and set up dashboards in Kibana (as part of an ElasticSearch + MetricBeat + Kibana stack).

jkschneider commented 6 years ago

@punkstarman I think I understand your point. After all, if all cache metrics start with cache., some UIs will take advantage of that implicit hierarchy to group cache-related metrics, even if that has little impact on the technical capabilities of the system with regard to charting different cache metrics.

Do you think, to avoid the prefix problem, that we are better off using snake or camel casing on names, even if it means sacrificing this kind of presentation? Bear in mind that the naming convention on a particular registry implementation can always be set by the user. So even if we made the default snake case, "expert" users could combine a dot naming convention with potentially a meter filter that does just-in-time suffixing.

I guess I fear the general problem of adding a suffix as a default Micrometer behavior. Suppose we add .total as the suffix, like in your example. What happens when we encounter user-defined metrics mem and mem.total? I can imagine, however, that if you have enough control over your metrics to know this won't be a problem, then this solution is preferable.
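
For illustration, a just-in-time suffixing filter along those lines might look like the following sketch. The bare mem meter and the .value suffix are only the hypothetical example from earlier in the thread, not a proposed default.

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class SuffixingMeterFilterExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Rename the bare "mem" meter to "mem.value" so it no longer collides with
        // "mem.free" once Elasticsearch maps the dots to nested objects.
        registry.config().meterFilter(new MeterFilter() {
            @Override
            public Meter.Id map(Meter.Id id) {
                return "mem".equals(id.getName()) ? id.withName("mem.value") : id;
            }
        });

        registry.gauge("mem", 4);
        registry.gauge("mem.free", 2);
    }
}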

punkstarman commented 6 years ago

@jkschneider Can you give me at least until this Sunday to set up a PoC project on Github with the latest milestone of Spring Boot and the latest releases of MetricBeat, ElasticSearch and Kibana to show the problem and better reply to your question?

I first reported this issue to the Spring Boot project after finding limitations with the old Spring Boot 1.x stack, then I did a test with Spring Boot 2.x that had another problem. If I remember correctly, though, the new metrics system with Micrometer introduced changes in the schema which may change the way to address the problem.

Before spending more time, I'd like to revisit the issue.

jkschneider commented 6 years ago

Sure thing. And to be clear, we haven't written MetricBeat exposition yet. So if you previously scraped the actuator endpoint to send metrics to MetricBeat, just keep in mind that this wouldn't be how we would approach it.

If you could let me know what the ideal representation is to maximize MetricBeat/ES/Kibana's capabilities, that's what we'll do.

punkstarman commented 6 years ago

Will do. I'll keep this issue posted.

punkstarman commented 6 years ago

OK. So I set up a very very simple Spring Boot 2.0.0 RC1 project with a web service and the /actuator/metrics endpoint exposed.

When GETting /actuator/metrics, I get the following keys: "http.server.requests", "jvm.buffer.count", "jvm.buffer.memory.used", "jvm.buffer.total.capacity", "jvm.gc.live.data.size", "jvm.gc.max.data.size", "jvm.gc.memory.allocated", "jvm.gc.memory.promoted", "jvm.gc.pause", "jvm.memory.committed", "jvm.memory.max", "jvm.memory.used", "jvm.threads.daemon", "jvm.threads.live", "jvm.threads.peak", "logback.events", "process.cpu.usage", "process.start.time", "process.uptime", "system.cpu.count", "system.cpu.usage", "system.load.average.1m".

None of them exhibits the problem I described :smile:

punkstarman commented 6 years ago

I'll have a look at what's involved in making a registry that would be suitable for MetricBeat.

ruflin commented 6 years ago

From the Metricbeat perspective, I personally would prefer that the dedot feature be more of a last resort than the default. Dots which are converted to objects can also be used in Elasticsearch to add special params so that sub-objects are not indexed, for example: https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html#object-params This becomes useful if there is lots of data and some of it is just meta information which does not need indexing.

One trick we sometimes apply in Metricbeat when we have a conflict is to add .value to the entry. So if there is mem and mem.free, mem becomes mem.value, as was also suggested here: https://github.com/micrometer-metrics/micrometer/issues/154#issuecomment-361931231

Long term, I'm hoping we can standardise on these postfixes / prefixes so the UI can make use of them.

For the conflict of objects with keys I think there are 2 cases where it can happen:

The second case is only detected at ingest time into Elasticsearch. But for the first case I think there should be some functionality in Metricbeat to resolve it, for example by automatically adding .value or .count to the key. This would mean Micrometer would not have to deal with it.

The dedot feature was initially introduced for labels in Docker, where I think it still makes sense. But in the context of actual metrics I'm now thinking we should add an option to convert an event with dotted keys to a valid JSON object (without dots in the keys). The main thing it requires is that both conflicting keys always exist together in the event.

Side note: Metricbeat 6.2 is now out.

jkschneider commented 6 years ago

@ruflin Wavefront's JSON-based ingest API had a similar structural problem in that a metrics payload was represented by a JSON map like this:

{
   "my.counter": { "tags": { "k": "v1"}, ... },
   "my.counter": { "tags": { "k": "v2"}, ...},
   ...
}

In this case, when we have two metrics that differ only by tags, the JSON is unparsable. So they solved this problem by having their ingest API strip any $ suffix on a metric name. So the payload winds up being something like this:

{
   "my.counter$1": { "tags": { "k": "v1"}, ... },
   "my.counter$2": { "tags": { "k": "v2"}, ...},
   ...
}

This works for them because their database's internal representation of tagged metrics doesn't have the same limitation as the JSON payload necessary to get it there.

Can we do something similar?

tkp1n commented 6 years ago

You may also want to take a look at: https://github.com/elastic/elasticsearch-metrics-reporter-java It is no longer maintained, but was actually quite a useful tool for getting Dropwizard metrics to ES directly.

I think it should be possible to provide an equivalent solution for Micrometer. Not sure if I'll find the time, but would a PR (adding micrometer/implementations/micrometer-registry-elastic) be considered?

jkschneider commented 6 years ago

Yes, micrometer-registry-elastic will exist at the end of this.

ruflin commented 6 years ago

@jkschneider I think the approach we should follow is that in case of a conflict we add .value or .counter, as otherwise the metric names become a bit unreadable. In the end, it's what the user types into their query.

@tkp1n On the suggested metrics reporter: it's kind of at the other end of what Metricbeat does today with regard to the data format. That is one event per metric, whereas Metricbeat has one event for many metrics, so that all metrics which were collected together are in one place, including additional meta information about the environment etc. This makes it very powerful for correlation. It's unlikely that Metricbeat will change its format, but one could be converted into the other.

@tkp1n For the dropwizard metrics: Did you ever try the dropwizard module in Metricbeat?

tkp1n commented 6 years ago

@ruflin

including additional meta information about environment etc.

In Micrometer this additional information would be stored in global or meter-specific tags. In my PR all tags get reported to Elasticsearch, thus allowing for the same kind of aggregations. The approach I took in my PR is basically to cut out the middleman (Metricbeat).

This makes it very powerful for correlation.

Can you provide an example of a correlation which is not possible with the approach taken in my PR? Would that kind of correlation be possible with the other metricbeat implementations/reporters?

For the dropwizard metrics: Did you ever try the dropwizard module in Metricbeat?

I did not, no. We are using a fork of the above-mentioned elasticsearch-metrics-reporter-java, which sadly does not allow for much correlation. All the missing features I could think of are addressed either in Micrometer itself or in my PR.

ruflin commented 6 years ago

@tkp1n I haven't looked at the details of your PR yet. Please excuse me if some of this is already in there.

For the meta data: the meta data I was referring to is, for example, information about the container or the k8s interface the data is coming from. We have kind of "standardised" on the naming of this data to make sure that, for example, logs and metrics can be correlated. I assume you could also add this meta data to each single metric in your case. Can you share an example JSON doc that your PR sends to ES?

I'm ok with cutting out the middleman if this simplifies the workflow and setup. The part I'm worried about is that the metric structure which ends up in Elasticsearch will be quite different. This means that if a user is using both Metricbeat and Micrometer metrics, he will have to interact rather differently with the data. The other part is the additional functionality that Metricbeat provides, like the Docker or k8s enrichment of the metric data. I don't know enough about micrometer-metrics to say whether that is also something it could add.

In any case, I don't think this discussion should block progress on https://github.com/micrometer-metrics/micrometer/pull/429 as there could also be 2 different ways to get data into ES.

jkschneider commented 6 years ago

@ruflin All good stuff. To the extent possible, we should match up structure. Data passing through MetricBeat vs. coming straight from Micrometer needs to be indistinguishable. We've got a history of doing this already: data going straight to the Datadog API (DatadogMeterRegistry) vs. through dogstatsd (StatsdMeterRegistry) should be indistinguishable (though perhaps with a little extra richness in the former), allowing the user to flip between the implementations at their convenience.

information about the container or the k8s interface the data is coming from.

I'm guessing we can simply add the same data as common tags?
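
For example, something along these lines. The tag keys and values are purely illustrative, not Beats' standardized field names; in practice they would come from the environment rather than being hard-coded.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class EnvironmentCommonTagsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Illustrative container/k8s metadata applied as common tags; real deployments
        // would read these values from injected pod metadata or environment variables.
        registry.config().commonTags(
                "kubernetes.pod.name", "my-pod-7d9f",
                "kubernetes.namespace", "production",
                "docker.container.id", "abc123");

        // Every meter registered afterwards carries these tags.
        registry.counter("http.server.requests").increment();
    }
}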

tkp1n commented 6 years ago

@ruflin & @jkschneider Wouldn't you consider adding 'information about the container or the k8s interface' a cross-cutting concern? I think we should focus on reporting to Elasticsearch in this issue and consider filing a separate (more general) one regarding some sort of automation for adding environmental information to the registry. This should benefit all users, not just the ones using Elastic.

Regarding the indistinguishable data format: I couldn't find any documentation on the reporting format of Metricbeat to Elasticsearch. I personally am not a big fan of reverse-engineering, and maintaining compliance with, a third-party data format. But I guess it would be possible.

Currently the data format of PR #429 looks roughly as follows (one of the following records per meter and reporting interval):

{
    "index": {
        "_index": "metrics-*",
        "_type": "counter"
    }
}
{
    "@timestamp": "2001-07-04T12:08:56.235-0700",
    "name": "my_counter",
    "tag1": "value1",
    "tag2": "value2",
    "count": 123,
    "max": 456,
    "mean": 7.89,
    "sum": 9876
}

@ruflin Could you provide a sample as reported by metricbeat? I wonder how big of a difference there really is.

jkschneider commented 6 years ago

@tkp1n I looked at the metricbeat format myself earlier, and there isn't anything too special about it:

{
  "@timestamp": "2018-02-16T17:19:03.627Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "doc",
    "version": "6.2.0"
  },
  "metricset": {
    "name": "filesystem",
    "module": "system",
    "rtt": 4698
  },
  "system": {
    "filesystem": {
      "total": 0,
      "type": "autofs",
      "used": {
        "pct": 0,
        "bytes": 0
      },
      "free": 0,
      "available": 0,
      "files": 0,
      "device_name": "map auto_home",
      "mount_point": "\/home",
      "free_files": 0
    }
  },
  "beat": {
    "name": "Jons-MBP",
    "hostname": "Jons-MBP",
    "version": "6.2.0"
  }
}

It does appear that it takes a metric name like system.filesystem.total and nests it in JSON.
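
A rough sketch of that nesting step follows; this is not Metricbeat's actual code, just an illustration of the transformation and of where a prefix collision like mem vs. mem.free would surface.

import java.util.LinkedHashMap;
import java.util.Map;

public class DottedNameNesting {
    // Place "system.filesystem.total" = 0 into {"system": {"filesystem": {"total": 0}}}.
    @SuppressWarnings("unchecked")
    static void put(Map<String, Object> root, String dottedName, Object value) {
        String[] parts = dottedName.split("\\.");
        Map<String, Object> node = root;
        for (int i = 0; i < parts.length - 1; i++) {
            // Throws ClassCastException if a prefix (e.g. "mem") was already stored as a
            // scalar, which is the same collision Elasticsearch runs into with its mappings.
            node = (Map<String, Object>) node.computeIfAbsent(parts[i], k -> new LinkedHashMap<String, Object>());
        }
        node.put(parts[parts.length - 1], value);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        put(doc, "system.filesystem.total", 0);
        put(doc, "system.filesystem.free", 0);
        System.out.println(doc); // {system={filesystem={total=0, free=0}}}
    }
}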

tkp1n commented 6 years ago

@jkschneider Indeed nothing too fancy. It raises some questions however:

It appears to me that the format proposed in #429 is both simpler to construct and capable of handling more information per metric. I'm not a Kibana specialist, but it also appears to be easier to filter, aggregate and analyze.

Am I missing something here? I really don't understand why they have chosen that format.

ruflin commented 6 years ago

@jkschneider already showed a good example of a Metricbeat event. Here is another one, for Docker containers: https://github.com/elastic/beats/blob/master/metricbeat/module/docker/container/_meta/data.json In the GitHub repo there is a JSON example for each generated event. The reason I linked the Docker example is that the above system event could be enriched with the Docker meta data in case it's running inside Docker.

I think what you have as tags above is what we have in the fields: {} namespace, or in Docker it is labels: ....

The format in Metricbeat defines the type of each field. In your example above, sum is an integer. In some cases sum can also be a float, so if you try to ingest an event with sum: 1.7 you will get an error. The Metricbeat format also allows storing keywords or text as metrics. That ES allows storing not only numbers as metrics is something I personally see as an advantage of ES.

The main advantage of the Metricbeat format on the Kibana side is that a user can look at all metrics from one point in time at once. It's like looking at the API response of a metrics endpoint that normally exposes multiple metrics at once. This makes it possible for a user to see if there is some correlation between different metrics misbehaving. From an aggregation / filtering / analyzing perspective I think there is not much difference between the two.

The main challenge is collision, but so far that hasn't been an issue in Metricbeat because we were in full control of the generated data. That also took care of the multitude-of-values issue. Having added the HTTP endpoint in Metricbeat obviously brings new challenges, which I think are solvable through some conventions. In the end, the event created by #429 is also a convention. I started working on such a convention some time ago but haven't given it much priority; it seems this discussion changes the priority a bit here :-)

jkschneider commented 6 years ago

We should check this out too: https://github.com/Yelp/elastalert

jkschneider commented 6 years ago

Merged in 3ec44d958736b0e4c6979479cbc2982fd491384b