hoffie / multilog_exporter

Watches log files and exposes data as Prometheus metrics
Apache License 2.0
14 stars 4 forks source link

Fix 'now()' not working on label values #1

Closed kir4h closed 4 years ago

kir4h commented 4 years ago

First of all, thanks for your project! While browsing for a way to generate metrics from arbitrary logs, I didn't want to go to complex/grok based solutions since my requirement was pretty simple. I thought about writing this functionality myself only to find out someone had already done so :)

According to the documentation, the function now() should be valid for label values:

Supported values (for metric results and label values):

  • Static numbers, such as 1, 10.3e5
  • Regex group references, such as $pool from the regexp pool=(?P<pool>[^ ]+) msg
  • now() which returns the current Unix time

This PR makes now() evaluated also for labels.

I have split the PR into 3 commits:

Feel free to drop any commit or let me know if any changes required/suggested.

hoffie commented 4 years ago

Thanks for the PR, looks interesting, will have a deeper look in the coming days. :)

hoffie commented 4 years ago

I really like it code-wise, thanks for your effort!

I am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Can you describe what the use case for this is? Based on that, we can better judge whether to merge this as-is or if there are any other options for achieving your goals.

I agree that something has to be fixed -- the docs are currently wrong. Either the feature gets implemented or the docs will have to be corrected.

Let me know what you think.

kir4h commented 4 years ago

am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Indeed! I realized this some days after raising my PR, while making usage of the output :)

Can you describe what the use case for this is?

Scenario My journey began with a set of shell scripts I wanted to instrument. I started using Pushgateway for this, by adding a small logic to some of them to push their metrics. But then I realized I shouldn't "contaminate" the scripts with prometheus awareness, and ended up thinking that "something" should derive that info from the logs and generate those metrics (that's how I ended up using this project)

When running such scripts regularly through cron jobs, a nice metric is being aware on when the script was successfully executed for the last time (so that I can generate an alert for instance if it doesn't happen when it's supposed to).

While using this project, I realized I had lost one of the goodies from pushgateway: for every pushed metric for a given group, generates additional metrics (push_time_seconds and push_failure_time_seconds) that allowed knowing when was the last time the metric was reported.

First approach

In order to overcome this, I though of using labels for my metric. Whenever a regexp was matched in a log file, and that a corresponding metrics was to be issued, a label was added to the timestamp. This way, I could read this value in Prometheus/Grafana and do something about it.

It was my first contact with metric generation, so only later I realized the cardinality issue: Each metric, because of having different label value for timestamp, was increasing cardinality, and this would continue happening on every metric issuing. At that point I should have come back and commented it in this PR, but it didn't cross my mind, sorry!

Second approach

My second (and current approach) has been to add a "sidecar" metric to every match, that reports this value

logs:
- path: /logs/my.log
  patterns:
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: duration_seconds
    type: gauge
    help: Execution time in seconds
    action: set
    continue: true
    value: $seconds
    labels:
      instance: myInstance
      job: myJobName
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: last_execution
    type: gauge
    help: Last execution time in seconds
    action: set
    continue: true
    value: now()
    labels:
      instance: myInstance
      job: myJobName

This isn't ideal because for every match, I need to inject a repetitive block that issues this "sidecar metric"

Possible third approach

I was thinking that a possible approach would be multilog_exporter mimicking pushgateway behaviour, having a boolean flag that when set would produce an additional metric reporting the last time the metric was updated for a given group.

I still have little Prometheus knowledge so I might be missing something, but I guess if Pushgateway does this is because there is a legitimate use case for it.

What do you think?

hoffie commented 4 years ago

Your second approach would be what I have suggested with the current version, yes. :)

I was thinking that a possible approach would be multilog_exporter mimicking pushgateway behaviour, having a boolean flag that when set would produce an additional metric reporting the last time the metric was updated for a given group.

This sounds like a useful enhancement. Would probably nice to be able to control that by metric? Would you want to work on a PR for that one? Guess a new one would make sense.

kir4h commented 4 years ago

Closing this in favour of #3