Fix 'now()' not working on label values

kir4h commented 4 years ago

First of all, thanks for your project! While browsing for a way to generate metrics from arbitrary logs, I didn't want to go to complex/grok based solutions since my requirement was pretty simple. I thought about writing this functionality myself only to find out someone had already done so :)

According to the documentation, the function now() should be valid for label values:

Supported values (for metric results and label values):

Static numbers, such as 1, 10.3e5

Regex group references, such as $pool from the regexp pool=(?P<pool>[^ ]+) msg

now() which returns the current Unix time

This PR makes now() evaluated also for labels.

I have split the PR into 3 commits:

The fix itself
An amendment to the test of the now() usage in metric values. Please correct me if I'm wrong or misunderstood the test
Updating of the packages through dep ensure

Feel free to drop any commit or let me know if any changes required/suggested.

hoffie commented 4 years ago

Thanks for the PR, looks interesting, will have a deeper look in the coming days. :)

hoffie commented 4 years ago

I really like it code-wise, thanks for your effort!

I am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Can you describe what the use case for this is? Based on that, we can better judge whether to merge this as-is or if there are any other options for achieving your goals.

I agree that something has to be fixed -- the docs are currently wrong. Either the feature gets implemented or the docs will have to be corrected.

Let me know what you think.

kir4h commented 4 years ago

am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Indeed! I realized this some days after raising my PR, while making usage of the output :)

Can you describe what the use case for this is?

Scenario My journey began with a set of shell scripts I wanted to instrument. I started using Pushgateway for this, by adding a small logic to some of them to push their metrics. But then I realized I shouldn't "contaminate" the scripts with prometheus awareness, and ended up thinking that "something" should derive that info from the logs and generate those metrics (that's how I ended up using this project)

When running such scripts regularly through cron jobs, a nice metric is being aware on when the script was successfully executed for the last time (so that I can generate an alert for instance if it doesn't happen when it's supposed to).

While using this project, I realized I had lost one of the goodies from pushgateway: for every pushed metric for a given group, generates additional metrics (push_time_seconds and push_failure_time_seconds) that allowed knowing when was the last time the metric was reported.

First approach

In order to overcome this, I though of using labels for my metric. Whenever a regexp was matched in a log file, and that a corresponding metrics was to be issued, a label was added to the timestamp. This way, I could read this value in Prometheus/Grafana and do something about it.

It was my first contact with metric generation, so only later I realized the cardinality issue: Each metric, because of having different label value for timestamp, was increasing cardinality, and this would continue happening on every metric issuing. At that point I should have come back and commented it in this PR, but it didn't cross my mind, sorry!

Second approach

My second (and current approach) has been to add a "sidecar" metric to every match, that reports this value

logs:
- path: /logs/my.log
  patterns:
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: duration_seconds
    type: gauge
    help: Execution time in seconds
    action: set
    continue: true
    value: $seconds
    labels:
      instance: myInstance
      job: myJobName
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: last_execution
    type: gauge
    help: Last execution time in seconds
    action: set
    continue: true
    value: now()
    labels:
      instance: myInstance
      job: myJobName

This isn't ideal because for every match, I need to inject a repetitive block that issues this "sidecar metric"

Possible third approach

I was thinking that a possible approach would be multilog_exporter mimicking pushgateway behaviour, having a boolean flag that when set would produce an additional metric reporting the last time the metric was updated for a given group.

I still have little Prometheus knowledge so I might be missing something, but I guess if Pushgateway does this is because there is a legitimate use case for it.

What do you think?

hoffie commented 4 years ago

Your second approach would be what I have suggested with the current version, yes. :)

I was thinking that a possible approach would be multilog_exporter mimicking pushgateway behaviour, having a boolean flag that when set would produce an additional metric reporting the last time the metric was updated for a given group.

This sounds like a useful enhancement. Would probably nice to be able to control that by metric? Would you want to work on a PR for that one? Guess a new one would make sense.

kir4h commented 4 years ago

Closing this in favour of #3

hoffie / multilog_exporter

Fix 'now()' not working on label values #1