influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Outputs.Stackdriver `tags_as_resource_labels` config option doesn't appear to be working as expected #13911

Closed crflanigan closed 1 year ago

crflanigan commented 1 year ago

Relevant telegraf.conf

## Relevant Agent Configuration

[inputs.mem]
  [inputs.mem.tags]
    job = "inputs.mem"

[inputs.processes]
  [inputs.processes.tags]
    job = "inputs.processes"

[[outputs.opentelemetry]]
  service_address = "server:4317"

## Relevant Server Config
## Server (Values that are {{}} are hydrated by a key value store when
## Telegraf is deployed):

[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 199
  metric_buffer_limit = {{.METRIC_BUFFER}}
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = {{.DEBUG}}
  quiet = false

# Receive OpenTelemetry traces, metrics, and logs over gRPC
[[inputs.opentelemetry]]
  ## Override the default (0.0.0.0:4317) destination OpenTelemetry gRPC service
  ## address:port
  service_address = "0.0.0.0:4317"

[[processors.regex]]
  # Rename metric names (measurement names) that don't match Prometheus requirements
  [[processors.regex.metric_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename tag keys that don't match Prometheus requirements
  [[processors.regex.tag_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

  # Rename field keys that don't match Prometheus requirements
  [[processors.regex.field_rename]]
    pattern = "[^a-zA-Z0-9_]+"
    replacement = "_"

# Configuration for sending metrics to GMP
[[outputs.stackdriver]]
  project = "{{.PROJECT_ID}}"
  resource_type = "prometheus_target"
  metric_name_format = "official"
  metric_data_type = "double"
  metric_type_prefix = "prometheus.googleapis.com"
  tags_as_resource_label = ["instance", "job"]
  # Ignore metrics from inputs.internal
  namedrop = ["internal_*"]
  [outputs.stackdriver.resource_labels]
    cluster = "{{.CLUSTER_NAME}}"
    job = "Telegraf"
    instance = "{{.CLUSTER_NAME}}"
    location = "{{.LOCATION}}"
    namespace = "{{.NAMESPACE_LABEL}}"

Logs from Telegraf

The logs are unremarkable, but when I use outputs.file to dump what the agent is sending to outputs.stackdriver, I can see that the tags are set correctly.

Example: the memory metric has job=inputs.mem and the processes metric has job=inputs.processes:

mem,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.mem active=965160960i,available=1003286528i,used_percent=94.1601037979126,available_percent=5.839896202087402,inactive=954462208i,wired=693956608i,total=17179869184i,used=16176582656i,free=48824320i 1694460660000000000
processes,env=Production_MacOS,host=hostname,instance=hostname,job=inputs.processes blocked=0i,zombies=1i,stopped=0i,running=3i,sleeping=428i,total=432i,unknown=0i,idle=0i 1694460660000000000

But when I look at the metrics in Google or Grafana, I notice that the metrics are sometimes tagged with another input's tag, e.g. job=inputs.mem on a processes metric. From what I can tell it's usually the busiest metric. Additionally, it appears that the last metric's tag to be sent is the one used for the entire batch of metrics, including metrics that don't contain a job tag at all. In other words, the default value isn't applied; whatever the last value of the tag was gets used instead.

### System info

1.27.4

### Docker

_No response_

Steps to reproduce

  1. Send metrics from a Telegraf client with a tag called job configured for each input plugin to a Telegraf server
  2. Configure the Telegraf server such that it uses tags_as_resource_labels for job tag.
  3. Send metrics to Google ...

Expected behavior

Each metric will have its job tag (or any other applicable tag) applied as a resource label.

Actual behavior

The tag appears to change depending on which tag was sent last. As you can see in the screenshot, the job tag changed between inputs.mem and inputs.processes even though the sending agent didn't change. Here's an example of it applying a mismatched job label to two different inputs:

image

image

What also seems weird is that you can stop sending the tag altogether and the last tag sent continues to be applied instead of the default value:

[inputs.mem]
  # [inputs.mem.tags]
  #   job = "inputs.mem"

[inputs.processes]
  # [inputs.processes.tags]
  #   job = "inputs.processes"

Telegraf Client outputs.file output showing no job tag:

mem,env=Production_MacOS,host=hostname,instance=hostname total=17179869184i,available_percent=5.273199081420898,active=885673984i,free=24256512i,wired=659767296i,available=905928704i,used=16273940480i,used_percent=94.7268009185791,inactive=881672192i 1694532100000000000
image

Additional info

No response

crflanigan commented 1 year ago

@powersj Created per your request.

crflanigan commented 1 year ago

FWIW I duplicated this behavior on a standard Ubuntu server as well, in addition to our Kubernetes Telegraf deployment.

I sent with no job tag on the metric (which initially defaults to whatever you have in the job field), then with the job tag (which changed to the job tag of the busiest metric), then with only one input having a tag (which created a new time series with the different tag).

image
powersj commented 1 year ago

See #13912 artifacts in 20-30mins

crflanigan commented 1 year ago

Thanks!

crflanigan commented 1 year ago

Ok,

I think I see the logs you're looking for:

Creating timeseries:
mem_active map[env:Production_MacOS host:hostname] map[gauge:1370566656] 1694536580000000000
metric labels: map[env:Production_MacOS host:hostname]
resource labels: map[cluster:onprem instance:hostname job:inputs.mem location:us-east1-b namespace:store_pc_metrics project_id:io1-sandbox]

This looks right, but when I look at the metric in Google I see the job resource label as inputs.disk:

image
powersj commented 1 year ago

That does look right from telegraf. Is this something you could ask Google? Happy to work through this further, but I'm not sure what else to look at given the labels do look right.

crflanigan commented 1 year ago

I sure can! Is there any reason why this time series would have its label changed as it's being sent?

powersj commented 1 year ago

I wondered, but nothing catches my eye. Once we create the time series we add it to a bucket for each metric. Then we split the buckets into batches of 200 at a time and send them, see here.
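Roughly what that batching looks like, as a minimal Go sketch with made-up names (timeSeries, sendRequest) standing in for the real types and the CreateTimeSeries call — not the plugin's actual code:

package main

import "fmt"

// timeSeries stands in for the monitoring API's TimeSeries type.
type timeSeries struct {
    metric string
    value  float64
}

// sendRequest stands in for the request sent to Google.
func sendRequest(batch []timeSeries) {
    fmt.Printf("sending %d time series\n", len(batch))
}

func main() {
    // Pretend these were gathered from the per-metric buckets.
    series := make([]timeSeries, 450)

    // Split into batches of at most 200 and send each one.
    const batchSize = 200
    for start := 0; start < len(series); start += batchSize {
        end := start + batchSize
        if end > len(series) {
            end = len(series)
        }
        sendRequest(series[start:end])
    }
}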

Let me know what you hear.

crflanigan commented 1 year ago

From Google:

The only things I can think of are either 1. Telegraf logging isn't correct, and it's sending the wrong label to us (maybe you can check by using Wireshark or something) or 2. it's sending a different metric name when you mutate job, so your query isn't retrieving the right data or 3. the new data is being rejected for some unrelated reason, causing it to not make it into our system at all. Kinda just stabs in the dark...

crflanigan commented 1 year ago

One other thing I should emphasize is that once a tag covered by the tags_as_resource_label setting (like job) comes through, even if I stop the Telegraf client and send without the job tag, the resource label is still showing in Google with whatever it was set as last (job="inputs.disk").

powersj commented 1 year ago

even if I stop the Telegraf client and send without the job tag, the resource label is still showing in Google with whatever it was set as last (job="inputs.disk").

Doesn't that eliminate telegraf as the cause?

Out of his options, I would second the Wireshark one if we really do think it is a telegraf issue.

crflanigan commented 1 year ago

Doesn't that eliminate telegraf as the cause?

I mean, I'm not sure. The problem resolves if I restart the agent. It almost seems like, when sending, the label is set to whatever it was set to last.

Out of his options I would second the wireshark if we really do think it is a telegraf issue.

I agree, just not sure how to decrypt the traffic since this is sending to Google. Would be lovely to output the data stream to a file instead of to Google or something.

crflanigan commented 1 year ago

Seeing if I can get some logging from Google of what we're sending.

powersj commented 1 year ago

I pushed some additional logging right before we send the request. The only thing I would wonder is if something is getting mixed up during the request creation.

crflanigan commented 1 year ago

Hi @powersj,

Here's the response from Google:

I'm seeing the following in the logs:

prometheus.googleapis.com/mem_available_gauge/gauge io1-sandbox us-east1-b onprem store_pc_metrics inputs.processes HOSTNAME REDACTED

I'm also seeing io1-sandbox us-east1-b onprem store_pc_metrics inputs.mem HOSTNAME REDACTED

and

io1-sandbox us-east1-b onprem store_pc_metrics inputs.disk HOSTNAME REDACTED

So it looks like all 3 values are making it in... however I can't really reconstruct any time series through the logs so all I can really say is that over the last few days we have seen all 3 values.

It does look like you can see all 3 values in Metrics Explorer as well. inputs.processes and inputs.mem are only transiently seen, however - the most recent data is for inputs.disk:

image

crflanigan commented 1 year ago

I pushed some additional logging right before we send the request. The only thing I would wonder is if something is getting mixed up during the request creation.

Sure, I can test it!

crflanigan commented 1 year ago

Ok, I tested it.

There are a lot of new lines that look like this:

metric: type:"prometheus.googleapis.com/disk_used_percent_gauge/gauge" labels:{key:"device" value:"disk3s6"} labels:{key:"env" value:"Production_MacOS"} labels:{key:"fstype" value:"apfs"} labels:{key:"host" value:"hostname"} labels:{key:"mode" value:"rw"} labels:{key:"path" value:"/System/Volumes/VM"} points: [interval:{end_time:{seconds:1694630580}} value:{double_value:29.258641050723256}]

Is that what you wanted? Did you want me to email you the log file? If so, how long do you want me to let the agent run?

powersj commented 1 year ago

@crflanigan my apologies for missing this

What I was going to ask next is to go through the logs and see if you could find any examples of a metric with the wrong label.

Feel free to mail me the logs as well and I can look through them too.

crflanigan commented 1 year ago

Hi @powersj,

When I looked at the logging I actually didn't see any of the tags represented (as shown above).

One thing to keep in mind, though, is that if I send two metrics with different job tags, whichever job tag was sent last is applied to both. Additionally, if I disable the job tag for one of the input plugins (comment it out) while the other one is still active, the metric without a job tag defined gets the job label of the other metric. On top of that, if I disable the job tag for both plugins, the last defined tag (in this case inputs.disk) still keeps being sent. Example:

# Note that inputs.disk generates far more metrics than inputs.mem and will have more job tags
[inputs.disk]
  [inputs.disk.tags]
      job = "inputs.disk"

[inputs.mem] 
  [inputs.mem.tags]
      job = "inputs.mem"

Note: The inputs.disk and inputs.mem values in the attached images are the values for the job resource label:

image image

As you can see, there's the correct label for inputs.mem, because it worked correctly once, but then the more aggressive inputs.disk metric came along and took precedence.

After that I went ahead and disabled the job tag for inputs.mem:

[inputs.disk]
  [inputs.disk.tags]
      job = "inputs.disk"

[inputs.mem] 
  #[inputs.mem.tags]
      #job = "inputs.mem"

And even still the job label was applied to the inputs.mem metric:

image

Then I went ahead and removed the job label from both inputs, and still the job label persists. Here's the outputs.file output before it's sent to the Telegraf proxy (outputs.stackdriver) server (note there is no job label):

disk,device=disk3s5,env=Production_MacOS,fstype=apfs,host=host,instance=host,mode=rw,path=/System/Volumes/Data total=494384795648i,free=358247702528i,used=136137093120i,used_percent=27.536666644766534,inodes_total=3500345821i,inodes_free=3498512720i,inodes_used=1833101i 1695658020000000000
mem,env=Production_MacOS,host=host,instance=host available=827138048i,used=16352731136i,active=822763520i,wired=989003776i,total=17179869184i,used_percent=95.18542289733887,available_percent=4.814577102661133,free=7979008i,inactive=819159040i 1695658020000000000

But the inputs.disk job label is still persisting.

image

Does that help?

crflanigan commented 1 year ago

Sent you the logs via email.

Thanks again!

powersj commented 1 year ago

Does that help?

Unfortunately, I am not following. Let's look back at the code and make sure my understanding is correct:

Resource labels are generated for each metric; however, they are also set initially via the resource_labels config option. You have the following:

  [outputs.stackdriver.resource_labels]
    cluster = "{{.CLUSTER_NAME}}"
    job = "Telegraf"
    instance = "{{.CLUSTER_NAME}}"
    location = "{{.LOCATION}}"
    namespace = "{{.NAMESPACE_LABEL}}"

This means that all metrics should get the above resource labels no matter what. You also have the following tags_as_resource_label set:

  tags_as_resource_label = ["instance", "job"]

This means that if we come across a metric with a tag called instance or job, we set those resource labels as well and remove them from the metric labels.

There is no sharing between metrics beyond what s.ResourceLabels is set to in your config. Once created, the time series is not modified in any way by Telegraf either.
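As a minimal Go sketch of that per-metric handling as described above (variable names are made up; this is not the plugin's actual code):

package main

import "fmt"

func main() {
    // Resource labels from [outputs.stackdriver.resource_labels]; job = "Telegraf"
    // is the default used when a metric carries no job tag.
    configResourceLabels := map[string]string{"cluster": "onprem", "job": "Telegraf"}
    tagsAsResourceLabel := []string{"instance", "job"}

    // Tags on one incoming metric.
    metricTags := map[string]string{"host": "hostname", "job": "inputs.mem"}

    // Start each metric from a fresh copy of the configured resource labels...
    resourceLabels := make(map[string]string, len(configResourceLabels))
    for k, v := range configResourceLabels {
        resourceLabels[k] = v
    }
    // ...then promote matching tags to resource labels and drop them from the
    // metric labels.
    for _, name := range tagsAsResourceLabel {
        if v, ok := metricTags[name]; ok {
            resourceLabels[name] = v
            delete(metricTags, name)
        }
    }

    fmt.Println("resource labels:", resourceLabels) // job becomes inputs.mem
    fmt.Println("metric labels:  ", metricTags)     // job removed
}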

Can you please share your Metrics Explorer query? Are you sure you aren't grouping by job?

crflanigan commented 1 year ago

Sure!

Here's some additional information from both the Metrics Explorer and Grafana:

First, here's the Metrics Explorer:

image

Here's the output from the query inspector in Grafana; note the mem_active_gauge metric being populated with inputs.disk:

image image

Does that help?

Thanks!

powersj commented 1 year ago

That last image does show me that the memory metric does in fact have the job value from the disk input. How it got there, though, is not obvious.

When a user shows up with an issue in Grafana or the InfluxDB UI, we usually tell them that we need to reproduce it or demonstrate it with Telegraf. In your case, we have established, using the outputs.file plugin, that metrics are returned with the correct job tag.

With the stackdriver output, I have outlined how each metric gets resource labels, and we have printed out what they get set to, which looked correct. That said, let's talk about your query:

Going back to your PromQL query, you are using just the metric name. The Time series Selectors docs say that using a single metric name returns "all time series that have the... name". When you push your data via stackdriver it is sent as a time series, and a time series can have many metrics. Sure enough, if I search for just my disk value, I get all series that I pushed yesterday, both disk and memory:

image

Again, this is because I was pushing one time series with multiple metrics. Once I specify which job tag I want, it correctly returns a single value:

image

I do not see this same behavior with the builder:

image

It is not clear to me what to do here: is this a matter of being specific in your query? Has the way we have sent metrics for 5 years been wrong? Or does the use of PromQL require us to change how we push data?

crflanigan commented 1 year ago

Yeah, it's kind of a conundrum. I guess I would say that Google is also seeing this in their logs:


I'm seeing the following in the logs:

"[prometheus.googleapis.com/mem_available_gauge/gauge 
io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.processes
hostname

I'm also seeing
io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.mem
hostname

and 

io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.disk
hostname

So it looks like all 3 values are making it in... however I can't really reconstruct any time series through the logs so all I can really say is that over the last few days we have seen all 3 values.


What that says to me is that even if I were specific in the query, the data appears to be labeled differently.

I wish I could use Wireshark to inspect the packets, but all the data is encrypted. Is it easy to write what stackdriver plans on sending to a file instead of sending it?

powersj commented 1 year ago

I'm seeing the following in the logs:

These are all the resource labels attached to a certain metric? Do we have timestamps to ensure that these are all at the same time? And not just a one-off typo or misconfiguration that made it through?

I wish I could use Wireshark to inspect the packets, but all the data is encrypted. Is it easy to write what stackdriver plans on sending to a file instead of sending it?

I can put a print even closer to the request or try using a debugger to inspect what the request looks like.

What I would prefer is to see Google documentation that shows the method we are currently using should be different, or that the approach used is wrong. Looking at golang-samples it seems to line up with what we do, as does the go-metrics-stackdriver.

powersj commented 1 year ago

Well the debugger might be our savior here. Makes me wonder who is actually using this plugin besides you ;)

# created and now adding time series
metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"mem"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}
map[job:disk project_id:projects/[PROJECT]]

# created and now adding time series
metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}

# state of the time series right before we send:
[metric:{type:"test_mem_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:100}}]
[metric:{type:"test_disk_value/unknown"}  resource:{type:"global"  labels:{key:"job"  value:"disk"}  labels:{key:"project_id"  value:"projects/[PROJECT]"}}  metric_kind:GAUGE  points:{interval:{end_time:{seconds:1694386800}}  value:{double_value:42}}]
crflanigan commented 1 year ago

Makes me wonder who is actually using this plugin besides you ;)

ROFL! Me too!

Am I reading this right? It looks like the mem metric got the disk metric's label? If so, WOOOOHOOO! Is there anything I can do to help?

powersj commented 1 year ago

:facepalm: thou shall not shallow copy a map - I'll have a PR tomorrow
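For anyone following along, a minimal illustration of that kind of Go map aliasing (not the actual plugin code): assigning a map only copies the reference, so every time series sharing it sees the last write, and the fix is to give each series its own copy.

package main

import "fmt"

func copyLabels(src map[string]string) map[string]string {
    dst := make(map[string]string, len(src))
    for k, v := range src {
        dst[k] = v
    }
    return dst
}

func main() {
    base := map[string]string{"job": "Telegraf"}

    // Buggy pattern: both "series" share the same underlying map, so the
    // second write ("disk") also shows up on the first one.
    memLabels := base
    memLabels["job"] = "mem"
    diskLabels := base
    diskLabels["job"] = "disk"
    fmt.Println(memLabels["job"], diskLabels["job"]) // disk disk

    // Fix: copy the map per time series before mutating it.
    memLabels = copyLabels(base)
    memLabels["job"] = "mem"
    diskLabels = copyLabels(base)
    diskLabels["job"] = "disk"
    fmt.Println(memLabels["job"], diskLabels["job"]) // mem disk
}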

crflanigan commented 1 year ago

🤦 thou shall not shallow copy a map - I'll have a PR tomorrow

Can't wait to test it! Thanks for fixing this for us!

powersj commented 1 year ago

Alright, #14009 is ready for your testing ;)

crflanigan commented 1 year ago

Will do! I have a deployment today of Telegraf server, but will test after that. : )

Thanks again!

crflanigan commented 1 year ago

I lied, I couldn't wait, lol.

Initial testing looks good. I will have to do a deep dive of the metrics, but so far so good!

crflanigan commented 1 year ago

Testing performed:

I sent metrics in from two plugins, inputs.disk and inputs.mem, as shown below, with a global instance tag set to my computer name:

[inputs.disk]
  [inputs.disk.tags]
    job = "inputs.disk"

[inputs.mem]
  [inputs.mem.tags]
    job = "inputs.mem"

Then I stopped sending the job tag/label in inputs.mem, which started sending the default values as expected. The job names were set correctly, though I did see one thing that may not be a huge deal but is a bit odd: if the sending agent stops sending the job tag and then starts again, the first metric that is sent appears to be the default value (and in the case of the red time series it looks like the instance label also reverted briefly).

image

I then turned on inputs.processes to make the configuration more aggressive and noticed the same thing (and also noticed that the instance label was missing too):

image

It's almost like the first time it encounters the metric it doesn't inherit the tag as a resource label, but after that it works great. I also noticed that for that initial metric the tags named in tags_as_resource_label are dropped from the metric without being applied to the resource labels. I'm guessing the impact would be that if the Telegraf server restarts, the first metrics it receives will be labeled incorrectly; a single data point is pretty insignificant, but depending on your metric velocity it could look like a brief outage, so I thought I would call it out.

image

That said, otherwise this is working great!

crflanigan commented 1 year ago

I'm going to continue playing with this, but that's my initial take. Any idea why this might be happening?

Thanks!

powersj commented 1 year ago

if the sending agent stops sending the job tag and then starts again, the first metric that is sent appears to be the default value (and in the case of the red time series it looks like the instance label also reverted briefly).

When you say it stops sending the job tag, do you mean you edit the config to remove the additional tag and restart telegraf to pick up the new config?

In that case, the metric coming in would not have a job tag, so the job from your resource labels config would get used:

  [outputs.stackdriver.resource_labels]
    job = "Telegraf"

Am I following?

crflanigan commented 1 year ago

I'm still fleshing this out, but the initial metric sent will have the default values in the config:

  [outputs.stackdriver.resource_labels]
    job = "Telegraf"
    instance = "test"

The first data point has the instance and job tags dropped and the resource labels set to the default values. After that it works correctly. As for restarting without the tag and then adding it back, I have seen that the first data point can also have the default values again.

That's what I'm seeing so far, does that help?

Thanks!

powersj commented 1 year ago

the initial metric sent will have the default values in the config:

If your metric has a tag that matches the values in tags_as_resource_label then it should include those tags. There should not be a delay or a situation where metric 1 is handled differently than metric n+1.

I would add the [[outputs.file]] plugin again and see what metrics you think you are getting and verify that they do in fact have the tags

crflanigan commented 1 year ago

Strangely, I suspended testing, turned off the Telegraf Server and Client, ate lunch, came back, and when I turned them both on again it was working fine. That said, I never disabled sending the instance tag, but it was indeed missing, which is weird.

It seems to be working now though.

powersj commented 1 year ago

ate lunch, came back

heh the best type of debugging :)

crflanigan commented 1 year ago

Yeah, I think this is working. I might have just been confused or something; I'm not sure what happened with the instance label the first time, but I think this looks good. Do you think this will make it into the Oct 2nd release?

powersj commented 1 year ago

Do you think this will make it into the Oct 2nd release?

yep!