@powersj Created per your request.
FWIW I duplicated this behavior on a standard Ubuntu server as well, in addition to our Kubernetes Telegraf deployment.
Sent with no job tag on the metric (which defaults initially to whatever you have in the job field), then with the job tag (which changed to the job tag of the busiest metric), then with only one input having a tag (which creates a new time series with the different tag).
See #13912 artifacts in 20-30mins
Thanks!
Ok, I think I see the logs you're looking for:
Creating timeseries:
mem_active map[env:Production_MacOS host:hostname] map[gauge:1370566656] 1694536580000000000
metric labels: map[env:Production_MacOS host:hostname]
resource labels: map[cluster:onprem instance:hostname job:inputs.mem location:us-east1-b namespace:store_pc_metrics project_id:io1-sandbox]
This looks right, but when I look at the metric in Google I see the job resource label as inputs.disk:
That does look right from Telegraf. Is this something you could ask Google? Happy to work through this further, but I'm not sure what else to look at given the labels do look right.
I sure can! Is there any reason why this time series would have its label changed as it's being sent?
I wondered, but nothing catches my eye. Once we create the time series we add it to a bucket for each metric. Then we split the buckets into batches of 200 and send them, see here.
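Roughly, that splitting step looks like the sketch below (a minimal illustration of the idea only, not the plugin's actual code; the helper name and generic signature are mine):

// batchTimeSeries splits the collected time series into batches of at
// most batchSize entries, matching the "200 at a time" behavior
// described above, before each batch is sent in its own request.
func batchTimeSeries[T any](series []T, batchSize int) [][]T {
	var batches [][]T
	for start := 0; start < len(series); start += batchSize {
		end := start + batchSize
		if end > len(series) {
			end = len(series)
		}
		batches = append(batches, series[start:end])
	}
	return batches
}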
Let me know what you hear.
From Google:
The only things I can think of are either 1. Telegraf logging isn't correct, and it's sending the wrong label to us (maybe you can check by using Wireshark or something) or 2. it's sending a different metric name when you mutate job, so your query isn't retrieving the right data or 3. the new data is being rejected for some unrelated reason, causing it to not make it into our system at all. Kinda just stabs in the dark...
One other thing that I should emphasize is that once a tag comes through that matches the tags_as_resource_label setting (like job), even if I stop the Telegraf client and send without the job tag, the resource label still shows in Google with whatever it was last set to (job="inputs.disk").
even if I stop the Telegraf client and send without the job tag, the resource label still shows in Google with whatever it was last set to (job="inputs.disk").
Doesn't that eliminate telegraf as the cause?
Out of his options, I would second the Wireshark one if we really do think it is a Telegraf issue.
Doesn't that eliminate telegraf as the cause?
I mean, I'm not sure. The problem resolves if I restart the agent. It almost seems like, when sending, the label is set to whatever it was set to last.
Out of his options, I would second the Wireshark one if we really do think it is a Telegraf issue.
I agree, just not sure how to decrypt the traffic since this is sending to Google. Would be lovely to output the data stream to a file instead of to Google or something.
Seeing if I can get some logging from Google of what we're sending.
I pushed some additional logging right before we send the request. The only thing I would wonder is if something is getting mixed up during the request creation.
Hi @powersj,
Here's the response from Google:
I'm seeing the following in the logs:
prometheus.googleapis.com/mem_available_gauge/gauge io1-sandbox us-east1-b onprem store_pc_metrics inputs.processes HOSTNAME REDACTED
I'm also seeing io1-sandbox us-east1-b onprem store_pc_metrics inputs.mem HOSTNAME REDACTED
and
io1-sandbox us-east1-b onprem store_pc_metrics inputs.disk HOSTNAME REDACTED
So it looks like all 3 values are making it in... however I can't really reconstruct any time series through the logs so all I can really say is that over the last few days we have seen all 3 values.
It does look like you can see all 3 values in Metrics Explorer as well. inputs.processes and inputs.mem are only transiently seen, however - the most recent data is for inputs.disk:
I pushed some additional logging right before we send the request. The only thing I would wonder is if something is getting mixed up during the request creation.
Sure, I can test it!
Ok, I tested it.
There are a lot of new lines that look like this:
metric: type:"prometheus.googleapis.com/disk_used_percent_gauge/gauge" labels:{key:"device" value:"disk3s6"} labels:{key:"env" value:"Production_MacOS"} labels:{key:"fstype" value:"apfs"} labels:{key:"host" value:"hostname"} labels:{key:"mode" value:"rw"} labels:{key:"path" value:"/System/Volumes/VM"} points: [interval:{end_time:{seconds:1694630580}} value:{double_value:29.258641050723256}]
Is that what you wanted? Did you want me to email you the log file? If so, how long should I let the agent run?
@crflanigan my apologies for missing this
What I was going to ask next is for you to go through the logs and see if you can find any examples of a metric with the wrong label.
Feel free to email me the logs and I can look through them as well.
Hi @powersj,
When I looked at the logging I actually didn't see any of the tags represented (as shown above).
One thing to keep in mind, though, is that if I send two metrics with different job tags, whichever job tag is sent last becomes the job tag for both. Additionally, if I disable the job tag for one of the input plugins (comment it out) while the other one is still active, the metric without a job tag defined will get the job label of the other metric. On top of that, if you disable the job tag for both plugins, it still persists in sending the last defined tag (in this case inputs.disk). Example:
# Note that inputs.disk generates far more metrics than inputs.mem and will have more job tags
[inputs.disk]
[inputs.disk.tags]
job = "inputs.disk"
[inputs.mem]
[inputs.mem.tags]
job = "inputs.mem"
Note: The inputs.disk and inputs.mem values in the attached images are the values for the job resource label:
As you can see, there's the correct label for inputs.mem, because it worked correctly once, but then the more aggressive inputs.disk metric came along and took precedence.
After that I went ahead and disabled the job tag for inputs.mem:
[inputs.disk]
[inputs.disk.tags]
job = "inputs.disk"
[inputs.mem]
#[inputs.mem.tags]
#job = "inputs.mem"
And even still the job label was applied to the inputs.mem metric:
Then I went ahead and removed the job tag from both inputs, and still the job label persists. Here's the metric from outputs.file before it's sent to the Telegraf proxy (outputs.stackdriver) server (note there is no job tag):
disk,device=disk3s5,env=Production_MacOS,fstype=apfs,host=host,instance=host,mode=rw,path=/System/Volumes/Data total=494384795648i,free=358247702528i,used=136137093120i,used_percent=27.536666644766534,inodes_total=3500345821i,inodes_free=3498512720i,inodes_used=1833101i 1695658020000000000
mem,env=Production_MacOS,host=host,instance=host available=827138048i,used=16352731136i,active=822763520i,wired=989003776i,total=17179869184i,used_percent=95.18542289733887,available_percent=4.814577102661133,free=7979008i,inactive=819159040i 1695658020000000000
But the inputs.disk job label is still persisting.
Does that help?
Sent you the logs via email.
Thanks again!
Does that help?
Unfortunately, I am not following. Let's look back at the code and make sure my understanding is correct:
Resource labels are generated for each metric; however, there is another place where they are set initially, via the resource_labels config option. You have the following:
[outputs.stackdriver.resource_labels]
cluster = "{{.CLUSTER_NAME}}"
job = "Telegraf"
instance = "{{.CLUSTER_NAME}}"
location = "{{.LOCATION}}"
namespace = "{{.NAMESPACE_LABEL}}"
This means that all metrics should get the above resource labels no matter what. You also have the following tags_as_resource_label set:
tags_as_resource_label = ["instance", "job"]
This means that if we come across a metric with a tag called instance or job, we will set those resource labels as well and remove them from the metric labels.
There is no sharing between metrics beyond what s.ResourceLabels is set to in your config. Once created, the time series is not modified in any way by Telegraf either.
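To make that concrete, here is a rough sketch of the intended per-metric handling (plain maps and illustrative names rather than the plugin's actual types, so treat it as an approximation):

// buildResourceLabels derives the resource labels for one metric from
// the configured resource_labels (base) plus any metric tags listed in
// tags_as_resource_label.
func buildResourceLabels(base map[string]string, tagsAsResourceLabel []string, metricTags map[string]string) map[string]string {
	// Copy the base map so one metric's promoted tags can never leak
	// into another metric's resource labels.
	labels := make(map[string]string, len(base))
	for k, v := range base {
		labels[k] = v
	}
	for _, name := range tagsAsResourceLabel {
		if v, ok := metricTags[name]; ok {
			labels[name] = v         // promote the tag to a resource label
			delete(metricTags, name) // and remove it from the metric labels
		}
	}
	return labels
}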
Can you please share your Metrics Explorer query? Are you sure you aren't grouping by job?
Sure!
Here's some additional information from both the Metrics Explorer and Grafana:
First, here's the Metrics Explorer:
Here's the output from the query inspector in Grafana; note the mem_active_gauge metric being populated with inputs.disk:
Does that help?
Thanks!
That last image does show me that the memory metric does in fact have the job value from the disk input. How it got there, though, is not obvious.
When a user shows up with an issue in Grafana or the InfluxDB UI, we usually tell them that we need to reproduce it or demonstrate it with Telegraf. In your case, we have established that when using the outputs.file plugin, metrics are returned with the correct job tag.
With the stackdriver output, I have outlined how each metric gets resource labels, and we have printed out what they get set to, which looks correct. That said, let's talk about your query:
Going back to your PromQL query, you are using just the metric name. The Time series Selectors docs say that using a single metric name returns "all time series that have the... name". When you are pushing your data via stackdriver it is sent as a time series, and a time series can have many metrics. Sure enough, if I search for just my disk value, I get all series that I pushed yesterday, both disk and memory:
Again this is because I was pushing one time series with multiple metrics. Once I specify which job tag I want, it correctly returns a single value:
I do not see this same behavior with the builder:
It is not clear to me what to do here: is this a matter of being specific in your query? Or has the way we have sent metrics for 5 years been wrong? Or does the use of PromQL require us to change how we push data?
Yeah, it's kind of a conundrum. I guess I would say that Google is also seeing this in their logs:
I'm seeing the following in the logs:
"[prometheus.googleapis.com/mem_available_gauge/gauge
io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.processes
hostname
I'm also seeing
io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.mem
hostname
and
io1-sandbox
us-east1-b
onprem
store_pc_metrics
inputs.disk
hostname
So it looks like all 3 values are making it in... however I can't really reconstruct any time series through the logs so all I can really say is that over the last few days we have seen all 3 values.
What that says to me is that even if I was specific in the query, the data appears to be labeled differently.
I wish I could use Wireshark to inspect the packets, but all the data is encrypted. Is it easy to write what stackdriver plans on sending to file instead of sending?
I'm seeing the following in the logs:
These are all the resource labels attached to a certain metric? Do we have timestamps to ensure that these are all at the same time? And not just a one-off typo or misconfiguration that made it through?
I wish I could use Wireshark to inspect the packets, but all the data is encrypted. Is it easy to write what stackdriver plans on sending to file instead of sending?
I can put a print even closer to the request or try using a debugger to inspect what the request looks like.
What I would prefer is to see Google documentation showing that the method we are currently using should be different or that the approach is wrong. Looking at golang-samples it seems to line up with what we do, as does go-metrics-stackdriver.
Well the debugger might be our savior here. Makes me wonder who is actually using this plugin besides you ;)
# created and now adding time series
metric:{type:"test_mem_value/unknown"} resource:{type:"global" labels:{key:"job" value:"mem"} labels:{key:"project_id" value:"projects/[PROJECT]"}} metric_kind:GAUGE points:{interval:{end_time:{seconds:1694386800}} value:{double_value:100}}
map[job:disk project_id:projects/[PROJECT]]
# created and now adding time series
metric:{type:"test_disk_value/unknown"} resource:{type:"global" labels:{key:"job" value:"disk"} labels:{key:"project_id" value:"projects/[PROJECT]"}} metric_kind:GAUGE points:{interval:{end_time:{seconds:1694386800}} value:{double_value:42}}
# state of the time series right before we send:
[metric:{type:"test_mem_value/unknown"} resource:{type:"global" labels:{key:"job" value:"disk"} labels:{key:"project_id" value:"projects/[PROJECT]"}} metric_kind:GAUGE points:{interval:{end_time:{seconds:1694386800}} value:{double_value:100}}]
[metric:{type:"test_disk_value/unknown"} resource:{type:"global" labels:{key:"job" value:"disk"} labels:{key:"project_id" value:"projects/[PROJECT]"}} metric_kind:GAUGE points:{interval:{end_time:{seconds:1694386800}} value:{double_value:42}}]
Makes me wonder who is actually using this plugin besides you ;)
ROFL! Me too!
Am I reading this right? It looks like the mem metric got the disk metric's label? If so, WOOOOHOOO! Is there anything I can do to help?
:facepalm: thou shall not shallow copy a map - I'll have a PR tomorrow
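For illustration only (a standalone toy program, not the plugin's code), reusing one labels map for every time series reproduces exactly the symptom in the output above:

package main

import "fmt"

func main() {
	// Both "time series" share the same underlying map, so setting the
	// job for the disk metric silently rewrites the mem metric's too.
	shared := map[string]string{"project_id": "projects/[PROJECT]"}

	memLabels := shared // shallow copy: a new variable, but the same map
	memLabels["job"] = "mem"

	diskLabels := shared // still the same map
	diskLabels["job"] = "disk"

	fmt.Println(memLabels["job"], diskLabels["job"]) // prints: disk disk
}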
🤦 thou shall not shallow copy a map - I'll have a PR tomorrow
Can't wait to test it! Thanks for fixing this for us!
Alright, #14009 is ready for your testing ;)
Will do! I have a Telegraf server deployment today, but will test after that. :)
Thanks again!
I lied, I couldn't wait, lol.
Initial testing looks good. I will have to do a deep dive of the metrics, but so far so good!
Testing performed:
Sent metrics in from two plugins, inputs.disk and inputs.mem, as shown below, with a global instance tag set to my computer name:
[inputs.disk]
[inputs.disk.tags]
job = "inputs.disk"
[inputs.mem]
[inputs.mem.tags]
job = "inputs.mem"
Then I stopped sending the job tag/label in inputs.mem, which started sending the default values as expected. The job names were set correctly, though I did see one thing that may not be a huge deal but is a bit odd. If the sending agent stops sending the job tag and then starts again, the first metric that is sent appears to use the default value (and in the case of the red time series it looks like the instance label also reverted briefly).
I then turned on inputs.processes to make the configuration more aggressive and noticed the same thing (also noticed that the instance was missing too):
It's almost like the first time it encounters the metric it doesn't inherit the tag as a resource label, but after that it works great. I'm guessing the impact here would be that if the Telegraf server restarts, the first metrics it receives will be wrong; not really a big deal, as a single data point is pretty insignificant, but I thought I would call it out. Another thing I noticed was that for that initial metric it also drops the tags associated with the tags_as_resource_label option without applying them to the resource labels. I guess the impact would be that if the Telegraf server were to be restarted, the first metrics received would be labeled incorrectly, which depending on your metric velocity could indicate an outage.
That said, otherwise this is working great!
I'm going to continue playing with this, but that's my initial take. Any idea why this might be happening?
Thanks!
If the sending agent stops sending the job tag and then starts again, the first metric that is sent appears to use the default value (and in the case of the red time series it looks like the instance label also reverted briefly).
When you say it stops sending the job tag, you mean you edit the config to remove the additional tag and restart Telegraf to pick up the new config?
In that case, the metric coming in would not have a job tag, so the job from your resource labels config would get used:
[outputs.stackdriver.resource_labels]
job = "Telegraf"
Am I following?
I'm still fleshing this out, but the initial metric sent will have the default values in the config:
[outputs.stackdriver.resource_labels]
job = "Telegraf"
instance = "test"
The first data point has the instance and job tags dropped and the resource labels are the default values. Then it works correctly. As for restarting without the tag and then adding it back, I have seen that the first data point can also have the default values again.
That's what I'm seeing so far, does that help?
Thanks!
the initial metric sent will have the default values in the config:
If your metric has a tag that matches the values in tags_as_resource_label then it should include those tags. There should not be a delay or a situation where metric 1 is different than metric n+1.
I would add the [[outputs.file]] plugin again, see what metrics you think you are getting, and verify that they do in fact have the tags.
Strangely, I suspended testing, turned off the Telegraf server and client, ate lunch, came back, and when I turned them both on again it was working fine. That said, I never disabled sending the instance tag, but it was indeed missing; weird.
It seems to be working now though.
ate lunch, came back
heh the best type of debugging :)
Yeah, I think this is working. I might have just been confused or something, not sure what happened with the instance the first time, but I think this looks good. Do you think this will make it into the Oct 2nd release?
Do you think this will make it into the Oct 2nd release?
yep!
Relevant telegraf.conf
Logs from Telegraf
The logs are unremarkable, but when outputting what the agent is sending to outputs.stackdriver using outputs.file I see that the tags are set correctly.
Example: The memory metric has job=inputs.mem and the processes metric has job=inputs.processes.
But when I look at the metrics in Google or Grafana I notice that the metrics are sometimes tagged with another input's tag, e.g. job=inputs.mem but for a processes metric. From what I can tell it's usually the busiest metric. Additionally, it appears that the last metric's tag that is sent is the one used for the entire batch of metrics, including metrics that don't contain a job tag. Meaning that the default value isn't applied; instead, whatever the last value for the tag was is used.
Steps to reproduce
Set tags_as_resource_label for the job tag.
Expected behavior
Each metric will have its job tag (or any other applicable tag) applied as a resource label.
Actual behavior
The tag appears to change depending on which tag was sent last. As you can see in the screenshot, the job tag changed between inputs.mem and inputs.processes, even though the sending agent didn't change. Here's an example of it applying a mismatched job label on two different inputs:
What also seems weird is that you can stop sending a tag altogether and the last sent tag is what continues to be sent instead of the default value:
Telegraf Client
outputs.file output showing no job tag:
Additional info
No response