influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

inputs.aliyuncms: multiple dimensions don't work #10848

Closed ri0day closed 2 years ago

ri0day commented 2 years ago

Relevant telegraf.conf

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.opentsdb]]
  host = "http://127.0.0.1"
  port = 19000
  http_batch_size = 50
  http_path = "/opentsdb/put"
  debug = false
  separator = "_"

[[inputs.aliyuncms]]
  regions = ["cn-hangzhou"]
  period = "5m"
  delay = "1m"
  interval = "5m"
  project = "acs_rds_dashboard"
  ratelimit = 200
  [[inputs.aliyuncms.metrics]]
    names = ["ConnectionUsage", "CpuUsage","DiskUsage","IOPSUsage","MemoryUsage"]
    dimensions = '[{"instanceId": "rm-bp1zureya4l415lus"},{"instanceId": "rm-bp161628eanrceqz2"}]'

Logs from Telegraf

[root@d5cs-jobs telegraf]# ./telegraf --config ./telegraf-aliyun.conf --test --debug
2022-03-18T04:29:18Z I! Starting Telegraf 1.21.4
2022-03-18T04:29:18Z I! Loaded inputs: aliyuncms
2022-03-18T04:29:18Z I! Loaded aggregators:
2022-03-18T04:29:18Z I! Loaded processors:
2022-03-18T04:29:18Z W! Outputs are not used in testing mode!
2022-03-18T04:29:18Z I! Tags enabled: host=d5cs-jobs
2022-03-18T04:29:18Z D! [agent] Initializing plugins
2022-03-18T04:29:18Z E! [telegraf] Error running agent: could not initialize input inputs.aliyuncms: Can't parse dimensions (it is neither obj, nor array) "[{\"instanceId\": \"rm-bp1zureya4l415lus\"},{\"instanceId\": \"rm-bp161628eanrceqz2\"}]" :

System info

Telegraf 1.21.4

Docker

No response

Steps to reproduce

  1. Run the test command: ./telegraf --config ./telegraf-aliyun.conf --test --debug

...

Expected behavior

According to the plugin example, we can pass a quoted array to the dimensions setting, like dimensions = '[{"instanceId": "p-example"},{"instanceId": "q-example"}]'

Actual behavior

Telegraf fails to start because the dimensions string cannot be parsed (see the error in the logs above).

Additional info

dimensions = '{"instanceId": "p-example"}' (a single dimension) works properly.

dimensions = [{"instanceId": "rm-bp1zureya4l415lus"},{"instanceId": "rm-bp161628eanrceqz2"}] (an unquoted array) also fails.

powersj commented 2 years ago

Hi!

Good catch! Apparently, no one had ever tried to use an array, and because there was no test for arrays, the logic issue was never caught. Check out https://github.com/influxdata/telegraf/pull/10850, where I put up a fix with tests.
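
For readers following along, here is a minimal Go sketch of one way to accept both shapes of the dimensions setting (a single JSON object or an array of objects). It is an illustration only, not the code from the PR; the function name parseDimensions is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseDimensions (hypothetical helper) accepts either a single JSON object
// or a JSON array of objects and always returns a slice, so callers only
// have to handle one shape.
func parseDimensions(raw string) ([]map[string]string, error) {
	// Try the array form first: '[{"instanceId": "a"},{"instanceId": "b"}]'.
	var list []map[string]string
	if err := json.Unmarshal([]byte(raw), &list); err == nil {
		return list, nil
	}
	// Fall back to the single-object form: '{"instanceId": "a"}'.
	var single map[string]string
	if err := json.Unmarshal([]byte(raw), &single); err != nil {
		return nil, fmt.Errorf("dimensions is neither a JSON object nor an array: %w", err)
	}
	return []map[string]string{single}, nil
}

func main() {
	inputs := []string{
		`{"instanceId": "p-example"}`,
		`[{"instanceId": "p-example"},{"instanceId": "q-example"}]`,
	}
	for _, raw := range inputs {
		dims, err := parseDimensions(raw)
		fmt.Println(len(dims), dims, err)
	}
}
```

Normalizing to a slice up front means a test for the array case and a test for the object case exercise the same downstream code path.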

ri0day commented 2 years ago

Thanks for the quick fix. However, I found that a single dimension also doesn't work as expected: I set just one RDS instanceId in dimensions, but got metrics for all instances. I expected to get the ["ConnectionUsage", "CpuUsage","DiskUsage","IOPSUsage","MemoryUsage"] metrics only from instance rm-bp1zureya4l415lus.

[root@d5cs-jobs telegraf]# tail telegraf-aliyun.conf 
[[inputs.aliyuncms]]
  regions = ["cn-hangzhou"]
  period = "5m"
  delay = "1m"
  interval = "5m"
  project = "acs_rds_dashboard"
  ratelimit = 200
  [[inputs.aliyuncms.metrics]]
    names = ["ConnectionUsage", "CpuUsage","DiskUsage","IOPSUsage","MemoryUsage"]
    dimensions = '{"instanceId":"rm-bp1zureya4l415lus"}'

[root@d5cs-jobs telegraf]# ./telegraf --config telegraf-aliyun.conf --test --debug
2022-03-18T15:20:44Z I! Starting Telegraf 1.21.4
2022-03-18T15:20:44Z I! Loaded inputs: aliyuncms
2022-03-18T15:20:44Z I! Loaded aggregators:
2022-03-18T15:20:44Z I! Loaded processors:
2022-03-18T15:20:44Z W! Outputs are not used in testing mode!
2022-03-18T15:20:44Z I! Tags enabled: host=d5cs-jobs
2022-03-18T15:20:44Z D! [agent] Initializing plugins
2022-03-18T15:20:44Z E! [inputs.aliyuncms] Discovery tool is not activated: Didn't find root key "DBInstances" in discovery response
2022-03-18T15:20:44Z D! [agent] Starting service inputs
aliyuncms_acs_rds_dashboard,host=d5cs-jobs,instanceId=rm-bp16j0l01ze9916oz,userId=1692386295190525 connection_usage_average=0.585,connection_usage_maximum=0.588,connection_usage_minimum=0.583 1647616500000000000
aliyuncms_acs_rds_dashboard,host=d5cs-jobs,instanceId=rm-bp15ha3043v8030wj,userId=1692386295190525 connection_usage_average=7.523,connection_usage_maximum=7.524,connection_usage_minimum=7.524 1647616500000000000
aliyuncms_acs_rds_dashboard,host=d5cs-jobs,instanceId=rm-bp1q665hm9edarj7f,userId=1692386295190525 connection_usage_average=0.51,connection_usage_maximum=0.537,connection_usage_minimum=0.5 1647616500000000000
aliyuncms_acs_rds_dashboard,host=d5cs-jobs,instanceId=rm-bp14c70a2wh68v059,userId=1692386295190525 connection_usage_average=5.078,connection_usage_maximum=5.199,connection_usage_minimum=5.038 1647616500000000000
aliyuncms_acs_rds_dashboard,host=d5cs-jobs,instanceId=rm-bp1jmt70a9tfpyofw,userId=1692386295190525 connection_usage_average=0.625,connection_usage_maximum=0.625,connection_usage_minimum=0.625 1647616500000000000
......

ri0day commented 2 years ago

I tested with your fixed build and can confirm that the plugin's dimensions setting now accepts a string-quoted array. However, the dimensions still don't seem to work as expected, because the plugin still fetches metrics for all instances.

powersj commented 2 years ago

@ri0day,

Thanks for trying the fix and confirming it works!

the dimensions still don't seem to work as expected, because the plugin still fetches metrics for all instances.

Hmm, I looked around and only saw reference in this bug to how dimensions might get silently ignored. I have pushed another change that will print out the request and the dimensions variable to see how it is formatted. Once those artifacts build, can you provide that output?

Thanks!

powersj commented 2 years ago

For my own reference: I found the English docs, and it looks like the dimensions should be a JSON string:

{\"userId\":\"120886317861****\",\"region\":\"cn-huhehaote\",\"queue\":\"test-0128\"}

Getting that debug output would be good to confirm that we are actually sending the right type of data.

ri0day commented 2 years ago

Hi @powersj, I just tried your latest fix build (telegraf-1.22.0~06899624_linux_amd64.tar.gz). The dimensions output prints a memory address:

[root@d5cs-jobs bin]# ./telegraf --config /opt/telegraf/telegraf-aliyun.conf --test-wait 10 --debug
2022-03-19T01:14:11Z I! Starting Telegraf 1.22.0-06899624
2022-03-19T01:14:11Z I! Loaded inputs: aliyuncms
2022-03-19T01:14:11Z I! Loaded aggregators: 
2022-03-19T01:14:11Z I! Loaded processors: 
2022-03-19T01:14:11Z W! Outputs are not used in testing mode!
2022-03-19T01:14:11Z I! Tags enabled: host=d5cs-jobs
2022-03-19T01:14:11Z D! [agent] Initializing plugins
2022-03-19T01:14:12Z E! [inputs.aliyuncms] Discovery tool is not activated: Didn't find root key "DBInstances" in discovery response
2022-03-19T01:14:12Z D! [agent] Starting service inputs
Making the following request:
Making the following request:
Making the following request:
&{0xc00058c058  1647652092000  CpuUsage 300 10000 1647652392000 acs_rds_dashboard }
Request Dimensions:

&{0xc00048e3b0  1647652092000  IOPSUsage 300 10000 1647652392000 acs_rds_dashboard }
&{0xc00059c270  1647652092000  ConnectionUsage 300 10000 1647652392000 acs_rds_dashboard }
Request Dimensions:

Making the following request:
Making the following request:
&{0xc000011278  1647652092000  DiskUsage 300 10000 1647652392000 acs_rds_dashboard }
Request Dimensions:

Request Dimensions:

&{0xc00048e3b8  1647652092000  MemoryUsage 300 10000 1647652392000 acs_rds_dashboard }
Request Dimensions:
......
2022-03-19T01:14:22Z D! [agent] Stopping service inputs
2022-03-19T01:14:22Z D! [agent] Input channel closed
2022-03-19T01:14:22Z D! [agent] Stopped Successfully
2022-03-19T01:14:22Z E! [telegraf] Error running agent: input plugins recorded 1 errors

powersj commented 2 years ago

I had the PR print 2 things, first the request object, which is the memory-like results you see. The second thing was the request dimensions string itself:

Request Dimensions:

Request Dimensions:

This shows that no dimensions are specified in the request. Looking at the function and your logs more I noticed this message:

2022-03-19T01:14:12Z E! [inputs.aliyuncms] Discovery tool is not activated: Didn't find root key "DBInstances" in discovery response

When the discovery tool is not active, this sets s.dt to nil. When it is nil the dimensions will not be configured. Is this a key you are specifying?
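
To make the failure mode concrete, here is a hedged Go sketch of a fallback: when discovery is inactive, the request would still carry the dimensions from the config instead of an empty string. This is not the plugin's actual logic; requestDimensions and its parameters are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requestDimensions (hypothetical) returns the JSON string to send as the
// request's dimensions. When discovery is inactive it falls back to the
// dimensions the user configured, rather than sending nothing.
func requestDimensions(discoveryActive bool, discovered, configured []map[string]string) (string, error) {
	dims := configured
	if discoveryActive {
		dims = discovered
	}
	if len(dims) == 0 {
		// An empty string means "no filter", so every instance is returned.
		return "", nil
	}
	b, err := json.Marshal(dims)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	configured := []map[string]string{{"instanceId": "rm-bp1zureya4l415lus"}}
	dims, err := requestDimensions(false, nil, configured)
	fmt.Println(dims, err)
}
```

An empty "Request Dimensions:" line in the debug output corresponds to the len(dims) == 0 branch here, which is why metrics for every instance come back.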

ri0day commented 2 years ago

This is the config I used to test; there seems to be no discovery-related config here:

[[inputs.aliyuncms]]
  regions = ["cn-hangzhou"]
  period = "5m"
  delay = "1m"
  interval = "5m"
  project = "acs_rds_dashboard"
  ratelimit = 200
  [[inputs.aliyuncms.metrics]]
    names = ["ConnectionUsage", "CpuUsage","DiskUsage","IOPSUsage","MemoryUsage"]
    dimensions = '[{"instanceId":"rm-bp1zureya4l415lus"},{"instanceId":"rm-bp13g5b435ex60of3"}]'

ri0day commented 2 years ago

Regarding the discovery error:

2022-03-19T01:14:12Z E! [inputs.aliyuncms] Discovery tool is not activated: Didn't find root key "DBInstances" in discovery response

Maybe the plugin is expecting DBInstances in the response, but the Aliyun cloud monitor API response actually looks like the one below. Maybe the responseRootKey in discovery should be DBInstance instead of DBInstances:

{
  "TotalRecordCount": 1,
  "PageRecordCount": 1,
  "RequestId": "8EED1083-3902-557A-9AF4-822BE5C9AF14",
  "NextToken": "o7PORW53prZg8NUW9EJ7Yw",
  "PageNumber": 1,
  "Items": {
    "DBInstance": [
      {
        "ResourceGroupId": "rg-acfm2jr35xnjh7i",
        "DBInstanceNetType": "Intranet",
        "DBInstanceType": "Primary",
        "MutriORsignle": false,
        "InstanceNetworkType": "VPC",
        "DBInstanceId": "rm-bp1075l0623jbo084",
        "ReadOnlyDBInstanceIds": {
          "ReadOnlyDBInstanceId": []
        },
        "DBInstanceDescription": "CMBG-live-DB6",
        "Engine": "MySQL",
        "EngineVersion": "5.7",
        "ZoneId": "cn-hangzhou-i",
        "DBInstanceStatus": "Running",
        "DBInstanceClass": "mysql.n2.large.2c",
        "CreateTime": "2022-03-11T01:52:20Z",
        "VSwitchId": "vsw-bp15efgph9rd6rl7xqgm5",
        "TipsLevel": 0,
        "PayType": "Prepaid",
        "LockMode": "Unlock",
        "DeletionProtection": false,
        "DBInstanceStorageType": "cloud_essd",
        "InsId": 1,
        "VpcId": "vpc-232m6l510",
        "ConnectionMode": "Standard",
        "VpcCloudInstanceId": "rm-bp1075l0623jbo084-202203110952",
        "RegionId": "cn-hangzhou",
        "ConnectionString": "rm-bp1075l0623jbo084.mysql.rds.aliyuncs.com",
        "ExpireTime": "2025-03-11T16:00:00Z"
      }
    ]
  }
}

powersj commented 2 years ago

Thanks for that! You are right, it does look like a different root key than, say, the acs_ecs_dashboard:

{
  "Instances": {
    "Instance": [
      {

versus what you see with the acs_rds_dashboard:

{
  "Items": {
    "DBInstance": [
      {

I pushed another couple of commits to update the SDK and try to look for "Items" as a root key as well. Can you give that a shot?

ri0day commented 2 years ago

Hi @powersj, I just tried your latest build (telegraf-1.22.0~805d150b_linux_amd64.tar.gz). The discovery function seems to run forever; check the output screenshot.

Output GIF (about 50 MB). Can we schedule a pair programming session to debug this?

powersj commented 2 years ago

hmm, thanks for the gif!

I am really not sure what direction to go with this. I am still confused why the response does not have the DBInstances expected value in the first place either. That expected value comes from aliyun's own SDK. As such I am wondering if it is worth filing a bug with them to see if the response format changed and if the SDK needs an update?

ri0day commented 2 years ago

Different services have different API response root keys: the load balancer root key is LoadBalancers, the RDS root key is Items, and the ECS root key is Instances. You can use https://next.api.alibabacloud.com/home to explore the API without writing code.

For example, CreateDescribeLoadBalancersRequest() returns:

{
  .....
  "LoadBalancers": {
    "LoadBalancer": [ here comes objects, one per every instance]
  }
}

But CreateDescribeDBInstancesRequest() returns:

{
......
  "Items": {
    "DBInstance": [

So maybe we cannot use this pattern to derive the root key:

parseRootKey          = regexp.MustCompile(`Describe(.*)`)

Instead, we can define the root key here:

switch project {
case "acs_ecs_dashboard":
    dscReq[region] = ecs.CreateDescribeInstancesRequest()
    responseObjectIDKey = "InstanceId"
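
As a rough illustration of the per-project idea, the Go sketch below keeps the root key and the nested list key in a lookup table and uses them to pull the instance list out of a response. The type responseKeys, the table keysByProject, and the function instances are hypothetical names, not the plugin's own; the key pairs come from the responses quoted in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// responseKeys (hypothetical) names where a discovery response keeps its
// instance list.
type responseKeys struct {
	root string // top-level key, e.g. "Items"
	list string // nested key holding the array, e.g. "DBInstance"
}

// Key pairs taken from the ECS and RDS responses quoted in this thread.
var keysByProject = map[string]responseKeys{
	"acs_ecs_dashboard": {root: "Instances", list: "Instance"},
	"acs_rds_dashboard": {root: "Items", list: "DBInstance"},
}

// instances returns the raw instance objects from a project's response.
func instances(project string, body []byte) ([]json.RawMessage, error) {
	keys, ok := keysByProject[project]
	if !ok {
		return nil, fmt.Errorf("no response keys known for project %q", project)
	}
	var resp map[string]json.RawMessage
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	rootRaw, ok := resp[keys.root]
	if !ok {
		return nil, fmt.Errorf("didn't find root key %q in discovery response", keys.root)
	}
	var root map[string]json.RawMessage
	if err := json.Unmarshal(rootRaw, &root); err != nil {
		return nil, err
	}
	var list []json.RawMessage
	if err := json.Unmarshal(root[keys.list], &list); err != nil {
		return nil, fmt.Errorf("didn't find list key %q under %q: %w", keys.list, keys.root, err)
	}
	return list, nil
}

func main() {
	body := []byte(`{"PageNumber":1,"Items":{"DBInstance":[{"DBInstanceId":"rm-bp1075l0623jbo084"}]}}`)
	list, err := instances("acs_rds_dashboard", body)
	fmt.Println(len(list), err)
}
```

A table like this avoids deriving the root key from the request name with a regular expression, at the cost of one entry per supported project.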

powersj commented 2 years ago

Hi,

You are using the acs_rds_dashboard, right? In that case, I updated that case with Items:

case "acs_rds_dashboard":
    dscReq[region] = rds.CreateDescribeDBInstancesRequest()
    responseObjectIDKey = "Items"

ri0day commented 2 years ago

Hi @powersj, I tried your latest build (telegraf-1.23.0~90a862ee_linux_amd64.tar.gz), and it still outputs metrics for all instances. But I found that the Aliyun CMS API for RDS is totally different from other services.

For example, the RDS response looks like this:

{"TotalRecordCount":116,"PageRecordCount":100,"RequestId":"AF7BFA71-652D-5982-BEFE-B2B0365F68A2","NextToken":"o7PORW5vm_Zg8NUW9EJ7Yw","PageNumber":1,"Items":{"DBInstance":[xxxxx]}}

But in our plugin, we parse it with different keys:

case "TotalCount":
    pdResp.totalCount = int(val.(float64))
case "PageSize":
    pdResp.pageSize = int(val.(float64))
case "PageNumber":
    pdResp.pageNumber = int(val.(float64))
}

To get RDS discovery working, we need code like this:

case "TotalRecordCount":
    pdResp.totalCount = int(val.(float64))
case "PageRecordCount":
    pdResp.pageSize = int(val.(float64))
case "PageNumber":
    pdResp.pageNumber = int(val.(float64))
}

Another difference: to get instance data from the Aliyun CMS response, the RDS service's instance data needs to be retrieved from $Response.Items.DBInstance, while other services' instance data is under $Response.$ServiceName+"s". For example: load balancer service --> $Response.LoadBalancers.LoadBalancer; ECS service --> $Response.Instances.Instance.
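
A small Go sketch of the paging part, assuming we simply accept both spellings of the counters in one switch. The names paging and parsePaging are hypothetical, and treating PageRecordCount as the page size follows the proposal above rather than any documented guarantee.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// paging (hypothetical) holds the counters needed to walk a paginated
// discovery response.
type paging struct {
	totalCount, pageSize, pageNumber int
}

// parsePaging reads the paging counters, accepting both spellings seen in
// this thread: TotalCount/PageSize and the RDS-style
// TotalRecordCount/PageRecordCount.
func parsePaging(body []byte) (paging, error) {
	var fields map[string]interface{}
	if err := json.Unmarshal(body, &fields); err != nil {
		return paging{}, err
	}
	var p paging
	for key, val := range fields {
		// encoding/json decodes JSON numbers into float64 by default;
		// skip anything that is not a number instead of panicking.
		num, ok := val.(float64)
		if !ok {
			continue
		}
		switch key {
		case "TotalCount", "TotalRecordCount":
			p.totalCount = int(num)
		case "PageSize", "PageRecordCount":
			p.pageSize = int(num)
		case "PageNumber":
			p.pageNumber = int(num)
		}
	}
	return p, nil
}

func main() {
	body := []byte(`{"TotalRecordCount":116,"PageRecordCount":100,"PageNumber":1,"Items":{"DBInstance":[]}}`)
	p, err := parsePaging(body)
	fmt.Printf("%+v %v\n", p, err)
}
```

Using the checked type assertion means a response where one of these keys is unexpectedly a string is ignored rather than crashing the plugin.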

ri0day commented 2 years ago

I managed to get barely working code just for acs_rds_dashboard; you can check out aliyuncms.go and discovery.go.

powersj commented 2 years ago

@ri0day - huge thank you for diving in on this and working it out. I have updated the PR with your changes, with a few modifications to ensure that the previous behavior works as well. Could you give the PR a try?

Thanks!

ri0day commented 2 years ago

@powersj I tested your latest build and can confirm that RDS and ECS are working as expected. Thank you!