influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Compatibility issue for Telegraf with Istio 1.21+ #15420

Closed: wanlonghenry closed this issue 2 months ago

wanlonghenry commented 3 months ago

Relevant telegraf.conf

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)

# Global tags can be specified here in key="value" format.
[global_tags]
  #Below are entirely used for telemetry
  #AgentVersion = "$AGENT_VERSION"
  #AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
  #ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
  #Region = "$TELEMETRY_AKS_REGION"
  #ClusterName = "$TELEMETRY_CLUSTER_NAME"
  #ClusterType = "$TELEMETRY_CLUSTER_TYPE"
  #Computer = "placeholder_hostname"
  #ControllerType = "$CONTROLLER_TYPE"

  hostName = "placeholder_hostname"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "15s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = true
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""
  ## Override default hostname, if empty use os.Hostname()
  #hostname = "placeholder_hostname"
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Generic socket writer capable of handling multiple socket types.
[[outputs.socket_writer]]
  ## URL to connect to
  address = "tcp://0.0.0.0:25226"
  # address = "tcp://example.com:http"
  # address = "tcp4://127.0.0.1:8094"
  # address = "tcp6://127.0.0.1:8094"
  # address = "tcp6://[2001:db8::1]:8094"
  # address = "udp://127.0.0.1:8094"
  # address = "udp4://127.0.0.1:8094"
  # address = "udp6://127.0.0.1:8094"
  # address = "unix:///tmp/telegraf.sock"
  # address = "unixgram:///tmp/telegraf.sock"

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Period between keep alive probes.
  ## Only applies to TCP sockets.
  ## 0 disables keep alive probes.
  ## Defaults to the OS configuration.
  # keep_alive_period = "5m"

  ## Data format to generate.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "json"
  namedrop = ["agent_telemetry", "file"]
  #tagdrop = ["AgentVersion","AKS_RESOURCE_ID", "ACS_RESOURCE_NAME", "Region","ClusterName","ClusterType", "Computer", "ControllerType"]

# Output to send MDM metrics to fluent bit and then route it to fluentD
[[outputs.socket_writer]]
  ## URL to connect to
  address = "tcp://0.0.0.0:25228"
  # address = "tcp://example.com:http"
  # address = "tcp4://127.0.0.1:8094"
  # address = "tcp6://127.0.0.1:8094"
  # address = "tcp6://[2001:db8::1]:8094"
  # address = "udp://127.0.0.1:8094"
  # address = "udp4://127.0.0.1:8094"
  # address = "udp6://127.0.0.1:8094"
  # address = "unix:///tmp/telegraf.sock"
  # address = "unixgram:///tmp/telegraf.sock"

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## Period between keep alive probes.
  ## Only applies to TCP sockets.
  ## 0 disables keep alive probes.
  ## Defaults to the OS configuration.
  # keep_alive_period = "5m"

  ## Data format to generate.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "json"
  namepass = ["container.azm.ms/disk"]
  #fieldpass = ["used_percent"]

# [[outputs.application_insights]]
#   ## Instrumentation key of the Application Insights resource.
#   instrumentation_key = "$TELEMETRY_APPLICATIONINSIGHTS_KEY"

#   ## Timeout for closing (default: 5s).
#   # timeout = "5s"

#   ## Enable additional diagnostic logging.
#   enable_diagnostic_logging = false

  ## Context Tag Sources add Application Insights context tags to a tag value.
  ##
  ## For list of allowed context tag keys see:
  ## https://github.com/Microsoft/ApplicationInsights-Go/blob/master/appinsights/contracts/contexttagkeys.go
  # [outputs.application_insights.context_tag_sources]
  #   "ai.cloud.role" = "kubernetes_container_name"
  #   "ai.cloud.roleInstance" = "kubernetes_pod_name"
  # namepass = ["agent_telemetry"]
  #tagdrop = ["nodeName"]

###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################

[[processors.converter]]
  [processors.converter.fields]
    float = ["*"]
# # Perform string processing on tags, fields, and measurements
#[[processors.rename]]
  #[[processors.rename.replace]]
  #   measurement = "disk"
  #   dest = "nodes"
#  [[processors.rename.replace]]
#     field = "free"
#     dest = "freeBytes"
#  [[processors.rename.replace]]
#     field = "used"
#     dest = "usedBytes"
#  [[processors.rename.replace]]
#     field = "used_percent"
#     dest = "usedPercentage"
  #[[processors.rename.replace]]
  #   measurement = "net"
  #   dest = "nodes"
  #[[processors.rename.replace]]
  #   field = "bytes_recv"
  #   dest = "networkBytesReceivedTotal"
  #[[processors.rename.replace]]
  #   field = "bytes_sent"
  #   dest = "networkBytesSentTotal"
  #[[processors.rename.replace]]
  #   field = "err_in"
  #   dest = "networkErrorsInTotal"
  #[[processors.rename.replace]]
  #   field = "err_out"
  #   dest = "networkErrorsOutTotal"
  #[[processors.rename.replace]]
  #   measurement = "kubernetes_pod_volume"
  #   dest = "pods"
  #[[processors.rename.replace]]
  #   field = "used_bytes"
  #   dest = "podVolumeUsedBytes"
  #[[processors.rename.replace]]
  #   field = "available_bytes"
  #   dest = "podVolumeAvailableBytes"
  #[[processors.rename.replace]]
  #   measurement = "kubernetes_pod_network"
  #   dest = "pods"
  #[[processors.rename.replace]]
  #   field = "tx_errors"
  #   dest = "podNetworkTxErrorsTotal"
  #[[processors.rename.replace]]
  #   field = "rx_errors"
  #   dest = "podNetworkRxErrorsTotal"
  #[[processors.rename.replace]]
  #   tag = "volume_name"
  #   dest = "volumeName"
  #[[processors.rename.replace]]
  #   tag = "pod_name"
  #   dest = "podName"
  #[[processors.rename.replace]]
  #   measurement = "docker"
  #   dest = "containers"
  #[[processors.rename.replace]]
  #   measurement = "docker_container_status"
  #   dest = "containers"
  #[[processors.rename.replace]]
  #   field = "n_containers"
  #   dest = "numContainers"
  #[[processors.rename.replace]]
  #   field = "n_containers_running"
  #   dest = "numContainersRunning"
  #[[processors.rename.replace]]
  #   field = "n_containers_stopped"
  #   dest = "numContainersStopped"
  #[[processors.rename.replace]]
  #   field = "n_containers_paused"
  #   dest = "numContainersPaused"
  #[[processors.rename.replace]]
  #   field = "n_images"
  #   dest = "numContainerImages"

#   ## Convert a tag value to uppercase
#   # [[processors.strings.uppercase]]
#   #   tag = "method"
#
#   ## Convert a field value to lowercase and store in a new field
#   # [[processors.strings.lowercase]]
#   #   field = "uri_stem"
#   #   dest = "uri_stem_normalised"
#
#   ## Trim leading and trailing whitespace using the default cutset
#   # [[processors.strings.trim]]
#   #   field = "message"
#
#   ## Trim leading characters in cutset
#   # [[processors.strings.trim_left]]
#   #   field = "message"
#   #   cutset = "\t"
#
#   ## Trim trailing characters in cutset
#   # [[processors.strings.trim_right]]
#   #   field = "message"
#   #   cutset = "\r\n"
#
#   ## Trim the given prefix from the field
#   # [[processors.strings.trim_prefix]]
#   #   field = "my_value"
#   #   prefix = "my_"
#
#   ## Trim the given suffix from the field
#   # [[processors.strings.trim_suffix]]
#   #   field = "read_count"
#   #   suffix = "_count"

# # Print all metrics that pass through this filter.
# [[processors.topk]]
#   ## How many seconds between aggregations
#   # period = 10
#
#   ## How many top metrics to return
#   # k = 10
#
#   ## Over which tags should the aggregation be done. Globs can be specified, in
#   ## which case any tag matching the glob will be aggregated over. If set to an
#   ## empty list, no aggregation over tags is done
#   # group_by = ['*']
#
#   ## Over which fields are the top k are calculated
#   # fields = ["value"]
#
#   ## What aggregation to use. Options: sum, mean, min, max
#   # aggregation = "mean"
#
#   ## Instead of the top k largest metrics, return the bottom k lowest metrics
#   # bottomk = false
#
#   ## The plugin assigns each metric a GroupBy tag generated from its name and
#   ## tags. If this setting is different than "" the plugin will add a
#   ## tag (which name will be the value of this setting) to each metric with
#   ## the value of the calculated GroupBy tag. Useful for debugging
#   # add_groupby_tag = ""
#
#   ## These settings provide a way to know the position of each metric in
#   ## the top k. The 'add_rank_fields' setting allows specifying for which
#   ## fields the position is required. If the list is non empty, then a field
#   ## will be added to each and every metric for each string present in this
#   ## setting. This field will contain the ranking of the group that
#   ## the metric belonged to when aggregated over that field.
#   ## The name of the field will be set to the name of the aggregation field,
#   ## suffixed with the string '_topk_rank'
#   # add_rank_fields = []
#
#   ## These settings provide a way to know what values the plugin is generating
#   ## when aggregating metrics. The 'add_aggregate_fields' setting allows
#   ## specifying for which fields the final aggregation value is required. If the
#   ## list is non empty, then a field will be added to each and every metric for
#   ## each field present in this setting. This field will contain
#   ## the computed aggregation for the group that the metric belonged to when
#   ## aggregated over that field.
#   ## The name of the field will be set to the name of the aggregation field,
#   ## suffixed with the string '_topk_aggregate'
#   # add_aggregate_fields = []

###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################
# [[aggregators.quantile]]
#   period = "30m"
#   drop_original = true
#   quantiles = [0.95]
#   algorithm = "t-digest"
#   compression = 100.0
#   namepass = ["t.azm.ms/agent_telemetry"]
# # Keep the aggregate basicstats of each metric passing through.
# [[aggregators.basicstats]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false

# # Create aggregate histograms.
# [[aggregators.histogram]]
#   ## The period in which to flush the aggregator.
#   period = "30s"
#
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false
#
#   ## Example config that aggregates all fields of the metric.
#   # [[aggregators.histogram.config]]
#   #   ## The set of buckets.
#   #   buckets = [0.0, 15.6, 34.5, 49.1, 71.5, 80.5, 94.5, 100.0]
#   #   ## The name of metric.
#   #   measurement_name = "cpu"
#
#   ## Example config that aggregates only specific fields of the metric.
#   # [[aggregators.histogram.config]]
#   #   ## The set of buckets.
#   #   buckets = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
#   #   ## The name of metric.
#   #   measurement_name = "diskio"
#   #   ## The concrete fields of metric
#   #   fields = ["io_time", "read_time", "write_time"]

# # Keep the aggregate min/max of each metric passing through.
# [[aggregators.minmax]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false

# # Count the occurrence of values in fields.
# [[aggregators.valuecounter]]
#   ## General Aggregator Arguments:
#   ## The period on which to flush & clear the aggregator.
#   period = "30s"
#   ## If true, the original metric will be dropped by the
#   ## aggregator and will not get sent to the output plugins.
#   drop_original = false
#   ## The fields for which the values will be counted
#   fields = []

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about cpu usage
#[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
#  percpu = false
  ## Whether to report total system cpu stats or not
#  totalcpu = true
  ## If true, collect raw CPU time metrics.
#  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
#  report_active = true
#  fieldpass = ["usage_active","cluster","node","host","device"]
#  taginclude = ["cluster","cpu","node"]

# Dummy plugin to test out toml parsing happens properly
[[inputs.file]]
  interval = "24h"
  files = ["test.json"]
  data_format = "json"

# Read metrics about disk usage by mount point
[[inputs.disk]]
  name_prefix="container.azm.ms/"
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs", "cifs", "fuse"]
  fieldpass = ["free", "used", "used_percent"]
  taginclude = ["device","path","hostName"]
  # Below due to Bug - https://github.com/influxdata/telegraf/issues/5615
  # ORDER matters here!! - i.e the below should be the LAST modifier
  [inputs.disk.tagdrop]
    path = ["/var/lib/kubelet*", "/dev/termination-log", "/var/log", "/etc/hosts", "/etc/resolv.conf", "/etc/hostname", "/etc/kubernetes/host", "/var/lib/docker/containers", "/etc/config/settings", "/run/host/containerd/io.containerd.runtime.v2.task/k8s.io/*"]

# Read metrics about memory usage
#[[inputs.mem]]
#  fieldpass = ["used_percent", "cluster", "node","host","device"]
#  taginclude = ["cluster","node"]

# Read metrics about disk IO by device
[[inputs.diskio]]
  name_prefix="container.azm.ms/"
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  devices = ["sd[a-z][0-9]", "xvd[a-z]"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
  fieldpass = ["reads", "read_bytes", "read_time", "writes", "write_bytes", "write_time", "io_time", "iops_in_progress"]
  taginclude = ["name","hostName"]

# Read metrics about network interface usage
[[inputs.net]]
  name_prefix="container.azm.ms/"
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  ##
  # interfaces = ["eth0"]
  ##
  ## On linux systems telegraf also collects protocol stats.
  ## Setting ignore_protocol_stats to true will skip reporting of protocol metrics.
  ##
  ignore_protocol_stats = true
  ##
  fieldpass = ["bytes_recv", "bytes_sent", "err_in", "err_out"]
  taginclude = ["interface","hostName"]

# Read metrics from the kubernetes kubelet api
#[[inputs.kubernetes]]
  ## URL for the kubelet
  #url = "http://1.1.1.1:10255"
#  url = "http://placeholder_nodeip:10255"

  ## Use bearer token for authorization
  # bearer_token = /path/to/bearer/token

  ## Set timeout (default 5 seconds)
  # timeout = "5s"

  ## Optional TLS Config
  # tls_ca = /path/to/cafile
  # tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false
#  fieldpass = ["used_bytes", "available_bytes", "tx_errors", "rx_errors"  ]
#  taginclude = ["volume_name","nodeName","namespace","pod_name"]
# Read metrics about docker containers
#[[inputs.docker]]
  ## Docker Endpoint
  ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
  ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
#  endpoint = "unix:///var/run/host/docker.sock"

  ## Set to true to collect Swarm metrics(desired_replicas, running_replicas)
#  gather_services = false

  ## Only collect metrics for these containers, collect all if empty
#  container_names = []

  ## Containers to include and exclude. Globs accepted.
  ## Note that an empty array for both will include all containers
#  container_name_include = []
#  container_name_exclude = []

  ## Container states to include and exclude. Globs accepted.
  ## When empty only containers in the "running" state will be captured.
#  container_state_include = ['*']
  # container_state_exclude = []

  ## Timeout for docker list, info, and stats commands
#  timeout = "5s"

  ## Whether to report for each container per-device blkio (8:0, 8:1...) and
  ## network (eth0, eth1, ...) stats or not
#  perdevice = true
  ## Whether to report for each container total blkio and network stats or not
#  total = true
  ## Which environment variables should we use as a tag
  ##tag_env = ["JAVA_HOME", "HEAP_SIZE"]

  ## docker labels to include and exclude as tags.  Globs accepted.
  ## Note that an empty array for both will include all labels as tags
#  docker_label_include = []
#  docker_label_exclude = []

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false
#  fieldpass = ["n_containers", "n_containers_running", "n_containers_stopped", "n_containers_paused", "n_images"]
  #fieldpass = ["numContainers", "numContainersRunning", "numContainersStopped", "numContainersPaused", "numContainerImages"]
#  taginclude = ["nodeName"]

# [[inputs.procstat]]
#  name_prefix="t.azm.ms/"
#  exe = "mdsd"
#  interval = "60s"
#  pid_finder = "native"
#  pid_tag = true
#  name_override = "agent_telemetry"
#  fieldpass = ["cpu_usage", "memory_rss"]
#  [inputs.procstat.tags]
#    Computer = "$NODE_NAME"
#    AgentVersion = "$AGENT_VERSION"
#    ControllerType = "$CONTROLLER_TYPE"
#    AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
#    ACSResourceName = "$TELEMETRY_ACS_RESOURCE_NAME"
#    Region = "$TELEMETRY_AKS_REGION"
# [[inputs.procstat]]
#  #name_prefix="container.azm.ms/"
#  exe = "ruby"
#  interval = "10s"
#  pid_finder = "native"
#  pid_tag = true
#  name_override = "agent_telemetry"
#  fieldpass = ["cpu_usage", "memory_rss", "memory_swap", "memory_vms", "memory_stack"]
#  [inputs.procstat.tags]
#    Computer = "$NODE_NAME"
#    AgentVersion = "$AGENT_VERSION"
#    ControllerType = "$CONTROLLER_TYPE"
#    AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
#    ACSResourceName = "$TELEMETRY_ACS_RESOURCE_NAME"
#    Region = "$TELEMETRY_AKS_REGION"
# [[inputs.procstat]]
#  #name_prefix="container.azm.ms/"
#  exe = "fluent-bit"
#  interval = "10s"
#  pid_finder = "native"
#  pid_tag = true
#  name_override = "agent_telemetry"
#  fieldpass = ["cpu_usage", "memory_rss", "memory_swap", "memory_vms", "memory_stack"]
#  [inputs.procstat.tags]
#    Computer = "$NODE_NAME"
#    AgentVersion = "$AGENT_VERSION"
#    ControllerType = "$CONTROLLER_TYPE"
#    AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
#    ACSResourceName = "$TELEMETRY_ACS_RESOURCE_NAME"
#    Region = "$TELEMETRY_AKS_REGION"
# [[inputs.procstat]]
#  #name_prefix="container.azm.ms/"
#  exe = "telegraf"
#  interval = "10s"
#  pid_finder = "native"
#  pid_tag = true
#  name_override = "agent_telemetry"
#  fieldpass = ["cpu_usage", "memory_rss", "memory_swap", "memory_vms", "memory_stack"]
#  [inputs.procstat.tags]
#    Computer = "$NODE_NAME"
#    AgentVersion = "$AGENT_VERSION"
#    ControllerType = "$CONTROLLER_TYPE"
#    AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
#    ACSResourceName = "$TELEMETRY_ACS_RESOURCE_NAME"
#    Region = "$TELEMETRY_AKS_REGION"

#kubelet-1
[[inputs.prometheus]]
  name_prefix="container.azm.ms/"
  ## An array of urls to scrape metrics from.
  urls = ["$CADVISOR_METRICS_URL"]
  fieldpass = ["$KUBELET_RUNTIME_OPERATIONS_METRIC", "$KUBELET_RUNTIME_OPERATIONS_ERRORS_METRIC", "$KUBELET_RUNTIME_OPERATIONS_TOTAL_METRIC", "$KUBELET_RUNTIME_OPERATIONS_ERRORS_TOTAL_METRIC"]

  metric_version = 2
  url_tag = "scrapeUrl"

  ## An array of Kubernetes services to scrape metrics from.
  # kubernetes_services = ["http://my-service-dns.my-namespace:9100/metrics"]

  ## Kubernetes config file to create client from.
  # kube_config = "/path/to/kubernetes.config"

  ## Scrape Kubernetes pods for the following prometheus annotations:
  ## - prometheus.io/scrape: Enable scraping for this pod
  ## - prometheus.io/scheme: If the metrics endpoint is secured then you will need to
  ##     set this to `https` & most likely set the tls config.
  ## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
  ## - prometheus.io/port: If port is not 9102 use this annotation
  # monitor_kubernetes_pods = true

  ## Use bearer token for authorization. ('bearer_token' takes priority)
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  ## OR
  # bearer_token_string = "abc_123"

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  timeout = "15s"

  ## Optional TLS Config
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  #tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = true
  #tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]
  [inputs.prometheus.tagpass]
    operation_type = ["create_container", "remove_container", "pull_image"]

#kubelet-2
[[inputs.prometheus]]
  name_prefix="container.azm.ms/"
  ## An array of urls to scrape metrics from.
  urls = ["$CADVISOR_METRICS_URL"]

  # <= 1.18: metric name is kubelet_running_pod_count
  # >= 1.19: metric name changed to kubelet_running_pods
  fieldpass = ["kubelet_running_pod_count","kubelet_running_pods","volume_manager_total_volumes", "kubelet_node_config_error", "process_resident_memory_bytes", "process_cpu_seconds_total"]

  metric_version = 2
  url_tag = "scrapeUrl"

  ## Use bearer token for authorization. ('bearer_token' takes priority)
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  ## Specify timeout duration for slower prometheus clients (default is 3s)
  timeout = "15s"

  ## Optional TLS Config
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  insecure_skip_verify = true

## prometheus custom metrics
[[inputs.prometheus]]

  interval = "$AZMON_DS_PROM_INTERVAL"

  ## An array of urls to scrape metrics from.
  urls = $AZMON_DS_PROM_URLS

  fieldpass = $AZMON_DS_PROM_FIELDPASS

  fielddrop = $AZMON_DS_PROM_FIELDDROP

  metric_version = 2
  url_tag = "scrapeUrl"

  ## Kubernetes config file to create client from.
  # kube_config = "/path/to/kubernetes.config"

  ## Use bearer token for authorization. ('bearer_token' takes priority)
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  ## OR
  # bearer_token_string = "abc_123"

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  timeout = "15s"

  ## Optional TLS Config
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  #tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = true
  #tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]

  ## Default is 'namespace' but this can conflict with metrics that have the label 'namespace'
  pod_namespace_label_name = "pod_namespace"

##npm
[[inputs.prometheus]]
  #name_prefix="container.azm.ms/"
  ## An array of urls to scrape metrics from.
  urls = $AZMON_INTEGRATION_NPM_METRICS_URL_LIST_NODE

  metric_version = 2
  url_tag = "scrapeUrl"

  ## An array of Kubernetes services to scrape metrics from.
  # kubernetes_services = ["http://my-service-dns.my-namespace:9100/metrics"]

  ## Kubernetes config file to create client from.
  # kube_config = "/path/to/kubernetes.config"

  ## Scrape Kubernetes pods for the following prometheus annotations:
  ## - prometheus.io/scrape: Enable scraping for this pod
  ## - prometheus.io/scheme: If the metrics endpoint is secured then you will need to
  ##     set this to `https` & most likely set the tls config.
  ## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
  ## - prometheus.io/port: If port is not 9102 use this annotation
  # monitor_kubernetes_pods = true

  ## Use bearer token for authorization. ('bearer_token' takes priority)
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  ## OR
  # bearer_token_string = "abc_123"

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  timeout = "15s"

  ## Optional TLS Config
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  #tls_cert = /path/to/certfile
  # tls_key = /path/to/keyfile
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = true
  #tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]
  #[inputs.prometheus.tagpass]
  #  operation_type = ["create_container", "remove_container", "pull_image"]

# [[inputs.exec]]
#   ## Commands array
#   interval = "15m"
#   commands = [
#     "/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh"
#   ]

#   ## Timeout for each command to complete.
#   timeout = "15s"

#   ## measurement name suffix (for separating different commands)
#   name_suffix = "_telemetry"

#   ## Data format to consume.
#   ## Each data format has its own unique set of configuration options, read
#   ## more about them here:
#   ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
#   data_format = "influx"
#   tagexclude = ["hostName"]
#   [inputs.exec.tags]
#     AgentVersion = "$AGENT_VERSION"
#     AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
#     ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
#     Region = "$TELEMETRY_AKS_REGION"
#     ClusterName = "$TELEMETRY_CLUSTER_NAME"
#     ClusterType = "$TELEMETRY_CLUSTER_TYPE"
#     Computer = "placeholder_hostname"
#     ControllerType = "$CONTROLLER_TYPE"

## ip subnet usage
[[inputs.prometheus]]
  #name_prefix="container.azm.ms/"
  ## An array of urls to scrape metrics from.
  urls = $AZMON_INTEGRATION_SUBNET_IP_USAGE_METRICS_URL_LIST_NODE

  metric_version = 2
  url_tag = "scrapeUrl"

  ## Use bearer token for authorization. ('bearer_token' takes priority)
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

  ## Specify timeout duration for slower prometheus clients (default is 3s)
  timeout = "15s"

  ## Optional TLS Config
  tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  insecure_skip_verify = true

Logs from Telegraf

2024-04-25T17:05:00Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.113:9102/metrics": Get "http://10.15.218.113:9102/metrics": dial tcp 10.15.218.113:9102: connect: connection refused
2024-04-25T17:05:00Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://10.15.218.96:15020/stats/prometheus": reading text format failed: text format parsing error in line 757: invalid metric name
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.103:15020/stats/prometheus": Get "http://10.15.218.103:15020/stats/prometheus": dial tcp 10.15.218.103:15020: i/o timeout (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.112:15020/stats/prometheus": Get "http://10.15.218.112:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.117:15020/stats/prometheus": Get "http://10.15.218.117:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.219.0:15020/stats/prometheus": Get "http://10.15.219.0:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.219.189:15020/stats/prometheus": Get "http://10.15.219.189:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.217.29:15020/stats/prometheus": Get "http://10.15.217.29:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.109:15020/stats/prometheus": Get "http://10.15.218.109:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.108:15020/stats/prometheus": Get "http://10.15.218.108:15020/stats/prometheus": dial tcp 10.15.218.108:15020: i/o timeout (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.115:15020/stats/prometheus": Get "http://10.15.218.115:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.11:15020/stats/prometheus": Get "http://10.15.218.11:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.217.57:15020/stats/prometheus": Get "http://10.15.217.57:15020/stats/prometheus": dial tcp 10.15.217.57:15020: i/o timeout (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.218.94:15020/stats/prometheus": Get "http://10.15.218.94:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.217.253:15020/stats/prometheus": Get "http://10.15.217.253:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-04-25T17:05:15Z E! [inputs.prometheus] Error in plugin: error making HTTP request to "http://10.15.217.28:15020/stats/prometheus": Get "http://10.15.217.28:15020/stats/prometheus": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

System info

Telegraf 1.28.5, Istio 1.21

Docker

No response

Steps to reproduce

  1. Deploy a workload with the telegraf config and the Istio sample app.
  2. Enable Prometheus merge and check the metrics collected.
  3. Compare the metrics collected with Prometheus merge on and off to verify that metric collection has issues. ...

Expected behavior

All metrics are collected when using Istio and Prometheus merge.

Actual behavior

With Prometheus merge enabled and Istio 1.21+, some metrics are missing. There appear to be compatibility issues between Telegraf and Istio 1.21+ that produce errors, including format and parsing errors.

Additional info

No response

powersj commented 3 months ago

Let's take a look at each log line you provided:

dial tcp 10.15.218.113:9102: connect: connection refused

That is not a Telegraf issue. Your endpoint is not accepting connections. What would you expect Telegraf to do about this?

"http://10.15.218.96:15020/stats/prometheus": reading text format failed: text format parsing error in line 757: invalid metric name

You need to load that file, go to line 757, read what the metric name is, and figure out why it is invalid. Probably not a Telegraf issue.

context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This covers the rest of the log lines. We have a comment about this in our FAQ: namely, something in your networking is having issues. You went and upgraded your networking, so I have a good idea of where you might go looking.

Nothing from the above points to a required change in Telegraf. If you do think something needs an update, can you please:

  1. Update your version of telegraf, which is probably the most important thing
  2. Provide additional information about which configured input is getting errors. For example, add an alias to each input (see the sketch after this list) to help figure out which input is running into which error. You have IPs in your logs, but I have no way of knowing which IP corresponds to which input.
  3. At the very least, capture the prometheus payload that Telegraf cannot parse so we can reproduce the issue.
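
As a sketch of point 2 (the alias names below are made up, not taken from the original config), every plugin accepts an alias option, and the per-plugin log lines should then carry it:

[[inputs.prometheus]]
  ## Hypothetical alias; error lines then read [inputs.prometheus::kubelet_runtime_ops]
  alias = "kubelet_runtime_ops"
  urls = ["$CADVISOR_METRICS_URL"]

[[inputs.prometheus]]
  ## Hypothetical alias for the custom prometheus scrape
  alias = "custom_prom_scrape"
  urls = $AZMON_DS_PROM_URLS

With aliases set, each "context deadline exceeded" or parsing error names the specific plugin instance, so the failing URLs can be mapped back to their inputs.
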
vijaytdh commented 3 months ago

Thanks @wanlonghenry for raising this. We (well the developers in our business that I support) are experiencing this issue but let me clarify a few things and provide some more context:

When we see an error similar to "text format parsing error in line 807: invalid metric name", we can curl the endpoint and see that this line just has "# HELP process_start_time_seconds Telegraf Collected metric"

Example metrics attached here: prom-metrics-merge.txt

As far as I can tell there are no special characters on that line.

Prior to this line there are other HELP lines, but for some reason it errors on this particular one. If we pipe the output to "promtool check metrics", it is able to parse the metrics. It does have some warnings about some of the Istio metrics not having help text, but I don't think those would cause an invalid metric name error; also, all the metrics it warns about are Istio ones which are being scraped.
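
For what it's worth, a minimal local check of whether Telegraf's parser accepts this output is to run the attached file through the prometheus data format (a sketch, assuming the attachment is saved as prom-metrics-merge.txt and the snippet as parse-test.toml; both names are just examples):

[[inputs.file]]
  ## Parse the saved scrape output with the prometheus parser, which relies on the
  ## same upstream Prometheus text-format library used when scraping
  files = ["prom-metrics-merge.txt"]
  data_format = "prometheus"

Running "telegraf --config parse-test.toml --test" should either print the parsed metrics or reproduce the "invalid metric name" error.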

We did try changing the telegraf config for the workload to use metrics version 1 / 2 but that doesn't seem to help. To be honest it's really hard to figure out what the issue is, whether it's Istio, the telegraf sidecar, the AMA telegraf or a combination of these and the settings associated with them.

powersj commented 3 months ago

The problem is about the error "text format parsing error in line 757: invalid metric name". Example metrics attached here: prom-metrics-merge.txt

Thank you for clarifying what the issue is and providing the prometheus metrics. Is the line number and corresponding metric name the same for every deployment? Or does it vary? Is it always the last line of the file?

Prior to this line there are other help lines but for some reason it has an error with this particular one. If we pipe the output to "promtool check metrics" - it is able to parse the metrics.

Telegraf uses the upstream Prometheus library to parse the data. In this case, "invalid metric name" comes from github.com/prometheus/common. A valid metric name is required to match [a-zA-Z_:][a-zA-Z0-9_:]*; for example, a name that starts with a digit or contains a hyphen would be rejected.

It does have some warnings about some of the Istio metrics not having help text - but I don't think those would cause an invalid metric name error, also all the metrics it has warnings for are Istio ones which are being scraped.

Agreed, not having help text would not stop Telegraf from reading the metrics. You can try this and should still see the metrics.

If I use the following config:

[agent]
  debug = true
  omit_hostname = true

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:8000/prom-metrics-merge.txt"]

[[outputs.file]]

I have tried parsing the metrics you provided with both v1.28.5 and master, and neither produces any errors or warnings. Using your same config with the Prometheus Client output also resolves the lines as expected:

[agent]
  debug = true
  omit_hostname = true

[[inputs.prometheus]]
  urls = ["http://127.0.0.1:8000/prom-metrics-merge.txt"]

[[outputs.prometheus_client]]
  listen = ":8080"
  path = "/metrics"
  collectors_exclude = ["gocollector","process"]
$ ../telegraf-builds/telegraf-v1.28.5 --config config.toml
2024-06-06T14:04:49Z I! Loading config: config.toml
2024-06-06T14:04:49Z I! Starting Telegraf 1.28.5 brought to you by InfluxData the makers of InfluxDB
2024-06-06T14:04:49Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2024-06-06T14:04:49Z I! Loaded inputs: prometheus
2024-06-06T14:04:49Z I! Loaded aggregators: 
2024-06-06T14:04:49Z I! Loaded processors: 
2024-06-06T14:04:49Z I! Loaded secretstores: 
2024-06-06T14:04:49Z I! Loaded outputs: prometheus_client
2024-06-06T14:04:49Z I! Tags enabled: 
2024-06-06T14:04:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-06-06T14:04:49Z D! [agent] Initializing plugins
2024-06-06T14:04:49Z I! [inputs.prometheus] Using the label selector:  and field selector: 
2024-06-06T14:04:49Z D! [agent] Connecting outputs
2024-06-06T14:04:49Z D! [agent] Attempting connection to [outputs.prometheus_client]
2024-06-06T14:04:49Z I! [outputs.prometheus_client] Listening on http://[::]:8080/metrics
2024-06-06T14:04:49Z D! [agent] Successfully connected to outputs.prometheus_client
2024-06-06T14:04:49Z D! [agent] Starting service inputs
2024-06-06T14:04:59Z D! [outputs.prometheus_client] Wrote batch of 278 metrics in 1.984879ms
2024-06-06T14:04:59Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2024-06-06T14:05:09Z D! [outputs.prometheus_client] Wrote batch of 278 metrics in 1.671008ms
2024-06-06T14:05:09Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
telegraf-tiger[bot] commented 2 months ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem; if not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!