influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf panic with OpenStack input plugin #13999

Closed llossinxw closed 1 year ago

llossinxw commented 1 year ago

Relevant telegraf.conf

[global_tags]
[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "5s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
hostname = ""
omit_hostname = false

[[inputs.openstack]]
interval = "30m"
## The identity endpoint to authenticate against and get the service catalog from.
authentication_endpoint = "http://10.0.2.126:5000/v3/"

## The domain to authenticate against when using a V3 identity endpoint.
domain = "Default"
## The project to authenticate as.
project = "adminProject"
## User authentication credentials. Must have admin rights. 
username = "adminUser"
password = "adminPasswd"

## Available services are:
## "agents", "aggregates", "flavors", "hypervisors", "networks", "nova_services",
## "ports", "projects", "servers", "services", "stacks", "storage_pools", "subnets", "volumes"
enabled_services = ["agents", "aggregates", "flavors", "hypervisors", "networks", "nova_services", "ports", "projects", "servers", "services", "stacks", "storage_pools", "subnets", "volumes"]
# NOTE: if stacks, storage_pools or volumes in enabled services -> SEGMENTATION (No HEAT, CINDER installed)

## Collect Server Diagnostics
server_diagnotics = false
## output secrets (such as adminPass(for server) and UserID(for volume)).
output_secrets = false
## Amount of time allowed to complete the HTTP(s) request.
timeout = "10s"

## HTTP Proxy support
# http_proxy_url = ""
## Optional TLS Config
# tls_ca = /path/to/cafile
# tls_cert = /path/to/certfile
# tls_key = /path/to/keyfile
## Use TLS but skip chain & host verification
# insecure_skip_verify = false

## Options for tags received from Openstack
# tag_prefix = "openstack_tag_"
# tag_value = "true"

## Timestamp format for timestamp data received from Openstack.
## If false format is unix nanoseconds.
human_readable_timestamps = false

## Measure Openstack call duration
# measure_openstack_requests = false

[[outputs.file]]
files = ["stdout"]
tagexclude = [ "url", ]

Logs from Telegraf

telegraf_openstack_1  | panic: runtime error: invalid memory address or nil pointer dereference
telegraf_openstack_1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x273c2f3]
telegraf_openstack_1  | 
telegraf_openstack_1  | goroutine 30 [running]:
telegraf_openstack_1  | github.com/gophercloud/gophercloud.(*ServiceClient).ResourceBaseURL(...)
telegraf_openstack_1  |     /go/pkg/mod/github.com/gophercloud/gophercloud@v0.16.0/service_client.go:39
telegraf_openstack_1  | github.com/gophercloud/gophercloud.(*ServiceClient).ServiceURL(...)
telegraf_openstack_1  |     /go/pkg/mod/github.com/gophercloud/gophercloud@v0.16.0/service_client.go:47
telegraf_openstack_1  | github.com/gophercloud/gophercloud/openstack/orchestration/v1/stacks.createURL(0x1)
telegraf_openstack_1  |     /go/pkg/mod/github.com/gophercloud/gophercloud@v0.16.0/openstack/orchestration/v1/stacks/urls.go:6 +0x33
telegraf_openstack_1  | github.com/gophercloud/gophercloud/openstack/orchestration/v1/stacks.listURL(...)
telegraf_openstack_1  |     /go/pkg/mod/github.com/gophercloud/gophercloud@v0.16.0/openstack/orchestration/v1/stacks/urls.go:14
telegraf_openstack_1  | github.com/gophercloud/gophercloud/openstack/orchestration/v1/stacks.List(0x0, {0x57a38c0, 0xc0001b8cc0})
telegraf_openstack_1  |     /go/pkg/mod/github.com/gophercloud/gophercloud@v0.16.0/openstack/orchestration/v1/stacks/requests.go:282 +0x71
telegraf_openstack_1  | github.com/influxdata/telegraf/plugins/inputs/openstack.(*OpenStack).gatherStacks(0xc000864c80, {0x58c8a98, 0xc000a0ad40})
telegraf_openstack_1  |     /go/src/github.com/influxdata/telegraf/plugins/inputs/openstack/openstack.go:325 +0x71
telegraf_openstack_1  | github.com/influxdata/telegraf/plugins/inputs/openstack.(*OpenStack).Gather(0xc000864c80, {0x58c8a98, 0xc000a0ad40})
telegraf_openstack_1  |     /go/src/github.com/influxdata/telegraf/plugins/inputs/openstack/openstack.go:281 +0xb93
telegraf_openstack_1  | github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc000818280, {0x58c8a98, 0xc000a0ad40})
telegraf_openstack_1  |     /go/src/github.com/influxdata/telegraf/models/running_input.go:117 +0x5a
telegraf_openstack_1  | github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
telegraf_openstack_1  |     /go/src/github.com/influxdata/telegraf/agent/agent.go:469 +0x2e
telegraf_openstack_1  | created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce
telegraf_openstack_1  |     /go/src/github.com/influxdata/telegraf/agent/agent.go:468 +0x12f
openstack_datasource_telegraf_openstack_1 exited with code 2

System info

Telegraf 1.21.2-alpine, Ubuntu 22.04.3 LTS

Docker

version: '3.3'
services:
  telegraf_openstack:
    image: telegraf:1.21.2-alpine
    hostname: telegraf_openstack
    extra_hosts:

Steps to reproduce

  1. Create a directory with the following structure:
    telegraf_openstack_panic_bug
    ├── config
    │   └── telegraf_openstack
    │       └── telegraf.conf
    └── docker-compose.yml
  2. Copy the Telegraf configuration file to telegraf_openstack_panic_bug/config/telegraf_openstack/telegraf.conf
  3. Update the configuration with a valid authentication_endpoint, domain, project, username, and password
  4. Run the docker-compose up command

Expected behavior

The Telegraf container should poll the OpenStack APIs and collect metrics from every enabled service that is available, skipping any service that is not.

Actual behavior

The Telegraf container stops after the first interval with a panic (segmentation violation); see the logs included above.

Additional info

Several tests were run, enabling each of the allowed enabled_services values (i.e. "agents", "aggregates", "flavors", "hypervisors", "networks", "nova_services", "ports", "projects", "servers", "services", "stacks", "storage_pools", "subnets", "volumes") one at a time.

The panic occurs when the enabled_services field in the Telegraf configuration file includes at least one of the "stacks", "storage_pools" or "volumes" values.

My guess is that the bug occurs because the Heat and Cinder OpenStack modules (which provide the "stacks" metrics and the "storage_pools" and "volumes" metrics, respectively) are not enabled in the OpenStack instance being monitored.
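
The trace above shows stacks.List being called with 0x0 as its first argument, which is consistent with a service client that was never created. The following is a minimal sketch of that failure mode and of a guard against it. The serviceClient type and the function names are hypothetical stand-ins for illustration only; they are not the gophercloud or Telegraf code.

```go
package main

import "fmt"

// serviceClient is a hypothetical stand-in for a gophercloud-style client.
// It is NOT the real gophercloud.ServiceClient type.
type serviceClient struct {
	Endpoint     string
	ResourceBase string
}

// resourceBaseURL reads fields through the receiver, so calling it on a nil
// pointer dereferences nil and panics at runtime.
func (c *serviceClient) resourceBaseURL() string {
	if c.ResourceBase != "" {
		return c.ResourceBase
	}
	return c.Endpoint
}

// listStacksURL guards against a client that was never initialised, which is
// what a service missing from the catalog could lead to (assumption).
func listStacksURL(c *serviceClient) (string, error) {
	if c == nil {
		return "", fmt.Errorf("orchestration client is not initialised")
	}
	return c.resourceBaseURL() + "stacks", nil
}

// callUnguarded calls the method on a possibly nil client and reports
// whether the call panicked.
func callUnguarded(c *serviceClient) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = c.resourceBaseURL()
	return false
}

func main() {
	var missing *serviceClient // no Heat endpoint, so the client stays nil

	fmt.Println("unguarded call panicked:", callUnguarded(missing))

	if _, err := listStacksURL(missing); err != nil {
		fmt.Println("guarded call:", err)
	}
}
```

With the guard in place the missing service surfaces as an ordinary error that the caller can log and skip, instead of taking down the whole agent.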

srebhan commented 1 year ago

@llossinxw can you please try a more recent version of telegraf e.g. v1.28.1?

llossinxw commented 1 year ago

Hi @srebhan I already tried with Telegraf v1.28.1-alpine. The error is still present but the logs are less verbose.

telegraf_openstack_1  | panic: runtime error: invalid memory address or nil pointer dereference
telegraf_openstack_1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x273c2f3]
openstack_datasource_telegraf_openstack_1 exited with code 2

srebhan commented 1 year ago

Thanks for trying, will investigate but I might need your help to hunt that one down...

srebhan commented 1 year ago

@llossinxw I do have a theory... Does the crash directly happen at the first Gather() interval or only after some time? If the former, can you please add "orchestration" to the enabled_services option and let me know if this fixes the issue!?

llossinxw commented 1 year ago

@srebhan I guess that the crash is happening at the first Gather() since no metric is flushed towards the output plugin according to the logs. telegraf_openstack.log

srebhan commented 1 year ago

@llossinxw can you please test the binary available in PR #14011 once CI has finished all tests successfully? Please let me know if this fixes your issue.

Heads-up: You will probably see one or more warnings of the form

W! Disabling "stacks" service because orchestration is not available at the endpoint!

as the problem is that your endpoint needs to provide orchestration for a client to query stacks...
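
The behaviour described here, dropping a service with a warning when the endpoint does not provide its backend, can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the requiredCatalogType map and filterServices function are hypothetical, and only the service-to-backend pairs and the warning wording are taken from this thread.

```go
package main

import "fmt"

// requiredCatalogType maps a plugin service name to the catalog service type
// it depends on. The pairs come from the warnings quoted in this thread; the
// map itself is a hypothetical illustration, not the plugin's data structure.
var requiredCatalogType = map[string]string{
	"stacks":        "orchestration",
	"storage_pools": "block-storage",
	"volumes":       "block-storage",
}

// filterServices returns the enabled services whose catalog dependency is
// available, plus one warning message per service that was dropped.
func filterServices(enabled []string, available map[string]bool) (kept, warnings []string) {
	for _, name := range enabled {
		required, ok := requiredCatalogType[name]
		if ok && !available[required] {
			warnings = append(warnings, fmt.Sprintf(
				"Disabling %q service because %s is not available at the endpoint!",
				name, required))
			continue
		}
		kept = append(kept, name)
	}
	return kept, warnings
}

func main() {
	enabled := []string{"servers", "stacks", "volumes"}
	available := map[string]bool{"compute": true} // assumed catalog: no Heat, no Cinder

	kept, warnings := filterServices(enabled, available)
	for _, w := range warnings {
		fmt.Println("W!", w)
	}
	fmt.Println("gathering:", kept)
}
```

Filtering once at startup means later Gather() calls never touch a client for a backend the endpoint does not offer.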

llossinxw commented 1 year ago

@srebhan I just tried the new binary. It works as expected, thank you! The logs are these

Just a question: I am running Telegraf inside a Docker container with Docker Swarm, pulling the official Telegraf image from Docker Hub. Will the updated openstack input plugin be part of Telegraf 1.29? That is, will I have to update the image to telegraf:1.29 once it is released to get this fix? Sorry for the newbie question.

lukasmrtvy commented 1 year ago

@srebhan hi, it seems to be broken somehow. I am getting this (1.28.3):

2023-10-28T17:53:44Z W! [inputs.openstack] Disabling "cinder_services" service because block-storage is not available at the endpoint!
2023-10-28T17:53:44Z W! [inputs.openstack] Disabling "storage_pools" service because block-storage is not available at the endpoint!
2023-10-28T17:53:44Z W! [inputs.openstack] Disabling "volumes" service because block-storage is not available at the endpoint!

with:

  enabled_services = ["agents", "aggregates", "cinder_services", "flavors", "hypervisors", "networks", "nova_services", "ports", "projects", "servers", "services", "storage_pools", "subnets", "volumes"]

and OpenStack 2023.1. Of course, I have enabled every service except Heat.

lukasmrtvy commented 1 year ago

@srebhan hasBlockStorage = true is missing in https://github.com/influxdata/telegraf/blob/master/plugins/inputs/openstack/openstack.go#L200
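
To illustrate why a missing assignment like this would produce the warnings above even when Cinder is installed, here is a rough sketch of catalog detection with one flag per backend. Everything in it is an assumption for illustration: the availability type, the detect function, and the list of catalog type names are hypothetical and are not copied from openstack.go.

```go
package main

import "fmt"

// availability records which optional backends were seen in the catalog.
// Hypothetical type, for illustration only.
type availability struct {
	orchestration bool
	blockStorage  bool
}

// detect walks the catalog service types and sets one flag per backend.
// The type names listed here are assumptions about what a catalog may report.
func detect(catalogTypes []string) availability {
	var a availability
	for _, t := range catalogTypes {
		switch t {
		case "orchestration":
			a.orchestration = true
		case "volume", "volumev2", "volumev3", "block-storage":
			// If this assignment is left out, the flag stays false and every
			// block-storage-backed service is disabled although Cinder exists.
			a.blockStorage = true
		}
	}
	return a
}

func main() {
	a := detect([]string{"compute", "network", "volumev3"})
	fmt.Printf("orchestration=%v blockStorage=%v\n", a.orchestration, a.blockStorage)
}
```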

There is another problem, a typo here: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/openstack/openstack.go#L788. It should be if o.services["servers"] {

EDIT: Another: vcpus, disk_gb, ram_mb, project labels are not populated correctly in openstack_server metric.

Someone should take a look at this plugin, because it's not working correctly.

powersj commented 1 year ago

@lukasmrtvy please file a new issue rather than commenting on a closed one.