influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

procstat metric not populated on FreeBSD arm64 #13933

Open sdalu opened 1 year ago

sdalu commented 1 year ago

Relevant telegraf.conf

[[inputs.procstat]]
  exe                   = "influxd"
  tagexclude            = [ "pid_finder", "exe", "pidfile" ]

[[inputs.procstat]]
  exe                   = "telegraf"
  tagexclude            = [ "pid_finder", "exe", "pidfile" ]

Logs from Telegraf

2023-09-15T21:17:42Z I! Loading config: /usr/local/etc/telegraf.conf
2023-09-15T21:17:42Z I! Starting Telegraf unknown brought to you by InfluxData the makers of InfluxDB
2023-09-15T21:17:42Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 4 secret-stores
2023-09-15T21:17:42Z I! Loaded inputs: procstat (10x)
2023-09-15T21:17:42Z I! Loaded aggregators: 
2023-09-15T21:17:42Z I! Loaded processors: converter override (4x) regex
2023-09-15T21:17:42Z I! Loaded secretstores: 
2023-09-15T21:17:42Z I! Loaded outputs: influxdb_v2
2023-09-15T21:17:42Z I! Tags enabled: host=brain.home.sdalu.com
2023-09-15T21:17:42Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"brain.home.sdalu.com", Flush Interval:30s
2023-09-15T21:17:42Z D! [agent] Initializing plugins
2023-09-15T21:17:42Z D! [agent] Connecting outputs
2023-09-15T21:17:42Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2023-09-15T21:17:42Z D! [agent] Successfully connected to outputs.influxdb_v2
2023-09-15T21:17:42Z D! [agent] Starting service inputs
2023-09-15T21:18:13Z D! [outputs.influxdb_v2] Wrote batch of 10 metrics in 243.505683ms
2023-09-15T21:18:13Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

System info

Telegraf 1.28.0 FreeBSD 13.2 arm64

Docker

No response

Steps to reproduce

  1. run the configuration

Expected behavior

Some procstat and procstat_lookup metrics, like:

> procstat,host=rork.home.sdalu.com,org_destination=IT,process_name=telegraf,user=telegraf cpu_time_guest=0,cpu_time_guest_nice=0,cpu_time_idle=0,cpu_time_iowait=0,cpu_time_irq=0,cpu_time_nice=0,cpu_time_soft_irq=0,cpu_time_steal=0,cpu_time_system=2654.42098,cpu_time_user=816.839403,cpu_usage=0.035786768916299505,created_at=1694607213273000000i,memory_data=0i,memory_locked=0i,memory_rss=193368064i,memory_stack=0i,memory_swap=0i,memory_usage=0.37625008821487427,memory_vms=5346725888i,num_threads=32i,pid=2422i,ppid=2421i,read_bytes=0i,read_count=1115i,write_bytes=0i,write_count=0i 1694812868000000000
> procstat_lookup,host=rork.home.sdalu.com,org_destination=IT,result=success pid_count=1i,result_code=0i,running=1i 1694812868000000000

Actual behavior

Only procstat_lookup is generated, no procstat

Additional info

I don't know whether this behaviour is specific to arm64 or affects the whole arm family. On amd64 it works fine, so FreeBSD alone does not seem to be the cause.

powersj commented 1 year ago

Hi,

As I mentioned in the previous issue, we do not provide an arm64 build for FreeBSD, so to work on this I would need you to build and test PRs.

The procstat metric is generated in the addMetric function here.

procstat_lookup,host=rork.home.sdalu.com,org_destination=IT,result=success pid_count=1i,result_code=0i,running=1i 1694812868000000000

Is this the actual output of procstat_lookup? Before diving much further into this, I want to be certain that running is actually non-zero. If it is zero, or if there were any errors updating the processes, then no procstat metric will be generated.
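To check the running field without reading the line by eye, a small helper along these lines could be used. This is only a sketch: runningCount is a hypothetical name, and the parser is deliberately simplified (it assumes no escaped spaces or commas, which holds for the sample line above but not for line protocol in general).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// runningCount extracts the integer "running" field from one line of
// procstat_lookup output in InfluxDB line protocol.
// Simplification: the line is split on whitespace, so escaped spaces or
// commas in tags and fields are not handled.
func runningCount(line string) (int64, error) {
	parts := strings.Fields(line) // measurement+tags, fields, timestamp
	if len(parts) < 2 {
		return 0, fmt.Errorf("unexpected line format: %q", line)
	}
	for _, kv := range strings.Split(parts[1], ",") {
		key, value, ok := strings.Cut(kv, "=")
		if ok && key == "running" {
			// Integer fields carry a trailing "i" in line protocol.
			return strconv.ParseInt(strings.TrimSuffix(value, "i"), 10, 64)
		}
	}
	return 0, fmt.Errorf("no running field in %q", line)
}

func main() {
	line := "procstat_lookup,host=rork.home.sdalu.com,org_destination=IT,result=success " +
		"pid_count=1i,result_code=0i,running=1i 1694812868000000000"
	n, err := runningCount(line)
	if err != nil {
		panic(err)
	}
	fmt.Println("running:", n)
}
```

A non-zero result means the lookup found a live process, so a missing procstat metric has to come from a later step.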

Finally, the data is all gathered via gopsutil's library, so I think we should try outside of telegraf as well.

If you create a directory and create two files:

main.go:

package main

import (
    "fmt"
    "os"

    "github.com/shirou/gopsutil/process"
)

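func main() {
    // Look up this program's own process entry via gopsutil.
    currentPid := os.Getpid()
    myself, err := process.NewProcess(int32(currentPid))
    if err != nil {
        panic(err)
    }
    // Println prints every return value of each call, so a trailing <nil>
    // in the output is the returned error, not the value itself.
    fmt.Println(myself.Name())
    fmt.Println(myself.String())
    fmt.Println(myself.NumThreads())
    fmt.Println(myself.RlimitUsage(true))
    fmt.Println(myself.Status())
}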

go.mod - replace the go version with whatever you have locally:

module test-process

go 1.21

Then fetch the gopsutil dependency with go mod tidy, and either run this directly via go run . or build it with go build . and run the resulting test-process binary.

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!

sdalu commented 1 year ago

Sorry for the late answer

> procstat_lookup,host=rork.home.sdalu.com,org_destination=IT,result=success pid_count=1i,result_code=0i,running=1i 1694812868000000000
>
> Is this the actual output of procstat_lookup? Before diving much further into this, I want to be certain that running is actually non-zero. If it is zero, or if there were any errors updating the processes, then no procstat metric will be generated.

Yes, that's the actual output.

> Finally, the data is all gathered via gopsutil's library, so I think we should try outside of telegraf as well. [...] And either run this directly via go run . or build it go build . and run the test-process binary.

Output is:

 <nil>
{"pid":91329}
0 <nil>
[] not implemented yet
 <nil>

powersj commented 1 year ago

Gopsutil is providing a nil name and other metrics, which means we are skipping the process. Here is the code in Telegraf, which checks for the nil name and comments that if it is nil we assume we are not getting anything else. Based on the output, the other calls also seem to return default values or nil.

I would suggest filing an upstream issue with the gopsutil project to get this added or enabled there. You can use the example code I provided in my previous comment as a way to reproduce.
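The skip behaviour described above can be pictured with a small stand-alone sketch. This is not Telegraf's actual code: namer, stubProc, shouldSkip and errNotImplemented are hypothetical names, and the rule (treat an empty name or a lookup error as "nothing else is available") is an assumption based on the description in this comment.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented mimics the "not implemented yet" text seen in the output.
var errNotImplemented = errors.New("not implemented yet")

// namer is a hypothetical, minimal view of a process: only the name lookup.
type namer interface {
	Name() (string, error)
}

// stubProc is a test double standing in for a real process handle.
type stubProc struct {
	name string
	err  error
}

func (s stubProc) Name() (string, error) { return s.name, s.err }

// shouldSkip reports whether a process would be left out of the metrics.
// Assumption: an empty name or a lookup error means the platform backend
// cannot supply the remaining fields either.
func shouldSkip(p namer) bool {
	name, err := p.Name()
	return err != nil || name == ""
}

func main() {
	cases := []struct {
		label string
		proc  namer
	}{
		{"healthy", stubProc{name: "telegraf"}},
		{"empty name", stubProc{}}, // what the FreeBSD arm64 output resembles
		{"lookup failed", stubProc{err: errNotImplemented}},
	}
	for _, c := range cases {
		fmt.Printf("%-14s skip=%v\n", c.label, shouldSkip(c.proc))
	}
}
```

Under that assumption, a process whose name comes back empty with a nil error, as in the output above, is dropped without any error being logged, which matches seeing procstat_lookup but no procstat.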

powersj commented 5 months ago

@sdalu,

I have put up https://github.com/influxdata/telegraf/pull/15272, which includes an update to the gopsutil library. Your upstream issue appears to have been fixed back in March, so it is likely that our last release already has this fix. Could you please download the artifacts from that PR, which will be attached as a comment ~30mins from this message, and let me know if this resolves the issue?

Thanks!

sdalu commented 5 months ago

I downloaded telegraf-1.31.0~553d972c_freebsd_armv7.tar.gz and ran

./telegraf-1.31.0/usr/bin/telegraf --config /usr/local/etc/telegraf.conf --debug

Got a panic

2024-05-02T13:09:40Z E! FATAL: [inputs.procstat] panicked: runtime error: invalid memory address or nil pointer dereference, Stack:
goroutine 147 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0x4d410370)
    /go/src/github.com/influxdata/telegraf/agent/agent.go:1202 +0x74
panic({0x67aa400, 0xc587b20})
    /usr/local/go/src/runtime/panic.go:770 +0xfc
github.com/shirou/gopsutil/v3/process.(*Process).createTimeWithContext(0x4d0a0368, {0x8232a44, 0xc9983c0})
    /go/pkg/mod/github.com/shirou/gopsutil/v3@v3.24.4/process/process_freebsd.go:121 +0x4c
github.com/shirou/gopsutil/v3/process.(*Process).CreateTimeWithContext(0x4d0a0368, {0x8232a44, 0xc9983c0})
    /go/pkg/mod/github.com/shirou/gopsutil/v3@v3.24.4/process/process.go:310 +0x74
github.com/shirou/gopsutil/v3/process.NewProcessWithContext({0x8232a44, 0xc9983c0}, 0x3744)
    /go/pkg/mod/github.com/shirou/gopsutil/v3@v3.24.4/process/process.go:218 +0x78
github.com/shirou/gopsutil/v3/process.NewProcess(...)
    /go/pkg/mod/github.com/shirou/gopsutil/v3@v3.24.4/process/process.go:203
github.com/influxdata/telegraf/plugins/inputs/procstat.newProc(0x3744)
    /go/src/github.com/influxdata/telegraf/plugins/inputs/procstat/process.go:38 +0x30
github.com/influxdata/telegraf/plugins/inputs/procstat.(*Procstat).gatherOld(0x4ccc6e48, {0x824a858, 0x4d40cae0})
    /go/src/github.com/influxdata/telegraf/plugins/inputs/procstat/procstat.go:209 +0x848
github.com/influxdata/telegraf/plugins/inputs/procstat.(*Procstat).Gather(0x4ccc6e48, {0x824a858, 0x4d40cae0})
    /go/src/github.com/influxdata/telegraf/plugins/inputs/procstat/procstat.go:166 +0x38
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0x4d410370, {0x824a858, 0x4d40cae0})
    /go/src/github.com/influxdata/telegraf/models/running_input.go:227 +0x2c4
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
    /go/src/github.com/influxdata/telegraf/agent/agent.go:583 +0x70
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 120
    /go/src/github.com/influxdata/telegraf/agent/agent.go:581 +0xc0

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x4d310b68)
    /usr/local/go/src/runtime/sema.go:62 +0x3c
sync.(*WaitGroup).Wait(0
2024-05-02T13:09:40Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose

powersj commented 5 months ago

Well, that's no good! Can you please file a second upstream issue with that stack trace? It does appear that gopsutil's createTimeWithContext function is the cause of the crash.