open-falcon / falcon-plus

An open-source and enterprise-level monitoring system.
Apache License 2.0

Port monitoring data shows NaN on some hosts #926

Closed wbcmax closed 3 years ago

wbcmax commented 3 years ago

Host information

Name             Kernel Version                  Agent Version
Amazon Linux 2   4.14.154-128.181.amzn2.x86_64   5.1.2

Problem description

On some hosts, port monitoring data shows as NaN, while all other data such as cpu and agent.alive is normal.

agent.log shows the following errors:

$ grep '2021/04/01' /data/monitor/agent/logs/agent.log | grep 'exit status 127' | tail
2021/04/01 17:53:03 portstat.go:22: exit status 127
2021/04/01 17:53:03 sockstat.go:12: exit status 127
2021/04/01 17:54:03 portstat.go:22: exit status 127
2021/04/01 17:54:03 sockstat.go:12: exit status 127
2021/04/01 17:55:03 portstat.go:22: exit status 127
2021/04/01 17:55:03 sockstat.go:12: exit status 127
2021/04/01 17:56:03 portstat.go:22: exit status 127
2021/04/01 17:56:03 sockstat.go:12: exit status 127
2021/04/01 17:57:03 portstat.go:22: exit status 127
2021/04/01 17:57:03 sockstat.go:12: exit status 127

After restarting the agent, port data is collected normally:

2021/04/01 18:01:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271292, Value:1>
2021/04/01 18:02:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271352, Value:1>
2021/04/01 18:03:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271412, Value:1>
2021/04/01 18:04:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271472, Value:1>
2021/04/01 18:05:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271532, Value:1>
2021/04/01 18:06:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271592, Value:1>
2021/04/01 18:07:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271652, Value:1>
2021/04/01 18:08:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271712, Value:1>
2021/04/01 18:09:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271772, Value:1>
2021/04/01 18:10:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271832, Value:1>
2021/04/01 18:11:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271892, Value:1>
2021/04/01 18:12:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617271952, Value:1>
2021/04/01 18:13:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272012, Value:1>
2021/04/01 18:14:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272072, Value:1>
2021/04/01 18:15:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272132, Value:1>
2021/04/01 18:16:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272192, Value:1>
2021/04/01 18:17:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272252, Value:1>
2021/04/01 18:18:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272312, Value:1>
2021/04/01 18:19:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272372, Value:1>
2021/04/01 18:20:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272432, Value:1>
2021/04/01 18:21:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272492, Value:1>
2021/04/01 18:22:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272552, Value:1>
2021/04/01 18:23:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272612, Value:1>
2021/04/01 18:24:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272672, Value:1>
2021/04/01 18:25:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272732, Value:1>
2021/04/01 18:26:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272792, Value:1>
2021/04/01 18:27:32 var.go:88: => <Total=9> <Endpoint:MyHost, Metric:net.port.listen, Type:GAUGE, Tags:port=22, Step:60, Time:1617272852, Value:1>

The restart happened at 17:57. After 17:57 there are no more errors like exit status 127:

$ grep '2021/04/01' /data/monitor/agent/logs/agent.log | grep 'exit status 127' | tail
2021/04/01 17:53:03 portstat.go:22: exit status 127
2021/04/01 17:53:03 sockstat.go:12: exit status 127
2021/04/01 17:54:03 portstat.go:22: exit status 127
2021/04/01 17:54:03 sockstat.go:12: exit status 127
2021/04/01 17:55:03 portstat.go:22: exit status 127
2021/04/01 17:55:03 sockstat.go:12: exit status 127
2021/04/01 17:56:03 portstat.go:22: exit status 127
2021/04/01 17:56:03 sockstat.go:12: exit status 127
2021/04/01 17:57:03 portstat.go:22: exit status 127
2021/04/01 17:57:03 sockstat.go:12: exit status 127

Troubleshooting

I downloaded the falcon-plus source and tested the output of portstat.go and sockstat.go locally, but could not reproduce the error.

$ find . -name "portstat.go"
./modules/agent/funcs/portstat.go
./vendor/github.com/toolkits/nux/portstat.go

$ find . -name "sockstat.go"
./modules/agent/funcs/sockstat.go

Locally modified ./vendor/github.com/toolkits/nux/portstat.go:

package main

import (
        "bufio"
        "bytes"
        "fmt"
        "io"
        "strconv"
        "strings"

        "github.com/toolkits/file"
        "github.com/toolkits/slice"
        "github.com/toolkits/sys"
)

// ListeningPorts is kept for compatibility with old code
func ListeningPorts() ([]int64, error) {
        return TcpPorts()
}

func TcpPorts() ([]int64, error) {
        return listeningPorts("sh", "-c", "ss -t -l -n")
}

func UdpPorts() ([]int64, error) {
        return listeningPorts("sh", "-c", "ss -u -a -n")
}

func listeningPorts(name string, args ...string) ([]int64, error) {
        ports := []int64{}

        bs, err := sys.CmdOutBytes(name, args...)
        if err != nil {
                return ports, err
        }

        reader := bufio.NewReader(bytes.NewBuffer(bs))

        // ignore the first line
        line, err := file.ReadLine(reader)
        if err != nil {
                return ports, err
        }

        for {
                line, err = file.ReadLine(reader)
                if err == io.EOF {
                        err = nil
                        break
                } else if err != nil {
                        return ports, err
                }

                fields := strings.Fields(string(line))
                fieldsLen := len(fields)

                if fieldsLen != 4 && fieldsLen != 5 {
                        return ports, fmt.Errorf("output of %s format not supported", name)
                }

                portColumnIndex := 2
                if fieldsLen == 5 {
                        portColumnIndex = 3
                }

                location := strings.LastIndex(fields[portColumnIndex], ":")
                port := fields[portColumnIndex][location+1:]

                if p, e := strconv.ParseInt(port, 10, 64); e != nil {
                        return ports, fmt.Errorf("parse port to int64 fail: %s", e.Error())
                } else {
                        ports = append(ports, p)
                }

        }

        return slice.UniqueInt64(ports), nil
}

func main() {
        fmt.Println(ListeningPorts())
}

Ran it with go run portstat.go. The result:

$ go run portstat.go
[40005 40009 19087 9019 40008 2347 40015 40001 2333 2345 6379 19090 19093 2080 40011 1988 40016 40002 40003 40006 40010 40014 22 2048 3306 9002 40007 2346 40013 9001 80 40004 135] <nil>

Locally modified ./modules/agent/funcs/portstat.go:

package main

import (
        "fmt"
        "github.com/open-falcon/falcon-plus/common/model"
        "github.com/open-falcon/falcon-plus/modules/agent/g"
        "github.com/toolkits/nux"
        "github.com/toolkits/slice"
        "log"
        "strings"
)

func NewMetricValue(metric string, val interface{}, dataType string, tags ...string) *model.MetricValue {
        mv := model.MetricValue{
                Metric: metric,
                Value:  val,
                Type:   dataType,
        }

        size := len(tags)

        if size > 0 {
                mv.Tags = strings.Join(tags, ",")
        }

        return &mv
}

func GaugeValue(metric string, val interface{}, tags ...string) *model.MetricValue {
        return NewMetricValue(metric, val, "GAUGE", tags...)
}

func CounterValue(metric string, val interface{}, tags ...string) *model.MetricValue {
        return NewMetricValue(metric, val, "COUNTER", tags...)
}

func PortMetrics() (L []*model.MetricValue) {

        reportPorts := g.ReportPorts()
        sz := len(reportPorts)
        if sz == 0 {
                return
        }

        allTcpPorts, err := nux.TcpPorts()
        if err != nil {
                log.Println(err)
                return
        }

        allUdpPorts, err := nux.UdpPorts()
        if err != nil {
                log.Println(err)
                return
        }

        for i := 0; i < sz; i++ {
                tags := fmt.Sprintf("port=%d", reportPorts[i])
                if slice.ContainsInt64(allTcpPorts, reportPorts[i]) || slice.ContainsInt64(allUdpPorts, reportPorts[i]) {
                        L = append(L, GaugeValue(g.NET_PORT_LISTEN, 1, tags))
                } else {
                        L = append(L, GaugeValue(g.NET_PORT_LISTEN, 0, tags))
                }
        }

        return
}

func main() {
        fmt.Println(PortMetrics())
}

Ran it with go run portstat.go. The result:

$ go run portstat.go
[]

Shell command test

$ sh -c "ss -t -l -n"
State                  Recv-Q                  Send-Q                                    Local Address:Port                                    Peer Address:Port
LISTEN                 0                       128                                             0.0.0.0:22                                           0.0.0.0:*
LISTEN                 0                       128                                           127.0.0.1:9019                                         0.0.0.0:*
LISTEN                 0                       128                                           127.0.0.1:2333                                         0.0.0.0:*
LISTEN                 0                       128                                           127.0.0.1:2048                                         0.0.0.0:*
LISTEN                 0                       2048                                          127.0.0.1:9001                                         0.0.0.0:*
LISTEN                 0                       2048                                            0.0.0.0:2345                                         0.0.0.0:*
LISTEN                 0                       50                                              0.0.0.0:3306                                         0.0.0.0:*
LISTEN                 0                       50                                            127.0.0.1:9002                                         0.0.0.0:*
LISTEN                 0                       128                                           127.0.0.1:6379                                         0.0.0.0:*
LISTEN                 0                       128                                             0.0.0.0:80                                           0.0.0.0:*
LISTEN                 0                       128                                                   *:19090                                              *:*
LISTEN                 0                       128                                                   *:19093                                              *:*
LISTEN                 0                       128                                                [::]:22                                              [::]:*
LISTEN                 0                       65535                                                 *:2080                                               *:*
LISTEN                 0                       128                                                   *:40001                                              *:*
LISTEN                 0                       128                                                   *:40002                                              *:*
LISTEN                 0                       128                                                   *:40003                                              *:*
LISTEN                 0                       65535                                                 *:1988                                               *:*
LISTEN                 0                       128                                                   *:40004                                              *:*
LISTEN                 0                       128                                                   *:40005                                              *:*
LISTEN                 0                       128                                                   *:40006                                              *:*
LISTEN                 0                       65535                                                 *:135                                                *:*
LISTEN                 0                       128                                                   *:40007                                              *:*
LISTEN                 0                       128                                                   *:40008                                              *:*
LISTEN                 0                       128                                                   *:40009                                              *:*
LISTEN                 0                       65535                                                 *:2346                                               *:*
LISTEN                 0                       128                                                   *:40010                                              *:*
LISTEN                 0                       65535                                                 *:2347                                               *:*
LISTEN                 0                       128                                                   *:40011                                              *:*
LISTEN                 0                       128                                                   *:40013                                              *:*
LISTEN                 0                       128                                                   *:40014                                              *:*
LISTEN                 0                       65535                                                 *:40015                                              *:*
LISTEN                 0                       128                                                   *:19087                                              *:*
LISTEN                 0                       65535                                                 *:40016                                              *:*

falcon-agent config

{
    "debug": true,
    "hostname": "MyHost",
    "ip": "",
    "plugin": {
        "enabled": false,
        "dir": "./plugin",
        "git": "https://github.com/open-falcon/plugin.git",
        "logs": "./logs"
    },
    "heartbeat": {
        "enabled": true,
        "addr": "falcon-server:6030",
        "interval": 60,
        "timeout": 1000
    },
    "transfer": {
        "enabled": true,
        "addrs": [
            "falcon-server:8433"
        ],
        "interval": 60,
        "timeout": 1000
    },
    "http": {
        "enabled": true,
        "listen": ":1988",
        "backdoor": false
    },
    "collector": {
        "ifacePrefix": ["eth", "em"],
        "mountPoint": []
    },
    "default_tags": {
    },
    "ignore": {
        "cpu.busy": true,
        "df.bytes.free": true,
        "df.bytes.total": true,
        "df.bytes.used": true,
        "df.bytes.used.percent": true,
        "df.inodes.total": true,
        "df.inodes.free": true,
        "df.inodes.used": true,
        "df.inodes.used.percent": true,
        "mem.memtotal": true,
        "mem.memused": true,
        "mem.memused.percent": true,
        "mem.memfree": true,
        "mem.swaptotal": true,
        "mem.swapused": true,
        "mem.swapfree": true
    }
}

What is causing this problem?

@laiwei could you please help troubleshoot this? Thanks a lot.

wbcmax commented 3 years ago

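I found what triggers this:

I had written a script that logs in to every agent server over ssh and restarts the agent. After a restart done that way, port monitoring stops working.

When I log in to the agent machine from a terminal and restart it by hand, port monitoring goes back to normal...

Why is that?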

wbcmax commented 3 years ago

Restart command run remotely over ssh:

ssh user@host "cd /data/monitor && ./open-falcon restart agent"
[falcon-agent] down
[falcon-agent] 25692

Manual restart command after logging in from a terminal:

ssh user@host

cd /data/monitor && ./open-falcon restart agent
[falcon-agent] down
[falcon-agent] 25763

wbcmax commented 3 years ago

Problem solved. The fix:

Add source /etc/profile when restarting remotely over ssh.

Example:

ssh user@host "source /etc/profile && cd /data/monitor && ./open-falcon restart agent"

The exact cause is unclear... I hope @laiwei can take a look at this when time allows. An explanation would be great.

laiwei commented 3 years ago

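> Problem solved. The fix:
>
> Add source /etc/profile when restarting remotely over ssh.
>
> Example:
>
> ssh user@host "source /etc/profile && cd /data/monitor && ./open-falcon restart agent"
>
> The exact cause is unclear... I hope @laiwei can take a look at this when time allows. An explanation would be great.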

Please paste the contents of your /etc/profile. @W-BC0001 Alternatively, you could also paste the contents of logs such as /var/log/message.