quynhvuongg commented 4 years ago

Monitoring Network

Network monitoring criteria on a host

Server availability: CPU, memory, disk, ...
Network device availability (routers, switches, ...): CPU, memory, interfaces, ...
Link availability: traffic rate, throughput, errors, and connections over TCP, UDP, ICMP, IP, ... (in, out, error)

Related metrics

1. Server availability

CPU

Node-exporter

node_cpu_seconds_total

cAdvisor

container_cpu_load_average_10s
container_cpu_system_seconds_total
container_cpu_usage_seconds_total

Memory

Node-exporter

node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Active_bytes
node_memory_Inactive_bytes
node_memory_Cached_bytes
node_memory_Buffers_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes
node_memory_SwapCached_bytes

cAdvisor

container_memory_usage_bytes
container_memory_swap
container_memory_cache

Disk

Node-exporter

node_disk_discard_time_seconds_total
node_disk_discarded_sectors_total
node_disk_discards_completed_total
node_disk_discards_merged_total
node_disk_io_now
node_disk_io_time_seconds_total
node_disk_read_bytes_total
node_disk_read_time_seconds_total
node_disk_reads_completed_total
node_disk_reads_merged_total
node_disk_written_bytes_total
node_disk_write_time_seconds_total
node_disk_writes_completed_total
node_disk_writes_merged_total

...

2. Network device availability

SNMP-exporter (Router)

_cpuusage1min: 1.3.6.1.4.1.9.9.109.1.1.1.1.7

_memoryusage: 1.3.6.1.4.1.9.9.48.1.1.1.5

_listport: 1.3.6.1.2.1.17.1.4.1.2 (dot1dBasePortIfIndex)

_porttraffic:

1.3.6.1.2.1.2.2.1.10 (ifInOctets)
1.3.6.1.2.1.2.2.1.16 (ifOutOctets)

_portspeed: 1.3.6.1.2.1.2.2.1.5 (ifSpeed)

_descriptioninterfaces: 1.3.6.1.2.1.2.2.1.2 (ifDescr)

_packeterror:

1.3.6.1.2.1.2.2.1.20 (ifOutErrors)
1.3.6.1.2.1.2.2.1.14 (ifInErrors)

...
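To show how the _porttraffic and _portspeed OIDs fit together, here is a minimal sketch that turns two ifInOctets samples into a utilization percentage. The function names and sample values are made up for illustration, and it assumes ifInOctets is a 32-bit counter that wraps at most once between polls and that ifSpeed is reported in bits per second.

```go
package main

import "fmt"

// counterDelta returns the increase between two samples of a 32-bit SNMP
// counter such as ifInOctets, allowing for a single wrap past 2^32-1.
func counterDelta(prev, curr uint32) uint64 {
	// Unsigned subtraction wraps modulo 2^32, which gives the right delta
	// as long as the counter wrapped at most once between samples.
	return uint64(curr - prev)
}

// utilizationPercent converts an octet delta over intervalSeconds into a
// percentage of the interface speed (ifSpeed, in bits per second).
func utilizationPercent(deltaOctets uint64, intervalSeconds float64, ifSpeedBps uint64) float64 {
	if intervalSeconds <= 0 || ifSpeedBps == 0 {
		return 0
	}
	bitsPerSecond := float64(deltaOctets) * 8 / intervalSeconds
	return bitsPerSecond / float64(ifSpeedBps) * 100
}

func main() {
	// Hypothetical samples taken 60 seconds apart on a 100 Mbit/s port;
	// the counter wrapped between the two polls.
	delta := counterDelta(4294967000, 74999704)
	fmt.Printf("delta=%d octets, utilization=%.1f%%\n",
		delta, utilizationPercent(delta, 60, 100_000_000))
}
```

In Prometheus the same idea would normally be expressed with rate() over the scraped counter, so this only illustrates the arithmetic.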

3. Link availability

Node-exporter

Network speed:

node_network_speed_bytes

Network traffic:

node_network_transmit_bytes_total
node_network_transmit_drop_total
node_network_transmit_errs_total
node_network_transmit_packets_total
node_network_receive_bytes_total
node_network_receive_carrier_total
node_network_receive_drop_total
node_network_receive_errs_total
node_network_receive_packets_total
node_network_receive_frame_total

node_network_transmit_carrier_total
node_network_carrier_up_changes_total
node_network_carrier_down_changes_total

Network mtu:

node_network_mtu_bytes

Network Socket:

node_sockstat_sockets_used
node_sockstat_TCP_alloc
node_sockstat_TCP_inuse
node_sockstat_TCP_mem
node_sockstat_TCP_mem_bytes
node_sockstat_TCP_orphan
node_sockstat_TCP_tw
node_sockstat_UDP_inuse
node_sockstat_UDPLITE_inuse
node_sockstat_UDP_mem
node_sockstat_UDP_mem_bytes

Network statistics:

node_netstat_Icmp_InMsgs
node_netstat_Icmp_OutMsgs
node_netstat_Icmp_InErrors
node_netstat_IpExt_InOctets
node_netstat_IpExt_OutOctets
node_netstat_Tcp_InSegs
node_netstat_Tcp_OutSegs
node_netstat_Tcp_InErrs
node_netstat_Tcp_ActiveOpens
node_netstat_Tcp_PassiveOpens
node_netstat_Tcp_RetransSegs
node_netstat_TcpExt_ListenDrops
node_netstat_TcpExt_ListenOverflows
node_netstat_TcpExt_SyncookiesFailed
node_netstat_TcpExt_SyncookiesRecv
node_netstat_TcpExt_SyncookiesSent
node_netstat_TcpExt_TCPSynRetrans

node_netstat_Udp_InDatagrams
node_netstat_Udp_OutDatagrams
node_netstat_Udp_InErrors

Alert rules

Memory is almost full (available memory < 20%)

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (< 20% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_usage_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Memory usage (instance {{ $labels.instance }})"
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

High CPU usage (> 80%)

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host high CPU load (instance {{ $labels.instance }})"
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
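The HostHighCpuLoad expression can be hard to read at first, so here is a rough sketch of the arithmetic behind it: take the per-second increase of each core's idle counter, average it across cores, and subtract from 100. The function name and the sample numbers are hypothetical, and counter resets are ignored to keep it short.

```go
package main

import "fmt"

// busyPercent mimics 100 - avg(rate(idle)) * 100 for a single instance.
// prevIdle and currIdle hold the idle-seconds counter of each CPU core at two
// scrape times that are elapsedSeconds apart.
func busyPercent(prevIdle, currIdle []float64, elapsedSeconds float64) (float64, error) {
	if len(prevIdle) == 0 || len(prevIdle) != len(currIdle) {
		return 0, fmt.Errorf("need the same non-zero number of cores in both samples")
	}
	if elapsedSeconds <= 0 {
		return 0, fmt.Errorf("elapsed time must be positive")
	}
	var idleFractionSum float64
	for i := range prevIdle {
		// Per-second increase of the idle counter = fraction of time spent idle.
		idleFractionSum += (currIdle[i] - prevIdle[i]) / elapsedSeconds
	}
	avgIdle := idleFractionSum / float64(len(prevIdle))
	return 100 - avgIdle*100, nil
}

func main() {
	// Hypothetical 2-core host sampled 15 s apart: the cores were idle
	// for 3 s and 1.5 s of that window.
	busy, err := busyPercent([]float64{1000, 2000}, []float64{1003, 2001.5}, 15)
	if err != nil {
		panic(err)
	}
	fmt.Printf("busy=%.0f%% alert=%v\n", busy, busy > 80)
}
```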

Disk is reading too much data (> 50 MB/s)

  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Disk is writing too much data (> 50 MB/s)

  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Disk space is almost full (< 10% available)

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/rootfs"}  * 100) / node_filesystem_size_bytes{mountpoint="/rootfs"} < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

High inbound throughput (> 100 MB/s)

  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

High outbound throughput (> 100 MB/s)

  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
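The disk and network rules divide a bytes-per-second rate by 1024 twice, so their thresholds are in megabytes (MiB) per second, not megabits. A small sketch of the conversion, with helper names made up for illustration, shows how far apart the two units are at the 100 MB/s network threshold:

```go
package main

import "fmt"

// toMiBPerSecond applies the same "/ 1024 / 1024" scaling as the alert rules.
func toMiBPerSecond(bytesPerSecond float64) float64 {
	return bytesPerSecond / 1024 / 1024
}

// toMbitPerSecond expresses the same rate in megabits per second (10^6 bits).
func toMbitPerSecond(bytesPerSecond float64) float64 {
	return bytesPerSecond * 8 / 1e6
}

func main() {
	// A rate sitting exactly at the 100 MiB/s network threshold.
	rate := 100.0 * 1024 * 1024
	fmt.Printf("%.0f MiB/s = %.2f Mbit/s\n", toMiBPerSecond(rate), toMbitPerSecond(rate))
}
```

So on a 1 Gbit/s link the network rules only fire when the interfaces are close to saturation; a lower threshold may suit slower links.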

Socket
Netstat
SNMP

vanduc95 commented 4 years ago

Please read up on how the exporter actually collects these metrics, and point to the code as evidence.

quynhvuongg commented 4 years ago

An exporter uses one of the Prometheus client libraries (Go, Java, Python, Ruby, ...) to build a Collector that collects metrics.

To produce metrics such as node_netstat.*, node_sockstat.* and node_network.*, Node-exporter uses Collectors that read the raw numbers from /proc/net on Linux and process them.

Ex: node_exporter/collector/tcpstat_linux.go

Initializing the collector

type tcpStatCollector struct {
    desc   typedDesc
    logger log.Logger
}

func init() {
    registerCollector("tcpstat", defaultDisabled, NewTCPStatCollector)
}

// NewTCPStatCollector returns a new Collector exposing network stats.
func NewTCPStatCollector(logger log.Logger) (Collector, error) {
    return &tcpStatCollector{
        desc: typedDesc{prometheus.NewDesc(
            prometheus.BuildFQName(namespace, "tcp", "connection_states"),
            "Number of connection states.",
            []string{"state"}, nil,
        ), prometheus.GaugeValue},
        logger: logger,
    }, nil
}

Collecting and processing metrics

func (c *tcpStatCollector) Update(ch chan<- prometheus.Metric) error {
    tcpStats, err := getTCPStats(procFilePath("net/tcp"))
    if err != nil {
        return fmt.Errorf("couldn't get tcpstats: %s", err)
    }

    // if enabled ipv6 system
    tcp6File := procFilePath("net/tcp6")
    if _, hasIPv6 := os.Stat(tcp6File); hasIPv6 == nil {
        tcp6Stats, err := getTCPStats(tcp6File)
        if err != nil {
            return fmt.Errorf("couldn't get tcp6stats: %s", err)
        }

        for st, value := range tcp6Stats {
            tcpStats[st] += value
        }
    }

    for st, value := range tcpStats {
        ch <- c.desc.mustNewConstMetric(value, st.String())
    }
    return nil
}

func getTCPStats(statsFile string) (map[tcpConnectionState]float64, error) {
    file, err := os.Open(statsFile)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    return parseTCPStats(file)
}

func parseTCPStats(r io.Reader) (map[tcpConnectionState]float64, error) {
    tcpStats := map[tcpConnectionState]float64{}
    contents, err := ioutil.ReadAll(r)
    if err != nil {
        return nil, err
    }

    for _, line := range strings.Split(string(contents), "\n")[1:] {
        parts := strings.Fields(line)
        if len(parts) == 0 {
            continue
        }
        if len(parts) < 5 {
            return nil, fmt.Errorf("invalid TCP stats line: %q", line)
        }

        qu := strings.Split(parts[4], ":")
        if len(qu) < 2 {
            return nil, fmt.Errorf("cannot parse tx_queues and rx_queues: %q", line)
        }

        tx, err := strconv.ParseUint(qu[0], 16, 64)
        if err != nil {
            return nil, err
        }
        tcpStats[tcpConnectionState(tcpTxQueuedBytes)] += float64(tx)

        rx, err := strconv.ParseUint(qu[1], 16, 64)
        if err != nil {
            return nil, err
        }
        tcpStats[tcpConnectionState(tcpRxQueuedBytes)] += float64(rx)

        st, err := strconv.ParseInt(parts[3], 16, 8)
        if err != nil {
            return nil, err
        }

        tcpStats[tcpConnectionState(st)]++

    }

    return tcpStats, nil
}

Exposes metrics

func (st tcpConnectionState) String() string {
    switch st {
    case tcpEstablished:
        return "established"
    case tcpSynSent:
        return "syn_sent"
    case tcpSynRecv:
        return "syn_recv"
    case tcpFinWait1:
        return "fin_wait1"
    case tcpFinWait2:
        return "fin_wait2"
    case tcpTimeWait:
        return "time_wait"
    case tcpClose:
        return "close"
    case tcpCloseWait:
        return "close_wait"
    case tcpLastAck:
        return "last_ack"
    case tcpListen:
        return "listen"
    case tcpClosing:
        return "closing"
    case tcpRxQueuedBytes:
        return "rx_queued_bytes"
    case tcpTxQueuedBytes:
        return "tx_queued_bytes"
    default:
        return "unknown"
    }
}
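The node_netstat.* metrics follow the same pattern, as far as I understand: the netstat collector reads files such as /proc/net/snmp and /proc/net/netstat, where every protocol has a header line of field names followed by a line of values. The sketch below is my own simplified version of that parsing, not the node_exporter code; parseSNMP and the sample text are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSNMP reads text in the /proc/net/snmp layout, where each protocol has a
// header line of field names followed by a line of values, and returns a map
// keyed like "Tcp_InSegs".
func parseSNMP(content string) (map[string]float64, error) {
	stats := map[string]float64{}
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		names := strings.Fields(scanner.Text())
		if len(names) == 0 {
			continue // skip blank lines
		}
		if !scanner.Scan() {
			return nil, fmt.Errorf("missing value line for %q", names[0])
		}
		values := strings.Fields(scanner.Text())
		if len(values) != len(names) || values[0] != names[0] {
			return nil, fmt.Errorf("header and value lines do not match for %q", names[0])
		}
		protocol := strings.TrimSuffix(names[0], ":")
		for i := 1; i < len(names); i++ {
			v, err := strconv.ParseFloat(values[i], 64)
			if err != nil {
				return nil, err
			}
			stats[protocol+"_"+names[i]] = v
		}
	}
	return stats, scanner.Err()
}

func main() {
	// Shortened, made-up sample in the /proc/net/snmp layout.
	sample := "Tcp: ActiveOpens PassiveOpens InSegs OutSegs RetransSegs\n" +
		"Tcp: 12 7 3400 3100 5\n" +
		"Udp: InDatagrams InErrors OutDatagrams\n" +
		"Udp: 250 0 240\n"
	stats, err := parseSNMP(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("node_netstat_Tcp_RetransSegs", stats["Tcp_RetransSegs"])
	fmt.Println("node_netstat_Udp_InDatagrams", stats["Udp_InDatagrams"])
}
```

Prefixing each key with node_netstat_ gives names like node_netstat_Tcp_RetransSegs from the list above.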

quynhvuongg commented 4 years ago

@vtdat Here it is.

vtdat commented 4 years ago

OK, read carefully and understand ALL of the metrics above, including the code that collects them.

I'll ask you about them another day.

quynhvuongg commented 4 years ago

Just node-exporter, or all of them? :<


vtdat commented 4 years ago


Scope it to node-exporter first.
