quynhvuongg / Prometheus

Monitoring network #12

Open quynhvuongg opened 4 years ago

quynhvuongg commented 4 years ago

Monitoring Network

Network monitoring criteria on the host

Related metrics

1. Server availability

image

CPU

Node-exporter

image

Cadvisor

Memory

Node-exporter

image

Cadvisor

Disk

Node-exporter

image

...

2. Network device availability

SNMP-exporter (Router)

_cpuusage1min: 1.3.6.1.4.1.9.9.109.1.1.1.1.7 (cpmCPUTotal1minRev)

_memoryusage: 1.3.6.1.4.1.9.9.48.1.1.1.5 (ciscoMemoryPoolUsed)

_listport: 1.3.6.1.2.1.17.1.4.1.2 (dot1dBasePortIfIndex)

_porttraffic:

_portspeed: 1.3.6.1.2.1.2.2.1.5 (ifSpeed)

_descriptioninterfaces: 1.3.6.1.2.1.2.2.1.2 (ifDescr)

_packeterror:

...

3. Link availability

Node-exporter

Network speed:

Network traffic:

image

image

Network mtu:

Network Socket:

image

Network Static:

image

image

Alert rules

  1. Memory almost full (available memory < 20%)
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (< 20% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_usage_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container Memory usage (instance {{ $labels.instance }})"
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. High CPU usage (> 80%)
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host high CPU load (instance {{ $labels.instance }})"
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. Disk is reading too much data (> 50 MB/s)
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. Disk is writing too much data (> 50 MB/s)
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. Disk space almost full (< 10% available)
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/rootfs"}  * 100) / node_filesystem_size_bytes{mountpoint="/rootfs"} < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. High inbound throughput (> 100 MB/s)
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. High outbound throughput (> 100 MB/s)
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  1. Socket

  2. Netstat

  3. SNMP
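
The disk and network rules above share one piece of arithmetic: take the per-second increase of a byte counter, divide by 1024 twice, and compare the result with a threshold. The Go sketch below illustrates that arithmetic only. The `sample` type and the `perSecondMiB` helper are hypothetical names made up for this example, and the reset handling is a simplified approximation of what `irate` does with the last two samples in the range.

```go
package main

import "fmt"

// sample is a hypothetical (timestamp, value) pair for a byte counter such as
// node_network_receive_bytes_total.
type sample struct {
	unixSeconds float64
	bytes       float64
}

// perSecondMiB approximates irate(counter[...]) / 1024 / 1024 using the last
// two samples. A value that went down is treated as a counter reset.
func perSecondMiB(prev, last sample) float64 {
	elapsed := last.unixSeconds - prev.unixSeconds
	if elapsed <= 0 {
		return 0
	}
	delta := last.bytes - prev.bytes
	if delta < 0 {
		delta = last.bytes // counter restarted from zero
	}
	return delta / elapsed / 1024 / 1024
}

func main() {
	// Two made-up scrapes, 15 seconds apart.
	prev := sample{unixSeconds: 0, bytes: 0}
	last := sample{unixSeconds: 15, bytes: 15 * 120 * 1024 * 1024}

	rate := perSecondMiB(prev, last)
	fmt.Printf("%.1f MiB/s, over the 100 threshold: %v\n", rate, rate > 100)
}
```

Because the expression divides bytes by 1024 twice, the thresholds are in megabytes (strictly, mebibytes) per second, not megabits. The `for: 5m` clause additionally requires the condition to hold for five minutes before the alert fires.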

vanduc95 commented 4 years ago

Please read up on how the exporter actually collects these metrics, and cite the code as well.

quynhvuongg commented 4 years ago

Exporters use the Prometheus client libraries (Go, Python, Java, Ruby) to build Collectors that collect metrics.

To produce metrics such as node_netstat.*, node_sockstat.* and node_network.*, node-exporter uses Collectors that read raw figures from /proc/net on Linux and process them.
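
As a rough illustration of what "read from /proc/net and process" means, here is a minimal Go sketch that parses text in the format of /proc/net/sockstat into name/value pairs. It is not node-exporter's code: `parseSockstat` is a hypothetical helper written for this example, and the sample text is made up. The resulting keys line up with metric names such as node_sockstat_TCP_inuse once the node_sockstat_ prefix is added.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSockstat turns the text of /proc/net/sockstat into a flat map such as
// "TCP_inuse" -> 6. Each line is "<protocol>: <key> <value> <key> <value> ...".
func parseSockstat(text string) (map[string]float64, error) {
	stats := map[string]float64{}
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		// One protocol field followed by key/value pairs gives an odd count.
		if len(fields)%2 != 1 || !strings.HasSuffix(fields[0], ":") {
			return nil, fmt.Errorf("malformed sockstat line: %q", scanner.Text())
		}
		protocol := strings.TrimSuffix(fields[0], ":")
		for i := 1; i < len(fields); i += 2 {
			value, err := strconv.ParseFloat(fields[i+1], 64)
			if err != nil {
				return nil, err
			}
			stats[protocol+"_"+fields[i]] = value
		}
	}
	return stats, scanner.Err()
}

func main() {
	// Sample content; on a Linux host it would be read from /proc/net/sockstat.
	sample := "sockets: used 292\nTCP: inuse 6 orphan 0 tw 0 alloc 9 mem 1\nUDP: inuse 4 mem 2\n"

	stats, err := parseSockstat(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("node_sockstat_TCP_inuse", stats["TCP_inuse"])
	fmt.Println("node_sockstat_sockets_used", stats["sockets_used"])
}
```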

image

Ex: node_exporter/collector/tcpstat_linux.go

Initializing the collector

type tcpStatCollector struct {
    desc   typedDesc
    logger log.Logger
}

func init() {
    registerCollector("tcpstat", defaultDisabled, NewTCPStatCollector)
}

// NewTCPStatCollector returns a new Collector exposing network stats.
func NewTCPStatCollector(logger log.Logger) (Collector, error) {
    return &tcpStatCollector{
        desc: typedDesc{prometheus.NewDesc(
            prometheus.BuildFQName(namespace, "tcp", "connection_states"),
            "Number of connection states.",
            []string{"state"}, nil,
        ), prometheus.GaugeValue},
        logger: logger,
    }, nil
}
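
The `init()` / `registerCollector` pair above is a factory-registry pattern: each collector file registers a constructor under a name together with a default on/off state, and the exporter later builds only the enabled ones. The sketch below reproduces that pattern with the standard library only. The `Collector` interface is simplified, `demoCollector` and `enabledCollectors` are hypothetical, and the real exporter drives the on/off state from `--collector.<name>` command-line flags rather than from a map edited by hand.

```go
package main

import (
	"fmt"
	"sort"
)

// Collector is a simplified stand-in for node_exporter's Collector interface;
// the real Update method sends prometheus.Metric values on a channel.
type Collector interface {
	Update() (map[string]float64, error)
}

const (
	defaultEnabled  = true
	defaultDisabled = false
)

var (
	factories    = map[string]func() (Collector, error){}
	enabledState = map[string]bool{}
)

// registerCollector records a factory under a name together with its default
// on/off state. Collector files call it from init().
func registerCollector(name string, enabledByDefault bool, factory func() (Collector, error)) {
	factories[name] = factory
	enabledState[name] = enabledByDefault
}

// demoCollector is a hypothetical collector that returns a fixed value.
type demoCollector struct{}

func (demoCollector) Update() (map[string]float64, error) {
	return map[string]float64{"demo_value": 1}, nil
}

func init() {
	registerCollector("demo", defaultEnabled, func() (Collector, error) { return demoCollector{}, nil })
	registerCollector("tcpstat", defaultDisabled, func() (Collector, error) { return demoCollector{}, nil })
}

// enabledCollectors builds every collector whose state is on and returns
// their names sorted alphabetically.
func enabledCollectors() ([]string, error) {
	var names []string
	for name, factory := range factories {
		if !enabledState[name] {
			continue
		}
		if _, err := factory(); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	names, _ := enabledCollectors()
	fmt.Println("enabled by default:", names)

	enabledState["tcpstat"] = true // roughly the effect of --collector.tcpstat
	names, _ = enabledCollectors()
	fmt.Println("after enabling tcpstat:", names)
}
```

This also explains why node_tcp_connection_states does not show up on a default installation: the tcpstat collector is registered as `defaultDisabled` and has to be switched on explicitly.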

Collecting and processing metrics

func (c *tcpStatCollector) Update(ch chan<- prometheus.Metric) error {
    tcpStats, err := getTCPStats(procFilePath("net/tcp"))
    if err != nil {
        return fmt.Errorf("couldn't get tcpstats: %s", err)
    }

    // if enabled ipv6 system
    tcp6File := procFilePath("net/tcp6")
    if _, hasIPv6 := os.Stat(tcp6File); hasIPv6 == nil {
        tcp6Stats, err := getTCPStats(tcp6File)
        if err != nil {
            return fmt.Errorf("couldn't get tcp6stats: %s", err)
        }

        for st, value := range tcp6Stats {
            tcpStats[st] += value
        }
    }

    for st, value := range tcpStats {
        ch <- c.desc.mustNewConstMetric(value, st.String())
    }
    return nil
}

func getTCPStats(statsFile string) (map[tcpConnectionState]float64, error) {
    file, err := os.Open(statsFile)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    return parseTCPStats(file)
}

func parseTCPStats(r io.Reader) (map[tcpConnectionState]float64, error) {
    tcpStats := map[tcpConnectionState]float64{}
    contents, err := ioutil.ReadAll(r)
    if err != nil {
        return nil, err
    }

    for _, line := range strings.Split(string(contents), "\n")[1:] {
        parts := strings.Fields(line)
        if len(parts) == 0 {
            continue
        }
        if len(parts) < 5 {
            return nil, fmt.Errorf("invalid TCP stats line: %q", line)
        }

        qu := strings.Split(parts[4], ":")
        if len(qu) < 2 {
            return nil, fmt.Errorf("cannot parse tx_queues and rx_queues: %q", line)
        }

        tx, err := strconv.ParseUint(qu[0], 16, 64)
        if err != nil {
            return nil, err
        }
        tcpStats[tcpConnectionState(tcpTxQueuedBytes)] += float64(tx)

        rx, err := strconv.ParseUint(qu[1], 16, 64)
        if err != nil {
            return nil, err
        }
        tcpStats[tcpConnectionState(tcpRxQueuedBytes)] += float64(rx)

        st, err := strconv.ParseInt(parts[3], 16, 8)
        if err != nil {
            return nil, err
        }

        tcpStats[tcpConnectionState(st)]++

    }

    return tcpStats, nil
}
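
To make the field indexes in `parseTCPStats` concrete, the sketch below decodes a single line in /proc/net/tcp format: after splitting on whitespace, field 3 is the connection state in hex and field 4 is `tx_queue:rx_queue`, also in hex. `decodeLine` and `socketInfo` are hypothetical names for this example, and the sample line is made up (a socket listening on 127.0.0.1:25 with 16 bytes waiting in the receive queue).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// socketInfo holds the three values parseTCPStats reads from one line.
type socketInfo struct {
	state   int64  // numeric TCP state, e.g. 0x0A = 10 = LISTEN
	txQueue uint64 // bytes waiting to be sent
	rxQueue uint64 // bytes waiting to be read
}

// decodeLine is a hypothetical helper that applies the same field indexes as
// parseTCPStats to a single line.
func decodeLine(line string) (socketInfo, error) {
	parts := strings.Fields(line)
	if len(parts) < 5 {
		return socketInfo{}, fmt.Errorf("invalid TCP stats line: %q", line)
	}
	queues := strings.Split(parts[4], ":")
	if len(queues) < 2 {
		return socketInfo{}, fmt.Errorf("cannot parse tx_queue and rx_queue: %q", line)
	}
	tx, err := strconv.ParseUint(queues[0], 16, 64)
	if err != nil {
		return socketInfo{}, err
	}
	rx, err := strconv.ParseUint(queues[1], 16, 64)
	if err != nil {
		return socketInfo{}, err
	}
	st, err := strconv.ParseInt(parts[3], 16, 8)
	if err != nil {
		return socketInfo{}, err
	}
	return socketInfo{state: st, txQueue: tx, rxQueue: rx}, nil
}

func main() {
	line := "   0: 0100007F:0019 00000000:0000 0A 00000000:00000010 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0"

	info, err := decodeLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("state=%d tx=%d rx=%d\n", info.state, info.txQueue, info.rxQueue)
}
```

In the collector, each decoded line increments the counter for its state, and the two queue sizes are summed into the pseudo-states `tx_queued_bytes` and `rx_queued_bytes`.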

Exposes metrics

func (st tcpConnectionState) String() string {
    switch st {
    case tcpEstablished:
        return "established"
    case tcpSynSent:
        return "syn_sent"
    case tcpSynRecv:
        return "syn_recv"
    case tcpFinWait1:
        return "fin_wait1"
    case tcpFinWait2:
        return "fin_wait2"
    case tcpTimeWait:
        return "time_wait"
    case tcpClose:
        return "close"
    case tcpCloseWait:
        return "close_wait"
    case tcpLastAck:
        return "last_ack"
    case tcpListen:
        return "listen"
    case tcpClosing:
        return "closing"
    case tcpRxQueuedBytes:
        return "rx_queued_bytes"
    case tcpTxQueuedBytes:
        return "tx_queued_bytes"
    default:
        return "unknown"
    }
}
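
The strings returned by `String()` become the values of the `state` label, so each entry in the `tcpStats` map ends up as one time series of node_tcp_connection_states. The sketch below shows roughly what those series look like in the text exposition format. `renderStates` is a hypothetical helper for illustration, the counts are made up, and the real formatting is done by the Prometheus client library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderStates formats per-state counts the way they would appear on /metrics
// for a gauge with a single "state" label. Lines are sorted for stable output.
func renderStates(metric string, counts map[string]float64) string {
	states := make([]string, 0, len(counts))
	for state := range counts {
		states = append(states, state)
	}
	sort.Strings(states)

	var b strings.Builder
	for _, state := range states {
		fmt.Fprintf(&b, "%s{state=%q} %g\n", metric, state, counts[state])
	}
	return b.String()
}

func main() {
	// Made-up counts keyed by the label values from String().
	counts := map[string]float64{"established": 3, "listen": 2, "time_wait": 7}
	fmt.Print(renderStates("node_tcp_connection_states", counts))
}
```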
quynhvuongg commented 4 years ago

@vtdat Here it is.

vtdat commented 4 years ago

OK, read carefully and make sure you understand ALL of the metrics above, including the code that collects them.

I'll quiz you on it next time.

quynhvuongg commented 4 years ago

Just node-exporter, or all of them? :<

vtdat commented 4 years ago

Scope it to node-exporter first.