megaease / easeprobe

A simple, standalone, and lightweight tool that can do health/status checking, written in Go.
Apache License 2.0

Using the increase() function on the Prometheus metric EaseProbe_client_kafka_total returns 0 #442

Closed: shengbinxu closed this issue 11 months ago

shengbinxu commented 12 months ago


Describe the bug: Kafka client probe; the config is:

client:
  - name: Kafka Native Client node0
    driver: "kafka"
    host: "10.25.48.95:9092"
  - name: Kafka Native Client node1
    driver: "kafka"
    host: "10.29.46.118:9092"
  - name: Kafka Native Client node2
    driver: "kafka"
    host: "10.30.84.216:9092"

When node0 is down, it produces the following Prometheus metrics:

[screenshot of the resulting Prometheus metrics]

Then monitor it via PromQL:

[screenshot of the PromQL query result]

Referring to this article (link), I believe it's because the series with the label {status="down"} lacks an initial value.

I'm unsure how to write PromQL to monitor this metric.

Additionally, the value of total{status='up'} keeps increasing continuously (similar to a counter), while the value of total{status='down'} remains constant (the metric remained at 3 for approximately 15 minutes when the Kafka service was down).

Assuming the probe runs every 1 minute, the value of total{status='down'} should increase by 1 each time it's probed. If that were the case, using the increase() function would make it easier to detect such anomalies.

I've thought of a compromise: monitor the increment of total{status='up'}; if the increment is 0, it indicates the service is down. However, I still feel this method is not very intuitive.

[screenshots of the workaround PromQL queries]

Expected behavior

  1. total{status='down'} should increase by 1 each time it's probed.

suchen-sci commented 12 months ago

Thanks for your issue. We will try to figure out what happens here...

suchen-sci commented 11 months ago

Hi, I did some testing on this. Here's my guess about what happens. EaseProbe_client_kafka_total is an ever-increasing number: every time EaseProbe runs a probe, it increases by one. So, for your images: the first image, showing EaseProbe_client_kafka_total{status!="up"}, means your Kafka failed at 11:05; since that number does not increase any more, it only failed once. EaseProbe_client_kafka_total keeps increasing, but Prometheus only ever sees EaseProbe_client_kafka_total{status!="up"} reach 3, so that line is flat.

The EaseProbe_client_kafka_total{status="up"} line keeps increasing (except around 11:05), which means Kafka was alive the whole time except when it went down at about 11:05.

For example, in your case, it looks more like this:

EaseProbe_client_kafka_total=1 status=up
EaseProbe_client_kafka_total=2 status=down
EaseProbe_client_kafka_total=3 status=down
EaseProbe_client_kafka_total=4 status=up
EaseProbe_client_kafka_total=5 status=up
EaseProbe_client_kafka_total=6 status=up
EaseProbe_client_kafka_total=7 status=up
EaseProbe_client_kafka_total=8 status=up
EaseProbe_client_kafka_total=9 status=up
EaseProbe_client_kafka_total=10 status=up
EaseProbe_client_kafka_total=11 status=up
EaseProbe_client_kafka_total=12 status=up
EaseProbe_client_kafka_total=13 status=up
EaseProbe_client_kafka_total=14 status=up
EaseProbe_client_kafka_total=15 status=up

This is what happens here: EaseProbe_client_kafka_total increases every time a probe runs, and its status label is up or down based on the result of that probe.
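
To make those semantics concrete, here is a minimal sketch of how such a counter is typically implemented with prometheus/client_golang. This is an assumption for illustration, not EaseProbe's actual code: one counter family partitioned by a status label, where every probe increments exactly one series.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
    // One counter family, partitioned by a "status" label.
    total := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "EaseProbe_client_kafka_total",
            Help: "Total probe count, labeled by probe result.",
        },
        []string{"status"},
    )

    // Each probe increments exactly one label value, so the
    // {status="down"} series only grows while probes keep failing.
    for _, result := range []string{"up", "down", "down", "up", "up"} {
        total.WithLabelValues(result).Inc()
    }

    fmt.Println("up:", testutil.ToFloat64(total.WithLabelValues("up")))     // 3
    fmt.Println("down:", testutil.ToFloat64(total.WithLabelValues("down"))) // 2
}

Running it prints up: 3 and down: 2, and the down series then stays at 2 until another probe fails, which is why the {status!="up"} line in your graph goes flat after the last failure.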

suchen-sci commented 11 months ago

Here's my naive result; I made the Kafka server keep failing. [screenshot of the resulting metric]

[Screenshot 2023-12-05 12:38:22]
shengbinxu commented 11 months ago

[Figure 1 and Figure 2: Prometheus graphs of EaseProbe_client_kafka_total]

Thanks for your reply.

From Figure 1, it can be seen that Kafka had already hung before 10:45 and was restored at 11:05. I made this judgment because the value in the up state did not increase. However, in Figure 2, there are no values before 11:05.

 8314 Tue Nov 28 11:03:04 2023 /usr/local/java/jdk/bin/java -Xmx1100m -Xms1100m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcur

Also, according to the Linux ps command output above, the Kafka process was restarted at 11:03, so Kafka came back at around 11:05. My impression is that Kafka had been hung for about 30 minutes before 11:05.

At that time, our Kafka encountered a problem and it was uncertain whether the process was hanging. The problem was that the corresponding Linux server could not be logged into. We logged in to the server through the console in the Alibaba Cloud backend and received a prompt that the server was out of memory. Therefore, we restarted the server and manually started the Kafka process (around 11:05).

suchen-sci commented 11 months ago

PR #455 may address the root cause of this problem.

After a careful check of the EaseProbe code, I suspect that the EaseProbe Kafka client timeout does not work. This results in the Kafka client continuously waiting for a response. To verify this, I conducted a test by opening a TCP listener on port 9092, accepting the connection but not sending any data. Consequently, EaseProbe remained in a waiting state, indicating that the timeout was not functioning as expected.

This issue likely arises because the timeout is currently set only for the dialing process and not for the reading process from the connection. I have submitted a pull request to address this issue. However, there could be other factors contributing to this problem, and my analysis represents my best hypothesis.
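
As a rough illustration of that distinction, here is a sketch under assumptions (it is not the actual pull request, and it assumes a segmentio/kafka-go style client, which may differ from EaseProbe's code): a dial timeout only bounds connection setup, so a deadline also has to be set on the connection itself before the metadata read.

package main

import (
    "fmt"
    "time"

    "github.com/segmentio/kafka-go"
)

// checkKafka is a hypothetical helper: dial with a timeout, then also put a
// deadline on the connection so that reads (e.g. ReadPartitions) cannot hang
// forever when a broker accepts the TCP connection but never responds.
func checkKafka(host string, timeout time.Duration) error {
    dialer := &kafka.Dialer{Timeout: timeout} // bounds the dial only
    conn, err := dialer.Dial("tcp", host)
    if err != nil {
        return err
    }
    defer conn.Close()

    // Without this deadline, a broker that never answers keeps the probe
    // stuck in ReadPartitions; with it, the read fails with "i/o timeout".
    if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
        return err
    }
    _, err = conn.ReadPartitions()
    return err
}

func main() {
    if err := checkKafka("127.0.0.1:9092", 10*time.Second); err != nil {
        fmt.Println("probe failed:", err)
        return
    }
    fmt.Println("probe succeeded")
}

Pointed at the do-nothing TCP listener shared a couple of comments below, the version without SetDeadline hangs inside ReadPartitions, while this one fails with an i/o timeout after roughly the configured timeout.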

Thank you for bringing up this issue.

shengbinxu commented 11 months ago

@suchen-sci Thank you so much, it's a big thumbs up!

I conducted a test by opening a TCP connection on port 9092, accepting the connection but not sending any data

How to simulate this process?

suchen-sci commented 11 months ago

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Listen on the Kafka port so the EaseProbe client dials this server instead.
    l, err := net.Listen("tcp", ":9092")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer l.Close()
    for {
        conn, err := l.Accept()
        if err != nil {
            fmt.Println(err)
        } else {
            // Accept the connection but never send a byte, leaving the
            // client waiting indefinitely for a response.
            go func(conn net.Conn) {
                defer conn.Close()
                time.Sleep(300 * time.Minute)
            }(conn)
        }
    }
}

Run go run server.go and change the EaseProbe log level to debug. This naive code just accepts the connection and then sleeps, while the Kafka client side dials successfully and then waits for the response of conn.ReadPartitions().

Now you can see from the debug output that it times out. If you don't update the code, you will not see this debug message, because the Kafka client is still waiting...

DEBU[2023-12-05T16:34:49+08:00] client: {"name":"Kafka Native Client node0","endpoint":"127.0.0.1:9092","time":"2023-12-05T08:34:39.568444Z","timestamp":1701765279568,"rtt":10001067000,"status":"down","prestatus":"down","message":"Error (client/kafka): read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout","latestdowntime":"2023-12-05T06:58:59.947317Z","recoverytime":100015423000,"stat":{"since":"2023-12-05T06:33:25.616857Z","total":325,"status":{"1":230,"2":95},"uptime":1150000000000,"downtime":475000000000,"StatusHistory":[{"Status":false,"Message":"read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout"}],"MaxLen":1,"CurrentStatus":false,"StatusCount":1,"alert":{"strategy":"regular","factor":1,"max":1,"notified":55,"failed":55,"next":2,"interval":1}}} 

"message":"Error (client/kafka): read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout" in new code.

shengbinxu commented 11 months ago

I have built the latest code from the main branch and deployed it online. I look forward to the follow-up results.